Using a deep neural network to speed up a model of loudness for time-varying sounds
Keywords: loudness, deep neural network
The “time-varying loudness (TVL)” model calculates “instantaneous loudness” every 1 ms. This is used to generate predictions of short-term loudness, the loudness of a short segment of sound such as a word in a sentence, and of long-term loudness, the loudness of a longer segment of sound such as a whole sentence. The calculation of instantaneous loudness is computationally intensive, which makes real-time implementation of the TVL model difficult. To speed up the computation, a deep neural network (DNN) was trained to predict instantaneous loudness using a large database of speech sounds and artificial sounds (tones alone and tones in white or pink noise), with the predictions of the TVL model as a reference (providing the “correct” answer, specifically the loudness level in phons). A multilayer perceptron with three hidden layers was found to be sufficient; more complex DNN architectures did not yield higher accuracy. After training, the deviations between the predictions of the TVL model and those of the DNN were typically less than 0.5 phons, even for types of sounds that were not used for training (music, rain, animal sounds, washing machine). The DNN calculates instantaneous loudness more than 100 times faster than the TVL model.
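As a rough illustration of the kind of network described above, the following sketch builds a multilayer perceptron with three hidden layers that maps one input frame to a single loudness-level estimate in phons, and computes the largest absolute deviation from a reference. The abstract specifies only the three hidden layers and the phon-valued target; the input features, the layer widths, the ReLU activation, the weight initialisation, and all function names here are assumptions made for illustration, and the weights are untrained, so the output value carries no meaning.

```python
import random


def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random weights (assumed He-style scaling) and zero biases."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_mlp(n_inputs, hidden_sizes, rng):
    """Stack the hidden layers plus one scalar output layer."""
    sizes = [n_inputs, *hidden_sizes, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def predict_loudness_level(layers, features):
    """Forward pass: ReLU on hidden layers, linear output (phons)."""
    activations = features
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return dense(activations, weights, biases)[0]


def max_abs_deviation(reference_phons, predicted_phons):
    """Largest absolute difference between reference and prediction."""
    return max(abs(r - p) for r, p in zip(reference_phons, predicted_phons))


rng = random.Random(0)
N_FEATURES = 16                      # hypothetical input size per 1-ms frame
layers = build_mlp(N_FEATURES, [32, 32, 32], rng)  # three hidden layers
frame = [rng.random() for _ in range(N_FEATURES)]  # placeholder features
estimate = predict_loudness_level(layers, frame)   # untrained, not meaningful
```

In a trained system of this kind, `max_abs_deviation` (or a similar error measure) would be evaluated against the TVL model's output on held-out sounds; the 0.5-phon figure quoted above refers to the authors' network, not to this sketch.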