Using a deep neural network to speed up a model of loudness for time-varying sounds

  • Josef Schlittenlacher, Department of Experimental Psychology, University of Cambridge, Downing Street, Cambridge, CB2 3EB, UK
  • Richard Turner, Department of Engineering, University of Cambridge, Trumpington Street, Cambridge, CB2 1PZ, UK
  • Brian C. J. Moore, Department of Experimental Psychology, University of Cambridge, Downing Street, Cambridge, CB2 3EB, UK
Keywords: loudness, deep neural network

Abstract

The “time-varying loudness” (TVL) model calculates “instantaneous loudness” every 1 ms. This is used to generate predictions of short-term loudness (the loudness of a short segment of sound, such as a word in a sentence) and of long-term loudness (the loudness of a longer segment of sound, such as a whole sentence). The calculation of instantaneous loudness is computationally intensive, which makes real-time implementation of the TVL model difficult. To speed up the computation, a deep neural network (DNN) was trained to predict instantaneous loudness using a large database of speech sounds and artificial sounds (tones alone and tones in white or pink noise), with the predictions of the TVL model as the reference (providing the “correct” answer, specifically the loudness level in phons). A multilayer perceptron with three hidden layers was found to be sufficient; more complex DNN architectures did not yield higher accuracy. After training, the deviations between the predictions of the TVL model and those of the DNN were typically less than 0.5 phons, even for types of sounds that were not used for training (music, rain, animal sounds, washing machine). The DNN calculates instantaneous loudness more than 100 times faster than the TVL model.
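As a rough illustration of the kind of network the abstract describes, the sketch below builds a multilayer perceptron with three hidden layers that maps one frame of input features to a single loudness level in phons, and compares predictions with reference values. The abstract does not give the input representation, the layer widths, or the activation function, so the 64 input features, the 128-unit hidden layers, the ReLU activation, and all function names here are assumptions chosen for illustration. The weights are random and untrained, so the output is not a meaningful loudness value.

```python
import random


def init_mlp(layer_sizes, seed=0):
    """Create random weights and zero biases for a fully connected network."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        scale = (2.0 / n_in) ** 0.5  # He-style scaling, a common choice with ReLU
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def forward(params, features):
    """Map one feature vector to a single predicted loudness level (phons)."""
    activations = list(features)
    for index, (weights, biases) in enumerate(params):
        activations = [
            sum(w * a for w, a in zip(row, activations)) + b
            for row, b in zip(weights, biases)
        ]
        if index < len(params) - 1:  # ReLU on the hidden layers only (assumed)
            activations = [max(0.0, a) for a in activations]
    return activations[0]


def mean_abs_deviation(predicted, reference):
    """Average absolute difference in phons between DNN and reference values."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


# Hypothetical sizes: 64 input features, three hidden layers of 128 units, 1 output.
LAYER_SIZES = [64, 128, 128, 128, 1]
params = init_mlp(LAYER_SIZES)
frame = [0.1] * 64  # placeholder for the features of one 1-ms frame
prediction = forward(params, frame)
```

In a real setup the reference values passed to `mean_abs_deviation` would be the TVL model's loudness levels for the same frames, and the weights would be fitted to minimise the deviation from them.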

Published
2020-04-18
How to Cite
Schlittenlacher, J., Turner, R., & Moore, B. (2020). Using a deep neural network to speed up a model of loudness for time-varying sounds. Proceedings of the International Symposium on Auditory and Audiological Research, 7, 133-140. Retrieved from https://proceedings.isaar.eu/index.php/isaarproc/article/view/2019-16
Section
2019/3. Machine listening and intelligent auditory signal processing