Applying physiologically-motivated models of auditory processing to automatic speech recognition
Abstract
For many years the human auditory system has been an inspiration for developers of automatic speech recognition systems because of its ability to interpret speech accurately in a wide variety of difficult acoustical environments. This paper discusses the application of physiologically-motivated approaches to signal processing that facilitate robust automatic speech recognition in environments with additive noise and reverberation. We review selected aspects of auditory processing that are believed to be especially relevant to speech perception, “classic” auditory models of the 1980s, the application of contemporary auditory-based signal processing approaches to practical automatic speech recognition systems, and the impact of these models on speech recognition accuracy in degraded acoustical environments.