Applying physiologically-motivated models of auditory processing to automatic speech recognition
Abstract
For many years the human auditory system has been an inspiration for developers of automatic speech recognition systems because of its ability to interpret speech accurately in a wide variety of difficult acoustical environments. This paper discusses the application of physiologically-motivated approaches to signal processing that facilitate robust automatic speech recognition in environments with additive noise and reverberation. We review selected aspects of auditory processing that are believed to be especially relevant to speech perception, “classic” auditory models of the 1980s, the application of contemporary auditory-based signal processing approaches to practical automatic speech recognition systems, and the impact of these models on speech recognition accuracy in degraded acoustical environments.