Modeling auditory and auditory-visual speech intelligibility: Challenges and possible solutions

Authors

  • Ken W. Grant, Army Audiology and Speech Center, Walter Reed Army Medical Center, Washington, DC, USA
  • Joshua G. W. Bernstein, Army Audiology and Speech Center, Walter Reed Army Medical Center, Washington, DC, USA
  • Elena Grassi, Army Audiology and Speech Center, Walter Reed Army Medical Center, Washington, DC, USA

Abstract

Models of speech intelligibility (e.g., the Speech Intelligibility Index and the Speech Transmission Index) have proven useful in a number of applied (e.g., algorithm development) and theoretical (e.g., theories of speech perception) applications. However, in many real-world situations these models fail to predict speech intelligibility accurately due to the complex nature of the soundscape (e.g., competing talkers), particular attributes of the listener/talker combination (e.g., speaking rate, age, and hearing loss), and the presentation modality (auditory or auditory-visual). This paper discusses several of these challenges and recent efforts to address them. Particular attention is paid to our efforts to model auditory-visual speech intelligibility. Current models of speech intelligibility base their predictions on characteristics of the acoustic speech signal, background noise, and reverberation. However, because visual speech cues are not included in these models, they provide a poor prediction of speech intelligibility in many everyday environments. To address this particular challenge, we describe a method for integrating visual and acoustic speech cues into a unified model of speech intelligibility. Kinematic motion of a talker's face during speech production is combined with the acoustic speech signal processed by a computational multi-channel model of peripheral auditory analysis. The outputs of the peripheral model are integrated with the visual signal in a weighted fashion based on the degree to which the visual kinematics are predictive of the acoustic envelopes derived from each frequency channel, yielding an enhanced acoustic signal, especially in the mid-to-high frequencies. Enhanced and unmodified noisy speech signals are then processed through a cortical model that extracts critical speech modulations to compute a spectro-temporal modulation index (STMI), yielding predictions for auditory and auditory-visual speech presented in steady-state noise.
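The channel-weighting step described above can be illustrated with a short sketch. The code below is a minimal illustration, not the authors' implementation: the peripheral auditory model is stood in for by a simple band-pass filterbank, the facial kinematics are assumed to arrive as a few feature trajectories sampled at a common frame rate, and each channel's weight is taken to be the squared correlation between its acoustic envelope and a linear prediction of that envelope from the kinematics. All names, rates, and channel counts are illustrative assumptions (Python, NumPy/SciPy).

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

FS = 16000        # audio sampling rate in Hz (assumed)
N_CHANNELS = 20   # number of peripheral frequency channels (assumed)
FRAME_FS = 100    # shared frame rate for envelopes and kinematics in Hz (assumed)

def bandpass_filterbank(x, fs=FS, n_channels=N_CHANNELS):
    """Crude stand-in for the multi-channel peripheral model: band-pass
    filters on a logarithmic frequency spacing. Returns (n_channels, len(x))."""
    edges = np.geomspace(100.0, 0.9 * fs / 2, n_channels + 1)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(2, [lo, hi], btype="band", fs=fs, output="sos")
        bands.append(sosfiltfilt(sos, x))
    return np.array(bands)

def channel_envelopes(bands, fs=FS, frame_fs=FRAME_FS):
    """Hilbert envelope of each channel, downsampled to the frame rate."""
    env = np.abs(hilbert(bands, axis=1))
    return env[:, ::int(fs / frame_fs)]

def visual_prediction_weights(envelopes, kinematics):
    """Regress each channel's acoustic envelope on the facial kinematic
    features; the squared correlation (R^2) of the fit serves as that
    channel's weight. Channels whose envelopes are well predicted by the
    visible articulators (typically mid-to-high frequencies) get the
    largest weights."""
    T = min(envelopes.shape[1], kinematics.shape[1])
    X = np.vstack([kinematics[:, :T], np.ones(T)]).T  # predictors + intercept
    weights, predictions = [], []
    for env in envelopes[:, :T]:
        beta, *_ = np.linalg.lstsq(X, env, rcond=None)
        est = X @ beta
        r = np.corrcoef(est, env)[0, 1]
        weights.append(max(r, 0.0) ** 2)
        predictions.append(est)
    return np.array(weights), np.array(predictions)

def enhance_envelopes(noisy_env, visual_pred, weights):
    """Blend each noisy channel envelope with its visually predicted version
    in proportion to the channel's weight, mimicking the weighted
    integration described in the abstract."""
    T = min(noisy_env.shape[1], visual_pred.shape[1])
    w = weights[:, None]
    return (1.0 - w) * noisy_env[:, :T] + w * visual_pred[:, :T]

In a sketch like this, the weights would be fit on clean speech paired with the facial motion and then applied to the noisy channel envelopes before the cortical (STMI) stage; the actual weighting scheme, peripheral model, and kinematic features used by the authors are described in the full paper.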

References

ANSI (1969). Methods for the calculation of the articulation index, S3.5 (American National Standards Institute, New York).

ANSI (1997). Methods for calculation of the speech intelligibility index, S3.5 (American National Standards Institute, New York).

Bernstein, J. G. W., and Grant, K. W. (this volume). “Frequency importance functions for audiovisual speech and complex noise backgrounds,” in Proceedings of the International Conference of Auditory and Audiological Research, Helsingør, Denmark, August, 2007.

Berthommier, F. (2004). “A phonetically neutral model of the low-level audio-visual interaction,” Speech Comm. 44, 31-41.

Boothroyd, A., and Nittrouer, S. (1988). “Mathematical treatment of context effects in phoneme and word recognition,” J. Acoust. Soc. Am. 84, 101-114.

Braida, L. D. (1991). “Crossmodal integration in the identification of consonant segments,” Quarterly J. Exp. Psych. 43, 647-677.

Bregman, A. S. (1990). “Auditory scene analysis: The perceptual organization of sound,” (MIT Press, Cambridge, MA).

Brungart, D. S. (2001). “Informational and energetic masking effects in the perception of two simultaneous talkers,” J. Acoust. Soc. Am. 109, 1101-1109.

Chi, T., Gao, Y., Guyton, M. C., Ru, P., and Shamma, S. (1999). “Spectro-temporal modulation transfer functions and speech intelligibility,” J. Acoust. Soc. Am. 106, 2719-2732.

Elhilali, M., Chi, T., and Shamma, S. A. (2003). “A spectro-temporal modulation index (STMI) for assessment of speech intelligibility,” Speech Comm. 41, 331-348.

Festen, J. M., and Plomp, R. (1990). “Effects of fluctuating noise and interfering speech on the speech-reception threshold for impaired and normal hearing,” J. Acoust. Soc. Am. 88, 1725-1736.

French, N. R., and Steinberg, J. C. (1947). “Factors governing the intelligibility of speech sounds,” J. Acoust. Soc. Am. 19, 90-119.

Garofolo, J. (1988). “Getting Started with the DARPA TIMIT CD-ROM: An Acoustic Phonetic Continuous Speech Database,” National Institute of Standards and Technology (NIST), Gaithersburg, MD.

George, E. L. J., Festen, J. M., and Houtgast, T. (2006). “Factors affecting masking release for speech in modulated noise for normal-hearing and hearing-impaired listeners,” J. Acoust. Soc. Am., 120(4), 2295-2311.

Girin, L., Schwartz, J.-L., and Feng, G. (2001). “Audio-visual enhancement of speech in noise,” J. Acoust. Soc. Am. 109, 3007-3020.

Glasberg, B. R., and Moore, B. C. J. (1989). “Psychoacoustic abilities of subjects with unilateral and bilateral cochlear impairments and their relationship to the ability to understand speech,” Scand. Audiol. Suppl., 32, 1-25.

Grant, K. W. (2001). “The effect of speechreading on masked detection thresholds for filtered speech,” J. Acoust. Soc. Am. 109, 2272-2275.

Grant, K. W. (2006). “Frequency-band importance functions for auditory and auditory-visual speech by hearing-impaired adults,” J. Acoust. Soc. Am. 119, 3416-3417.

Grant, K. W., and Bernstein, J. G. W. (2007). “Frequency band-importance functions for auditory and auditory-visual sentence recognition,” J. Acoust. Soc. Am. 121, 3044 (A).

Grant, K. W., Elhilali, M., Shamma, S. A., Walden, B. E., Surr, R. K., Cord, M. T., and Summers, V. (in press). “An objective measure for selecting microphone modes in OMNI/DIR hearing-aid circuits,” Ear Hear.

Grant, K. W., and Seitz, P. (2000). “The use of visible speech cues for improving auditory detection of spoken sentences,” J. Acoust. Soc. Am. 108, 1197-1208.

Grant, K. W., and Walden, B. E. (1996). “Evaluating the articulation index for auditory-visual consonant recognition,” J. Acoust. Soc. Am. 100, 2415-2424.

IEEE (1969). IEEE recommended practice for speech quality measures (Institute of Electrical and Electronic Engineers, New York).

Lopez-Poveda, E. A., and Meddis, R. (2001). “A human nonlinear cochlear filterbank,” J. Acoust. Soc. Am. 110, 3107-3118.

Massaro, D. W. (1998). “Perceiving talking faces: From speech perception to a behavioral principle,” (MIT Press, Cambridge, MA).

Oxenham, A. J., and Dau, T. (2001). “Modulation detection interference: Effect of concurrent and sequential streaming,” J. Acoust. Soc. Am., 110, 402-408.

Plomp, R. (1978). “Auditory handicap of hearing impairment and the limited benefit of hearing aids,” J. Acoust. Soc. Am. 63, 533-549.

Rhebergen, K. S., Versfeld, N. J., and Dreschler, W. A. (2006). “Extended speech intelligibility index for the prediction of the speech reception threshold in fluctuating noise,” J. Acoust. Soc. Am. 120, 3988-3997.

Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). “Speech recognition with primarily temporal cues,” Science 270, 303-304.

Steeneken, H. J. M., and Houtgast, T. (1980). “A physical method for measuring speech-transmission quality,” J. Acoust. Soc. Am. 67, 318-326.

Summers, V., Makashay, M. J., Grassi, E., Grant, K. W., Bernstein, J. G. W., Walden, B. E., Leek, M. R., and Molis, M. R. (this volume). “Toward an individual-specific model of impaired speech intelligibility,” in Proceedings of the International Conference of Auditory and Audiological Research, Helsingør, Denmark, August, 2007.

Yost, W. A., and Sheft, S. (1994). “Modulation detection interference: Across-frequency processing and auditory grouping,” Hear. Res., 79, 48-58.

Zilany, M. S. A., and Bruce, I. C. (2007). “Predictions of speech intelligibility with a model of the normal and impaired auditory-periphery,” in Proceedings of the 3rd International IEEE/EMBS Conference on Neural Engineering, Kohala Coast, HI, 481-485.

Published

2007-12-15

How to Cite

Grant, K. W., Bernstein, J. G. W., & Grassi, E. (2007). Modeling auditory and auditory-visual speech intelligibility: Challenges and possible solutions. Proceedings of the International Symposium on Auditory and Audiological Research, 1, 47–58. Retrieved from https://proceedings.isaar.eu/index.php/isaarproc/article/view/2007-05

Section

2007/1. Auditory signal processing and perception