Speech reception in noise: How much do we understand?

Authors

  • Birger Kollmeier, Medical Physics, Universität Oldenburg, D-26111 Oldenburg, Germany
  • Bernd Meyer, Medical Physics, Universität Oldenburg, D-26111 Oldenburg, Germany
  • Tim Jürgens, Medical Physics, Universität Oldenburg, D-26111 Oldenburg, Germany
  • Rainer Beutelmann, Medical Physics, Universität Oldenburg, D-26111 Oldenburg, Germany
  • Ralf M. Meyer, Medical Physics, Universität Oldenburg, D-26111 Oldenburg, Germany
  • Thomas Brand, Medical Physics, Universität Oldenburg, D-26111 Oldenburg, Germany

Abstract

In order to better understand the effect of hearing impairment on speech perception in everyday listening situations, as well as the still limited benefit of modern hearing instruments in these situations, a thorough understanding of the underlying mechanisms and factors influencing speech reception in noise is highly desirable. This contribution therefore reviews a series of studies by our group that model speech reception in normal-hearing and hearing-impaired listeners in a multidisciplinary approach using “classical” speech intelligibility models, functional perception models, automatic speech recognition (ASR) technology, as well as inputs from psycholinguistics. Classical speech-information-based models like the Articulation Index or the Speech Intelligibility Index (SII) describe the acoustical layer and yield accurate predictions only for average intelligibility scores and for a limited set of acoustical situations. With appropriate extensions they can model more audibility-driven and even time-dependent acoustical situations, such as the effect of hearing impairment in fluctuating noise. However, to describe the sensory layer and suprathreshold processing deficits in humans, the combination of a psychoacoustically motivated preprocessing model with a pattern recognition algorithm adopted from ASR technology appears advantageous. It allows a detailed analysis of phoneme confusions and of the “man-machine gap” of approximately 12 dB in SNR, i.e., the superiority of human world-knowledge-driven (top-down) speech pattern recognition over training-data-driven (bottom-up) machine learning approaches. Finally, the cognitive abilities of human listeners when understanding speech can be assessed by a “fair” comparison between human speech recognition and ASR that employs only a limited set of training data. In summary, both bottom-up and top-down strategies have to be assumed when trying to understand speech reception in noise. Computer models that assume a near-to-perfect “world knowledge”, i.e., anticipation of the speech unit to be recognized, predict the performance of human listeners in noise surprisingly well and may prove to be a useful tool in hearing aid development.

References

ANSI (1997). “Methods for Calculation of the Speech Intelligibility Index,” American National Standard S3.5-1997.

Beutelmann, R., and Brand, T. (2006). “Prediction of speech intelligibility in spatial noise and reverberation for normal-hearing and hearing-impaired listeners,” J. Acoust. Soc. Am. 120(1), 331-342.

Brand, T., and Kollmeier, B. (2002b). “Efficient adaptive procedures for threshold and concurrent slope estimates for psychophysics and speech intelligibility tests,” J. Acoust. Soc. Am. 111(6), 2801-2810.

Brand, T., and Kollmeier, B. (2002a). “Vorhersage der Sprachverständlichkeit in Ruhe und im Störgeräusch aufgrund des Reintonaudiogramms,” DGA 2002.

Dau, T. (1997). “Modeling auditory processing of amplitude modulation,” J. Acoust. Soc. Am. 101, 3061 (A).

Dau, T., Püschel, D., and Kohlrausch, A. (1996). “A quantitative model of the ‘effective’ signal processing in the auditory system: I. Model structure,” J. Acoust. Soc. Am. 99, 3615-3622.

Dau, T., Kollmeier, B., and Kohlrausch, A. (1997). “Modeling auditory processing of amplitude modulation: I. Detection and masking with narrow-band carriers,” J. Acoust. Soc. Am. 102, 2892-2905.

Demuynck, K., Garcia, O., and Van Compernolle, D. (2004). “Synthesizing Speech from Speech Recognition Parameters,” In Proc. ICSLP 2004, vol. II, 945-948.

Dreschler, W. A. (2001). “ICRA Noises: Artificial Noise Signals with Speech-like Spectral and Temporal Properties for Hearing Instrument Assessment,” Audiology 40, 148-157.

Dreschler, W. A., van Esch, T. E. M., and Sol, J. (2007). “Diagnosis of impaired speech perception by means of the Auditory Profile,” this volume.

Durlach, N. I. (1963). “Equalization and cancellation theory of binaural masking-level differences”, J. Acoust. Soc. Am. 35(8), p. 1206–1218.

Fletcher, H., and Galt, R. H. (1950). “The Perception of Speech and Its Relation to Telephony,” J. Acoust. Soc. Am. 22(2), 89-151.

Hohmann, V. (2002). “Frequency analysis and synthesis using a Gammatone filterbank,” Acta Acustica united with Acustica 88(3), 433-442.

Holube, I., and Kollmeier, B. (1996). “Speech intelligibility prediction in hearing-impaired listeners based on a psychoacoustically motivated perception model,” J. Acoust. Soc. Am. 100(3), 1703-1716.

Houtgast, T., and Steeneken, H. J. M. (1985). “A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria,” J. Acoust. Soc. Am. 77(3), 1069-1077.

Jürgens, T., Brand, T., and Kollmeier, B. (2007). “Modelling the Human-Machine Gap in Speech Reception: Microscopic Speech Intelligibility Prediction for Normal-Hearing Subjects with an Auditory Model,” In Proc. Interspeech 2007, Antwerp, Belgium.

Kleinschmidt, M. (2003). “Localized spectro-temporal features for automatic speech recognition,” Proc. Eurospeech/Interspeech, Geneva, 2003.

Kollmeier, B. (1990). “Meßmethodik, Modellierung und Verbesserung der Verständlichkeit von Sprache,” Habilitationsschrift, Universität Göttingen.

Larsby, B., Hällgren, M., Lyxell, B., Arlinger, S. (2005). “Cognitive performance and perceived effort in speech processing tasks: effects of different noise backgrounds in normal-hearing and hearing-impaired subjects,” Int. J. Audiol. 44(3), 131-143.

Lippmann, R. P. (1997). “Speech recognition by machines and humans,” Speech Communication 22(1), 1-15.

Meyer, B., Wächter, M., Brand, T., and Kollmeier, B. (2007). “Phoneme confusions in human and automatic speech recognition,” In Proc. Interspeech 2007, Antwerp, Belgium.

Meyer, R., and Brand, T. (2007). “Prediction of speech intelligibility in fluctuating noise,” in: EFAS/DGA 2007, Heidelberg (in press).

Pavlovic, C. V. (1984). “Use of the articulation index for assessing residual auditory function in listeners with sensorineural hearing impairment,” J. Acoust. Soc. Am. 75(4), 1253–1258.

Payton, K.L., Braida, L.D. (1999). ”A method to determine the speech transmission index from speech waveforms,” J. Acoust. Soc. Am. 106(6), 3637-3648.

Plomp, R. (1986). “A Signal-to-Noise Ratio Model for the Speech-Reception Threshold of the Hearing Impaired,” J. Speech Hear. Res. 29, 146-154.

Rankovic, C. M. (1997). “Prediction of Speech Reception by Listeners With Sensorineural Hearing Loss,” in: Jesteadt, W. (ed.), Modeling Sensorineural Hearing Loss, Lawrence Erlbaum Associates, Mahwah, NJ.

Rhebergen, K., and Versfeld, N. (2005). “A Speech Intelligibility Index-based approach to predict the speech reception threshold for sentences in fluctuating noise for normal-hearing listeners,” J. Acoust. Soc. Am. 117, 2181-2192.

Sakoe, H., and Chiba, S. (1978). “Dynamic Programming Algorithm Optimization for Spoken Word Recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-26(1), 43-49.

Scharenborg, O. (2005). “Narrowing the gap between automatic and human word recognition,” Ph.D. thesis, Radboud University Nijmegen, September 16th, 2005.

Sroka, J. J., and Braida, L. D. (2005). “Human and Machine Consonant Recognition,” Speech Communication 45, 401-423.

Vom Hövel, H. (1984). “Zur Bedeutung der Übertragungseigenschaften des Außenohrs sowie des binauralen Hörsystems bei gestörter Sprachübertragung,” Dissertation, Fakultät für Elektrotechnik, RWTH Aachen.

Wagener, K., Brand, T., Kühnel, V., and Kollmeier, B. (1999). “Entwicklung und Evaluation eines Satztestes für die deutsche Sprache I-III: Design, Optimierung und Evaluation des Oldenburger Satztestes,” Zeitschrift für Audiologie 38(1-3).

Wagener, K., and Brand, T. (2005). “Sentence intelligibility in noise for listeners with normal hearing and hearing impairment: influence of measurement procedure and masking parameters,” Int. J. Audiol. 44(3), 144-156.

Wagener, K., Brand, T., and Kollmeier, B. (2006). “The role of silent intervals for sentence intelligibility in fluctuating noise in hearing-impaired listeners,” Int. J. Audiol. 45(1), 26-33.

Wagener, K. C., Brand, T., and Kollmeier, B. (2007). “International cross-validation of sentence intelligibility tests,” EFAS Meeting 2007, Heidelberg (in press).

Warzybok, A., Wagener, K. C., and Brand, T. (2007). “Intelligibility of German digit triplets for non-native German listeners,” EFAS Meeting 2007, Heidelberg (in press).

Wesker, T., Meyer, B., Wagener, K., Anemüller, J., Mertins, A., and Kollmeier, B. (2005). “Oldenburg Logatome Speech Corpus (OLLO) for Speech Recognition Experiments with Humans and Machines,” In Proc. Interspeech 2005, Lisbon, Portugal, 1273-1276.

Zurek, P. M. (1990). “Binaural advantages and directional effects in speech intelligibility,” in Acoustical Factors Affecting Hearing Aid Performance, 2nd ed., edited by G. A. Studebaker and I. Hochberg, Allyn and Bacon, London, Chap. 15, 255-276.

Published

2007-12-15

How to Cite

Kollmeier, B., Meyer, B., Jürgens, T., Beutelmann, R., Meyer, R. M., & Brand, T. (2007). Speech reception in noise: How much do we understand? Proceedings of the International Symposium on Auditory and Audiological Research, 1, 335–350. Retrieved from https://proceedings.isaar.eu/index.php/isaarproc/article/view/2007-32

Section

2007/4. Speech perception and processing