Prediction of speech intelligibility with DNN-based performance measures

  • Angel Mario Castro Martinez, Medizinische Physik and Cluster of Excellence Hearing4all, Carl von Ossietzky Universität Oldenburg, Germany
  • Constantin Spille, Medizinische Physik and Cluster of Excellence Hearing4all, Carl von Ossietzky Universität Oldenburg, Germany
  • Birger Kollmeier, Medizinische Physik and Cluster of Excellence Hearing4all, Carl von Ossietzky Universität Oldenburg, Germany
  • Bernd T. Meyer, Medizinische Physik and Cluster of Excellence Hearing4all, Carl von Ossietzky Universität Oldenburg, Germany
Keywords: Speech Intelligibility, Automatic Speech Recognition, Performance Measures, Deep Learning

Abstract

In this paper, we present a speech intelligibility model based on automatic speech recognition (ASR) that combines phoneme probabilities obtained from a deep neural network with a performance measure that estimates the word error rate from these probabilities. In contrast to previous modeling approaches, this model does not require the clean speech reference or the exact word labels at test time and therefore needs less a priori information. The model is evaluated via the root mean squared error between the predicted and observed speech reception thresholds from eight normal-hearing listeners. The recognition task for both the model and the listeners consists of identifying noisy words from a German matrix sentence test. The speech material was mixed with four noise maskers covering different types of modulation. The prediction performance is compared to four established models as well as to the ASR model using word labels. The proposed model performs almost as well as the label-based model and produces more accurate predictions than the baseline models on average.
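The evaluation pipeline described above can be sketched in a few lines. The snippet below is a hypothetical illustration, not the authors' implementation: it uses the mean per-frame entropy of DNN phoneme posteriors as a stand-in for a label-free performance measure (the paper's actual measure estimates word error rates from the posteriors), and computes the root mean squared error between predicted and observed speech reception thresholds (SRTs). All function names and the entropy-based measure are assumptions for the sake of the sketch.

```python
import numpy as np

def mean_posterior_entropy(posteriors):
    """Label-free degradation measure (hypothetical stand-in):
    mean per-frame entropy of DNN phoneme posteriors, shaped
    (frames, phonemes). Higher entropy indicates less confident
    acoustics and, plausibly, a higher expected word error rate."""
    p = np.clip(posteriors, 1e-12, 1.0)          # avoid log(0)
    frame_entropy = -np.sum(p * np.log(p), axis=1)
    return float(np.mean(frame_entropy))

def rmse(predicted_srts, observed_srts):
    """Root mean squared error between predicted and observed
    speech reception thresholds, both in dB SNR."""
    d = np.asarray(predicted_srts, dtype=float) - np.asarray(observed_srts, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))

# Example: a maximally uncertain frame (uniform over 2 phonemes)
# has entropy ln(2); a one-hot frame has entropy near 0.
uncertain = np.array([[0.5, 0.5]])
confident = np.array([[1.0, 0.0]])
print(mean_posterior_entropy(uncertain))  # ~0.693
print(mean_posterior_entropy(confident))  # ~0.0
print(rmse([-7.0, -6.0], [-7.5, -6.5]))   # 0.5 dB
```

In the paper's setup, such a measure would be computed per signal-to-noise ratio, mapped to a predicted word recognition rate, and the SNR yielding 50% intelligibility taken as the predicted SRT before comparing against listener data.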

References

ANSI, S3.5-1997 (1997), “Methods for calculation of the speech intelligibility index,” American National Standards Institute.

Barker, J. and Cooke, M. (2006), “Modelling speaker intelligibility in noise,” Speech Commun.

Brand, T. and Kollmeier, B. (2002), “Efficient adaptive procedures for threshold and concurrent slope estimates for psychophysics and speech intelligibility tests,” J. Acoust. Soc. Am., 111(6), 2801–2810.

Castro Martinez, A. M., Gerlach, L., Payá-Vayá, G., Hermansky, H., Ooster, J., and Meyer, B. T. (2019), “DNN-based performance measures for predicting error rates in automatic speech recognition and optimizing hearing aid parameters,” Speech Commun., 106, 44–56.

Ewert, S. D. and Dau, T. (2000), “Characterizing frequency selectivity for envelope fluctuations,” J. Acoust. Soc. Am., 108(3), 1181–1196.

Hermansky, H., Variani, E., and Peddinti, V. (2013), “Mean temporal distance: Predicting ASR error from temporal properties of speech signal,” Proc. IEEE ICASSP, 7423–7426.

Holube, I., Fredelake, S., Vlaming, M., and Kollmeier, B. (2010), “Development and analysis of an International Speech Test Signal (ISTS).” Int. J. Audiol., 49(12), 891–903.

Huber, R., Krüger, M., and Meyer, B. T. (2018), “Single-ended prediction of listening effort using deep neural networks,” Hearing Res., 359, 40–49.

Kawahara, H., Morise, M., Takahashi, T., Nisimura, R., Irino, T., and Banno, H. (2008), “Tandem-straight: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation,” Proc. IEEE ICASSP, 3933–3936.

Kollmeier, B., Warzybok, A., Hochmuth, S., Zokoll, M. A., Uslar, V., Brand, T., and Wagener, K. C. (2015), “The multilingual matrix test: Principles, applications, and comparison across languages: A review,” Int. J. Audiol., 54(sup2), 3–16.

Moritz, N., Anemüller, J., and Kollmeier, B. (2015), “An auditory inspired amplitude modulation filter bank for robust feature extraction in automatic speech recognition,” IEEE Trans. Audio Speech Lang. Process., 23(11), 1926–1937.

Peddinti, V., Povey, D., and Khudanpur, S. (2015), “A time delay neural network architecture for efficient modeling of long temporal contexts,” Proc. Interspeech.

Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., and Vesely, K. (2011), “The Kaldi speech recognition toolkit,” Proc. IEEE ASRU.

Rhebergen, K. S. and Versfeld, N. J. (2005), “A speech intelligibility index-based approach to predict the speech reception threshold for sentences in fluctuating noise for normal-hearing listeners,” J. Acoust. Soc. Am., 117(4), 2181–2192.

Schädler, M. R., Warzybok, A., Hochmuth, S., and Kollmeier, B. (2015), “Matrix sentence intelligibility prediction using an automatic speech recognition system,” Int. J. Audiol., 1–8.

Schubotz, W., Brand, T., Kollmeier, B., and Ewert, S. D. (2016), “Monaural speech intelligibility and detection in maskers with varying amounts of spectro-temporal speech features,” J. Acoust. Soc. Am., 140(1), 524–540.

Spille, C., Ewert, S. D., Kollmeier, B., and Meyer, B. T. (2018), “Predicting speech intelligibility with deep neural networks,” Comput. Speech Lang., 48, 51–66.

Taal, C. H., Hendriks, R. C., Heusdens, R., and Jensen, J. (2011), “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Trans. Audio Speech Lang. Process., 19(7), 2125–2136.

Wagener, K., Brand, T., and Kollmeier, B. (1999), “Development and evaluation of a German sentence test part III: Evaluation of the Oldenburg sentence test,” Zeitschrift für Audiologie, 38, 86–95.

Published
2020-04-24
How to Cite
Castro Martinez, A., Spille, C., Kollmeier, B., & Meyer, B. (2020). Prediction of speech intelligibility with DNN-based performance measures. Proceedings of the International Symposium on Auditory and Audiological Research, 7, 113-124. Retrieved from https://proceedings.isaar.eu/index.php/isaarproc/article/view/2019-14
Section
2019/3. Machine listening and intelligent auditory signal processing