Prediction of speech intelligibility with DNN-based performance measures
In this paper, we present a speech intelligibility model based on automatic speech recognition (ASR) that combines phoneme probabilities obtained from a deep neural network with a performance measure that estimates the word error rate from these probabilities. In contrast to previous modeling approaches, this model requires neither the clean speech reference nor the exact word labels at test time and therefore needs less a priori information. The model is evaluated via the root-mean-square error between the predicted speech reception thresholds and those observed for eight normal-hearing listeners. For both the model and the listeners, the recognition task consists of identifying noisy words from a German matrix sentence test. The speech material was mixed with four noise maskers covering different types of modulation. The prediction performance is compared to that of four established models as well as to the ASR model using word labels. The proposed model performs almost as well as the label-based model and, on average, produces more accurate predictions than the baseline models.
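The evaluation metric named above can be illustrated with a minimal sketch. It assumes one predicted and one observed speech reception threshold (SRT, in dB SNR) per noise masker; the function name `rmse` and all numeric values are hypothetical placeholders chosen for illustration and are not results from the paper.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between paired values (e.g., SRTs in dB SNR)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Placeholder SRTs, one per noise masker. These are NOT data from the paper.
predicted_srt = [-7.0, -9.0, -15.0, -20.0]
observed_srt = [-8.0, -9.0, -17.0, -18.0]

print(rmse(predicted_srt, observed_srt))  # 1.5 for these placeholder values
```

A lower value means the predicted thresholds lie closer to the listeners' thresholds, which is the sense in which the abstract compares the proposed model with the baseline models.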