Feature-based audiovisual speech integration of multiple streams

  • Juan Camilo Gil-Carvajal Cognitive Systems, DTU Compute, Technical University of Denmark, DK-2800 Lyngby, Denmark; Hearing Systems, DTU Health Tech, Technical University of Denmark, DK-2800 Lyngby, Denmark
  • Jean-Luc Schwartz GIPSA-lab, Univ. Grenoble Alpes, CNRS, Grenoble INP
  • Torsten Dau Hearing Systems, DTU Health Tech, Technical University of Denmark, DK-2800 Lyngby, Denmark
  • Tobias Søren Andersen Cognitive Systems, DTU Compute, Technical University of Denmark, DK-2800 Lyngby, Denmark
Keywords: Audiovisual, Speech perception, McGurk effect, Perceived consonant order

Abstract

Speech perception often involves the integration of auditory and visual information. This is demonstrated by the McGurk effect, in which a visual utterance, e.g., /ipi/, dubbed onto an acoustic utterance, e.g., /iki/, produces a combination percept, e.g., /ipki/. However, it remains unclear how phonetic features are integrated audiovisually. Here, we studied audiovisual speech perception by decomposing the auditory component of McGurk combinations into two streams. We show that auditory /i_i/, where the underscore indicates an intersyllabic silence, dubbed onto visual /ipi/ produces a strong illusion of hearing /ipi/. We also show that adding an acoustic release burst to /i_i/ creates a percept of /iki/. An auditory continuum was created by stepwise temporal alignments of the release burst with /i_i/. When dubbed onto /ipi/, this continuum was perceived mostly as the visually driven response /ipi/ when the burst overlapped with either acoustic vowel. Other temporal alignments frequently produced combination responses: mostly /ikpi/ combinations when the burst was closer to the initial vowel, and reverse /ipki/ responses when it was closer to the final vowel. These results are indicative of feature-based audiovisual integration, in which burst and aspiration are sufficient cues for the consonant /k/, while the perception of /p/ depends on place information in the visual stream.

Published
2020-04-24
How to Cite
Gil-Carvajal, J., Schwartz, J.-L., Dau, T., & Andersen, T. (2020). Feature-based audiovisual speech integration of multiple streams. Proceedings of the International Symposium on Auditory and Audiological Research, 7, 333-340. Retrieved from https://proceedings.isaar.eu/index.php/isaarproc/article/view/2019-38
Section
2019/5. Other topics in auditory and audiological research