The next generation of audio intelligence: A survey-based perspective on improving audio analysis

  • Björn Schuller GLAM – Group on Language, Audio & Music, Imperial College London, SW7 2AZ London, UK; Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, 86159 Augsburg, Germany; audEERING GmbH, 82205 Gilching, Germany
  • Shahin Amiriparian Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, 86159 Augsburg, Germany
  • Gil Keren Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, 86159 Augsburg, Germany
  • Alice Baird Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, 86159 Augsburg, Germany
  • Maximilian Schmitt Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, 86159 Augsburg, Germany
  • Nicholas Cummins Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, 86159 Augsburg, Germany
Keywords: Computer Audition, Audio Intelligence, Survey, Auditory Scene Analysis, Source Separation, Audio Ontologies, Audio Diarisation, Audio Understanding

Abstract

Computer audition has made major progress over the past decades; however, it is still far from achieving human-level hearing abilities. Imagine, for example, the sounds associated with putting a water glass onto a table. As humans, we would be able to roughly “hear” the material of the glass, the table, and perhaps even how full the glass is. Current machine listening approaches, on the other hand, would mainly recognise the event of “glass put onto a table”. In this context, this contribution aims to provide key insights into the remarkable advances already made in computer audition. It also identifies deficits in reaching human-like hearing abilities, such as in the given example. We summarise the state of the art in traditional signal-processing-based audio pre-processing and feature representation, as well as automated learning, such as by deep neural networks. This concerns, in particular, audio diarisation, source separation, and understanding, but also ontologisation. On this basis, we conclude with avenues towards reaching the ambitious goal of “holistic human-parity” machine listening abilities – the next generation of audio intelligence.
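As a toy illustration of the signal-processing-based feature representations the abstract refers to, the sketch below computes a log-power spectrogram with plain NumPy. The frame length, hop size, and the 440 Hz test tone are arbitrary choices for the example, not values from the paper:

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160, eps=1e-10):
    """Frame the signal, apply a Hann window, and return the
    log-power spectrogram -- a classic feature representation
    feeding both traditional and deep audio models."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(spectrum + eps)

# Example: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
feats = log_spectrogram(tone)
print(feats.shape)  # (98, 201): 98 frames x 201 frequency bins
```

A deep neural network for, e.g., diarisation or source separation would typically consume such time-frequency features (or, more recently, the raw waveform itself) as its input.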

Published
2020-04-24
How to Cite
Schuller, B., Amiriparian, S., Keren, G., Baird, A., Schmitt, M., & Cummins, N. (2020). The next generation of audio intelligence: A survey-based perspective on improving audio analysis. Proceedings of the International Symposium on Auditory and Audiological Research, 7, 101-112. Retrieved from https://proceedings.isaar.eu/index.php/isaarproc/article/view/2019-13
Section
2019/3. Machine listening and intelligent auditory signal processing