Data-driven mask generation for source separation

Authors

  • Nilesh Madhu, ExpORL, Dept. Neurosciences, Katholieke Universiteit Leuven, B-3000 Leuven, Belgium

Abstract

Presented is a microphone-array based approach for the extraction of a target signal from a mixture of competing sources and background noise. The approach builds upon a recent proposal for source localization and tracking in the general M-microphone, Q-source case, and extends it into a versatile framework for source separation using data-driven soft or hard masks. The proposed approach is applicable to arbitrary array geometries, allowing for its integration into binaural hearing aids. The advantage of the proposed mask generation, in contrast to current algorithms, is its implicit scalability with respect to M, Q, source spread and the amount of reverberation, obviating the need for a heuristic adaptation of the mask generation algorithm to different acoustical scenarios. Further, the individual signals extracted using these soft masks exhibit low levels of musical noise. Additional mask smoothing may be performed to reduce the musical noise phenomenon further, thereby improving the listening experience.
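To illustrate the general idea of data-driven time-frequency masking described in the abstract, the sketch below builds a soft mask for a two-microphone mixture from interchannel phase differences and applies simple temporal smoothing before resynthesis. This is a minimal, hypothetical example, not the paper's algorithm: the source directions, microphone spacing, the exponential assignment sharpness `beta`, and the first-order recursive smoothing (a crude stand-in for the cepstral-domain smoothing of Madhu et al., 2008) are all assumptions made for illustration only.

```python
# Hypothetical sketch (not the paper's exact algorithm): soft time-frequency
# masking for a two-microphone mixture, with simple temporal mask smoothing.
import numpy as np
from scipy.signal import stft, istft

def soft_masks(X1, X2, freqs, doas_rad, mic_dist=0.15, c=343.0, beta=5.0):
    """Soft masks from interchannel phase differences.

    X1, X2   : STFTs of the two microphone signals (freq x frames)
    freqs    : frequency axis in Hz
    doas_rad : assumed/estimated source directions of arrival (radians)
    mic_dist : microphone spacing in metres (assumed geometry)
    beta     : sharpness of the soft assignment (tuning parameter)
    """
    # Observed interchannel phase difference in each time-frequency bin
    ipd = np.angle(X1 * np.conj(X2))
    scores = []
    for theta in doas_rad:
        # Phase difference predicted by a far-field plane wave from direction theta
        tau = mic_dist * np.cos(theta) / c
        model = 2.0 * np.pi * freqs[:, None] * tau
        # Wrapped phase error between observation and model
        err = np.angle(np.exp(1j * (ipd - model)))
        scores.append(np.exp(-beta * err ** 2))
    scores = np.stack(scores)                       # (Q, freq, frames)
    scores /= np.maximum(scores.sum(axis=0), 1e-12)
    return scores                                   # soft masks summing to 1 per bin

def smooth_mask(mask, alpha=0.7):
    """First-order recursive smoothing over frames (a simple stand-in for
    cepstral-domain mask smoothing)."""
    out = np.empty_like(mask)
    out[:, 0] = mask[:, 0]
    for n in range(1, mask.shape[1]):
        out[:, n] = alpha * out[:, n - 1] + (1.0 - alpha) * mask[:, n]
    return out

# Usage sketch with placeholder data: two microphones, two assumed directions
fs = 16000
x1 = np.random.randn(fs)                # placeholder microphone signals
x2 = np.random.randn(fs)
f, t, X1 = stft(x1, fs, nperseg=512)
_, _, X2 = stft(x2, fs, nperseg=512)
M = soft_masks(X1, X2, f, doas_rad=np.deg2rad([60.0, 120.0]))
target = smooth_mask(M[0]) * X1         # extract the first source from mic 1
_, y = istft(target, fs, nperseg=512)
```

In the paper's framework, the per-bin source assignment would come from the data-driven localization and tracking stage rather than from fixed, assumed directions as in this sketch.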

References

ANSI-S3.5 (1997), American national standard methods for the calculation of the speech intelligibility index (American National Standards Institute, New York).

Bilmes, J. A. (1998), “A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models,” Tech. Rep. TR-97-021, U.C. Berkeley.

Bodden, M. (1992), “Binaurale Signalverarbeitung: Modellierung der Richtungserkennung und des Cocktail-Party-Effektes,” Ph.D. thesis, Institute of Communication Acoustics, Ruhr-Universität Bochum.

Breithaupt, C., Madhu, N., Hummes, F., and Martin, R. (2005), “A robust steerable realtime multichannel noise reduction system,” in 2005 Joint Workshop on Hands-Free Speech Communication and Microphone Arrays (HSCMA2005).

Faber, V. (1994), “Clustering and the continuous k-means algorithm,” URL http://www.fas.org/sgp/othergov/doe/lanl/pubs/00412967.pdf.

Liu, C., Wheeler, B. C., O’Brien, Jr., W. D., Bilger, R. C., Lansing, C. R., Jones, D. L., and Feng, A. S. (2001), “A two-microphone dual delay-line approach for extraction of a speech sound in the presence of multiple interferers,” J. Acoust. Soc. Am. 110(6), 3218–3231.

Madhu, N., Breithaupt, C., and Martin, R. (2008), “Temporal smoothing of spectral masks in the cepstral domain for speech separation,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Madhu, N. and Martin, R. (2008), “A scalable framework for multiple speaker localization and tracking,” in Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC) (Seattle, USA).

McLachlan, G. and Peel, D. (2007), “Mixture models and neural networks for clustering,” URL http://en.scientificcommons.org/43159010.

Rickard, S. and Yilmaz, Ö. (2002), “On the approximate W-Disjoint orthogonality of speech,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Roman, N., Wang, D., and Brown, G. (2003), “Speech segregation based on sound localization,” J. Acoust. Soc. Am. 114(4), 2236–2252.

Tashev, I. and Acero, A. (2006), “Microphone array post-processing using instantaneous direction of arrival,” in Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC).

van Trees, H. L. (2002), Detection, Estimation and Modulation Theory, Part IV (John Wiley and Sons).

Wang, D. (2008), “Time–frequency masking for speech separation and its potential for hearing aid design,” Trends in Amplification, 332–353.

Yilmaz, Ö., Jourjine, A., and Rickard, S. (2000), “Blind separation of disjoint orthogonal signals: Demixing N sources from two mixtures,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Yilmaz, Ö., Jourjine, A., and Rickard, S. (2004), “Blind separation of speech mixtures via time-frequency masking,” IEEE Transactions on Signal Processing 52.

Yoon, B.-J., Tashev, I., and Acero, A. (2007), “Robust adaptive beamforming algorithm using instantaneous direction of arrival with enhanced noise suppression capability,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Published

2009-12-15

How to Cite

Madhu, N. (2009). Data-driven mask generation for source separation. Proceedings of the International Symposium on Auditory and Audiological Research, 2, 213–222. Retrieved from https://proceedings.isaar.eu/index.php/isaarproc/article/view/2009-22

Issue

Section

2009/2. Perceptual measures and models of spatial hearing