Data-driven mask generation for source separation
Abstract
A microphone-array based approach is presented for the extraction of a target signal from a mixture of competing sources and background noise. The approach builds upon a recent proposal for source localization and tracking in the general M-microphone, Q-source case and extends it into a versatile framework for source separation using data-driven soft or hard masks. The proposed approach is applicable to any arbitrary array geometry, allowing its integration into binaural hearing aids. In contrast to current algorithms, the proposed mask generation scales implicitly with M, Q, the source spread, and the amount of reverberation, obviating the need for heuristic adaptation of the mask-generation algorithm to different acoustical scenarios. Further, the individual signals extracted using these soft masks exhibit low levels of musical noise. Additional mask smoothing may be performed to reduce the musical-noise phenomenon further, thereby improving the listening experience.
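To illustrate the general principle of time-frequency masking that the abstract refers to, the following is a minimal sketch in Python/NumPy. It does not reproduce the paper's data-driven mask estimation; instead, it assumes the individual source magnitude spectrograms are known and applies an oracle soft (ratio) mask and a hard (binary) mask to the mixture. All variable names and the synthetic spectrogram data are illustrative assumptions.

```python
import numpy as np

# Synthetic magnitude spectrograms (frequency bins x time frames) standing in
# for the STFTs of a target source and an interferer. In practice these are
# unknown and the mask must be estimated from the microphone signals.
rng = np.random.default_rng(0)
F, T = 129, 50
S1 = rng.rayleigh(1.0, (F, T))        # target magnitude spectrogram
S2 = rng.rayleigh(0.5, (F, T))        # interferer magnitude spectrogram
X = S1 + S2                           # mixture (magnitudes assumed additive)

eps = 1e-12
soft_mask = S1 / (X + eps)            # soft (ratio) mask, values in [0, 1]
hard_mask = (S1 > S2).astype(float)   # hard (binary) mask: bins where target dominates

est_soft = soft_mask * X              # target estimate via soft masking
est_hard = hard_mask * X              # target estimate via hard masking
```

With an oracle ratio mask, the soft-mask estimate recovers the target spectrogram exactly; estimated (non-oracle) masks trade separation accuracy against artifacts such as musical noise, which is where the smoothing mentioned above comes in.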