Chair of
Multimedia Communications and Signal Processing
Prof. Dr.-Ing. André Kaup

Reverberation modeling in speech recognition

Field of activity: Audio and Acoustic Signal Processing
Research topic: Signal Improvement and Detection
Staff: Prof. Dr.-Ing. Walter Kellermann

Distant-Talking Automatic Speech Recognition

We know Automatic Speech Recognition (ASR) from various applications in everyday life, e.g., dictation systems or telephone hotlines. These ASR systems generally work very reliably because the user wears a close-talking microphone. In some applications, however, close-talking microphones are very inconvenient, and distant microphones, e.g., installed in the ASR device itself, must be employed instead. Examples are the automatic transcription of meetings and the voice control of television sets (Fig.1), humanoid robots, and other electronic devices.

Fig.1: Control of an interactive TV by speech commands.


Since the speaker in such distant-talking scenarios is usually one to several meters away from the microphone, the received signal is distorted by additive interference, such as background noise and competing speakers, and by reverberation of the desired signal caused by multipath propagation of sound waves in enclosures (Fig.2). Reverberation substantially changes the statistical properties of the speech signal and thus reduces the recognition rate of state-of-the-art ASR systems if no measures to increase robustness are taken.
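The distortion described above can be illustrated with a minimal simulation: the distant-microphone signal is the dry signal convolved with a room impulse response, plus additive background noise. The exponentially decaying noise model for the room impulse response and all parameter values below are illustrative assumptions, not part of the REMOS framework.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000                                  # sampling rate in Hz (assumed)

# Stand-in for a clean speech signal: one second of a decaying tone.
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 220 * t) * np.exp(-3 * t)

# Synthetic room impulse response: exponentially decaying white noise,
# a simple model of diffuse reverberation with T60 = 0.5 s
# (energy decays by 60 dB, i.e., a factor exp(-6.9), over T60).
t60 = 0.5
n_rir = int(t60 * fs)
rir = rng.standard_normal(n_rir) * np.exp(-6.9 * np.arange(n_rir) / n_rir)
rir /= np.max(np.abs(rir))

# Multipath propagation = convolution of the dry signal with the RIR;
# the distant microphone additionally picks up background noise.
reverberant = np.convolve(clean, rir)
noisy = reverberant + 0.01 * rng.standard_normal(reverberant.size)

print(noisy.shape)  # (23999,) = fs + n_rir - 1 samples
```

Replacing the synthetic tone with a recorded utterance and a measured room impulse response yields exactly the kind of signal a distant-talking recognizer has to cope with.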


Fig.2: Illustration of ASR in reverberant environments
(desired signal, reverberation).

State-of-the-Art

The most common approach to handling reverberation is to train the recognizer on reverberant speech data, e.g., recorded in the target room. However, if such an ASR system is then to be used in a different acoustic environment, a complete retraining of the recognizer is necessary, which usually entails a major computational effort.

The REMOS concept - REverberation MOdeling for Speech recognition

REMOS (REverberation MOdeling for Speech recognition) is a generic framework for robust distant-talking ASR. The key idea of REMOS is to model a reverberant utterance in the time-frequency domain as the convolution of the clean utterance with the time-frequency representation of the room impulse response (Fig.3).

Fig.3: Illustration of the convolution in the time-frequency domain.
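This time-frequency convolution can be sketched in a few lines. The helper below is a hypothetical illustration, not REMOS code: it convolves each frequency band of a clean feature spectrogram with the corresponding band of the room impulse response's time-frequency envelope, assuming magnitudes add linearly.

```python
import numpy as np

def tf_convolve(clean_spec, rir_spec):
    """Approximate a reverberant spectrogram as a per-band convolution.

    clean_spec: (F, N) feature representation of the clean utterance
    rir_spec:   (F, M) time-frequency envelope of the room impulse response
    returns:    (F, N + M - 1) approximation of the reverberant utterance
    """
    num_bands, num_frames = clean_spec.shape
    num_rir_frames = rir_spec.shape[1]
    out = np.zeros((num_bands, num_frames + num_rir_frames - 1))
    for f in range(num_bands):
        # Each frequency band is smeared along the frame (time) axis.
        out[f] = np.convolve(clean_spec[f], rir_spec[f])
    return out
```

The smearing along the frame axis is exactly why reverberation changes the statistics of the feature vector sequence: each reverberant frame mixes contributions from several preceding clean frames.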


During recognition, REMOS inverts this convolution by separating each reverberant utterance into its clean part and the contribution of the room impulse response. To this end, a statistical reverberation model describing the acoustics of the target room has to be estimated once prior to recognition. Fig.4 shows an example of a clean utterance, a reverberant utterance, and an utterance that has been dereverberated with REMOS.
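A deliberately simplified illustration of inverting the convolution is given below. It assumes a known mean time-frequency envelope of the room impulse response and additive magnitudes, which is a strong simplification of the statistical model REMOS actually uses; the function name and interface are hypothetical.

```python
import numpy as np

def dereverberate(reverb_spec, rir_env):
    """Recover a clean spectrogram estimate frame by frame (sketch).

    reverb_spec: (F, N) reverberant feature spectrogram
    rir_env:     (F, M) estimated time-frequency envelope of the RIR
    """
    num_bands, num_frames = reverb_spec.shape
    num_rir_frames = rir_env.shape[1]
    clean_est = np.zeros_like(reverb_spec)
    for n in range(num_frames):
        est = reverb_spec[:, n].copy()
        # Subtract the reverberant energy carried over from earlier frames.
        for m in range(1, min(num_rir_frames, n + 1)):
            est -= rir_env[:, m] * clean_est[:, n - m]
        # Floor at zero (energies cannot be negative), undo the direct path.
        clean_est[:, n] = np.maximum(est, 0.0) / rir_env[:, 0]
    return clean_est
```

In REMOS the envelope is not a fixed deterministic quantity but a statistical model, and the separation is performed jointly with decoding; the frame-by-frame subtraction above only conveys the basic idea.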

Fig.4a: Clean utterance "one, one, eight".

 

Fig.4b: Reverberant utterance "one, one, eight".

 

Fig.4c: REMOS-dereverberated utterance "one, one, eight".


The advantage of REMOS over other state-of-the-art ASR systems is that changing reverberation conditions, e.g., room changes, do not require a complete retraining of the recognizer. It suffices to re-estimate the statistical reverberation model to adapt REMOS to the new environment. Since this can be done efficiently, REMOS is a very flexible framework for reverberant speech recognition.

Funding

The LMS Chair gratefully acknowledges funding by the Deutsche Forschungsgemeinschaft (DFG) for the project "Reverberation modeling for robust speech recognition in reverberant environments" under project number KE 890/4-1.

Publications

2016-70  K. Kinoshita, M. Delcroix, S. Gannot, E.A.P. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj, A. Sehr, T. Yoshioka
REVERB challenge: A benchmark task for reverberation-robust ASR techniques
New Era for Robust Speech Recognition: Exploiting Deep Learning, S. Watanabe, M. Delcroix, F. Metze, and J. R. Hershey (Eds.), Springer (in press), 2016

2016-18  Sam Nees, A. Schwarz, W. Kellermann
Dereverberation Using a Model for the Spatial Coherence of Decaying Reverberant Sound Fields in Rectangular Rooms
AES 60th International Conference: DREAMS (Dereverberation and Reverberation of Audio, Music, and Speech), Pages: 1--8, Feb. 2016

2015-47  H. Barfuss, C. Hümmer, A. Schwarz, W. Kellermann
Robust coherence-based spectral enhancement for distant speech recognition
available on arXiv.org, Dec. 2015

2015-44  R. Maas, C. Hümmer, A. Sehr, W. Kellermann
A Bayesian view on acoustic model-based techniques for robust speech recognition
EURASIP Journal on Advances in Signal Processing (JASP), Num. 2015:103, Pages: 1--16, Dec. 2015

2015-35  Sam Nees, A. Schwarz, W. Kellermann
A model for the temporal evolution of the spatial coherence in decaying reverberant sound fields
Journal of the Acoustical Society of America (JASA), Vol. 138, Num. 3, Pages: EL248--EL253, Sep. 2015

2015-18  A. Schwarz, C. Hümmer, R. Maas, W. Kellermann
Real-Time Dereverberation for Deep Neural Network Speech Recognition
DAGA 2015, Pages: 139--142, Nürnberg, Germany, Mar. 2015

2015-17  A. Schwarz, W. Kellermann
Coherent-to-Diffuse Power Ratio Estimation for Dereverberation
IEEE Transactions on Audio, Speech and Language Processing (IEEE TASLP), Vol. 23, Num. 6, Pages: 1006--1018, Apr. 2015

2015-1  A. Schwarz, C. Hümmer, R. Maas, W. Kellermann
Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments
IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Pages: 4380--4384, Brisbane, Australia, Apr. 2015

2014-28  A. Schwarz, W. Kellermann
Unbiased Coherent-to-Diffuse Ratio Estimation for Dereverberation
International Workshop on Acoustic Signal Enhancement (IWAENC), Pages: 6--10, Antibes - Juan les Pins, France, Sep. 2014

2014-17  A. Sehr, H. Barfuss, C. Hofmann, R. Maas, W. Kellermann
Efficient Training of Acoustic Models for Reverberation-Robust Medium-Vocabulary Automatic Speech Recognition
Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), Pages: 177--181, Nancy, France, May 2014

2013-55  K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, E. Habets, R. Haeb-Umbach, V. Leutnant, A. Sehr, W. Kellermann, R. Maas, S. Gannot, B. Raj
The REVERB Challenge: A Common Evaluation Framework for Dereverberation and Recognition of Reverberant Speech
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Pages: 1--4, New Paltz, NY, USA, Oct. 2013

2013-14  R. Maas, A. Sehr, T. Yoshioka, M. Delcroix, K. Kinoshita, T. Nakatani, W. Kellermann
Formulation of the REMOS Concept from an Uncertainty Decoding Perspective
International Conference on Digital Signal Processing (DSP), Pages: 1--6, Santorini, Greece, Jul. 2013

2013-13  C. Hofmann, A. Sehr, S. Sahoo, R. Maas, W. Kellermann
New Results on Automatic Speech Recognition in Extremely Reverberant Environments
AIA-DAGA 2013 Conference on Acoustics, Pages: 2059--2062, Merano, Italy, Mar. 2013

2013-12  R. Maas, A. Thippur, A. Sehr, W. Kellermann
An Uncertainty Decoding Approach to Noise- and Reverberation-Robust Speech Recognition
IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Pages: 7388--7392, Vancouver, Canada, May 2013

2012-58  T. Yoshioka, A. Sehr, M. Delcroix, K. Kinoshita, R. Maas, T. Nakatani, W. Kellermann
Survey on Approaches to Speech Recognition in Reverberant Environments
APSIPA Annual Summit and Conference, Pages: 1--4, Los Angeles, USA, Dec. 2012

2012-37  T. Yoshioka, A. Sehr, M. Delcroix, K. Kinoshita, R. Maas, T. Nakatani, W. Kellermann
Making Machines Understand Us in Reverberant Rooms: Robustness against Reverberation for Automatic Speech Recognition
IEEE Signal Processing Magazine, Vol. 29, Num. 6, Pages: 114--126, Nov. 2012

2012-17  R. Maas, Sujan R. Kotha, A. Sehr, W. Kellermann
Combined-Order Hidden Markov Models for Reverberation-Robust Speech Recognition
Proc. Int. Workshop on Cognitive Information Processing (CIP), Pages: 167--171, Baiona, Spain, May 2012

2012-15  R. Maas, A. Schwarz, K. Reindl, Y. Zheng, S. Meier, A. Sehr, W. Kellermann
Matching the Acoustic Model to Front-End Signal Processing for ASR in Noisy and Reverberant Environments
Proc. Deutsche Jahrestagung für Akustik (DAGA), Pages: 637--638, Darmstadt, Germany, Mar. 2012

2012-14  R. Maas, E.A.P. Habets, A. Sehr, W. Kellermann
On the Application of Reverberation Suppression to Robust Speech Recognition
IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Pages: 297--300, Kyoto, Japan, Mar. 2012

2011-44  R. Maas, A. Schwarz, Y. Zheng, K. Reindl, S. Meier, A. Sehr, W. Kellermann
A Two-Channel Acoustic Front-End for Robust Automatic Speech Recognition in Noisy and Reverberant Environments
International Workshop on Machine Listening in Multisource Environments (CHiME), Pages: 41--46, Florence, Italy, Sep. 2011

2011-28  A. Sehr, C. Hofmann, R. Maas, W. Kellermann
Multi-Style Training of HMMs with Stereo Data for Reverberation-Robust Speech Recognition
Proc. Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), Pages: 196--199, Edinburgh, UK, May 2011

2011-27  R. Maas, M. Wolf, A. Sehr, C. Nadeu, W. Kellermann
Extension of the REMOS Concept to Frequency-Filtering-Based Features for Reverberation-Robust Speech Recognition
Proc. Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), Pages: 13--16, Edinburgh, UK, May 2011

2011-19  A. Sehr, R. Maas, W. Kellermann
Frame-Wise HMM Adaptation Using State-Dependent Reverberation Estimates
IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Pages: 5484--5487, Prague, Czech Republic, May 2011

2011-6  R. Maas, A. Sehr, W. Kellermann
Reverberation Modeling for Robust Speech Recognition
Deutsche Jahrestagung für Akustik (DAGA), Pages: 217--218, Düsseldorf, Germany, Mar. 2011

2010-87  R. Maas, A. Sehr, W. Kellermann
Multi-Style Reverberation Models and Efficient Model Adaptation for Robust Distant-Talking Speech Recognition with REMOS
ITG Conference on Speech Communication, Pages: 28, Bochum, Germany, Oct. 2010

2010-84  A. Sehr, R. Maas, W. Kellermann
Model-Based Dereverberation in the Logmelspec Domain for Robust Distant-Talking Speech Recognition
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Pages: 4298--4301, Dallas, USA, Mar. 2010

2010-83  R. Maas, A. Sehr, M. Gugat, W. Kellermann
A Highly Efficient Optimization Scheme for REMOS-Based Distant-Talking Speech Recognition
Proc. European Signal Processing Conference (EUSIPCO), Pages: 1983--1987, Aalborg, Denmark, Aug. 2010

2010-61  A. Sehr
Reverberation Modeling for Robust Distant-Talking Speech Recognition
Verlag Dr. Hut, München, 2010

2010-55  A. Sehr, C. Hofmann, R. Maas, W. Kellermann
A Novel Approach for Matched Reverberant Training of HMMs using Data Pairs
Proc. INTERSPEECH, Pages: 566--569, Makuhari, Japan, Sep. 2010

2010-54  A. Sehr, E.A.P. Habets, R. Maas, W. Kellermann
Towards a Better Understanding of the Effect of Reverberation on Speech Recognition Performance
Proc. International Workshop on Acoustic Echo and Noise Control (IWAENC), Tel Aviv, Israel, Aug. 2010

2010-53  A. Sehr, W. Kellermann
On the Statistical Properties of Reverberant Speech Feature Vector Sequences
Proc. International Workshop on Acoustic Echo and Noise Control (IWAENC), Tel Aviv, Israel, Aug. 2010

2010-19  A. Sehr, R. Maas, W. Kellermann
Reverberation Model-Based Decoding in the Logmelspec Domain for Robust Distant-Talking Speech Recognition
IEEE Transactions on Audio, Speech and Language Processing (IEEE TASLP), Vol. 18, Num. 7, Pages: 1676--1691, Sep. 2010

2009-64  A. Sehr, W. Kellermann
Strategies for modeling reverberant speech in the feature domain
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Pages: 3725--3728, Apr. 2009

2009-58  A. Sehr, M. Gardill, W. Kellermann
Adapting HMMs of Distant-Talking ASR Systems Using Feature-Domain Reverberation Models
Proc. European Signal Processing Conference (EUSIPCO), Pages: 540--543, Glasgow, Scotland, Aug. 2009

2009-57  Jimi Y. C. Wen, A. Sehr, Patrick A. Naylor, W. Kellermann
Blind estimation of a feature-domain reverberation model in non-diffuse environments with variance adjustment
Proc. European Signal Processing Conference (EUSIPCO), Pages: 175--178, Glasgow, Scotland, Aug. 2009

2008-58  A. Sehr, W. Kellermann
A Simplified Decoding Method for a Robust Distant-Talking ASR Concept Based on Feature-Domain Dereverberation
Proc. Int. Workshop on Acoustic Echo and Noise Control (IWAENC), Seattle, Washington, USA, Sep. 2008

2008-57  A. Sehr, Jimi Y. C. Wen, W. Kellermann, Patrick A. Naylor
A Combined Approach for Estimating a Feature-Domain Reverberation Model in Non-diffuse Environments
Proc. Int. Workshop on Acoustic Echo and Noise Control (IWAENC), Seattle, Washington, USA, Sep. 2008

2008-54  A. Sehr, W. Kellermann
Model-Based Dereverberation of Speech in the Mel-Spectral Domain
Proc. Asilomar Conference on Signals, Systems, and Computers, Oct. 2008

2008-53  A. Sehr, W. Kellermann
New Results for Feature-Domain Reverberation Modeling
Proc. Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), Trento, Italy, May 2008

2008-31  A. Sehr, W. Kellermann
Towards Robust Distant-Talking Automatic Speech Recognition in Reverberant Environments
Speech and Audio Processing in Adverse Environments, E. Hänsler and G. Schmidt (Eds.), Pages: 679--728, Springer, Berlin, 2008

2007-69  A. Sehr, Y. Zheng, E. Nöth, W. Kellermann
Maximum likelihood estimation of a reverberation model for robust distant-talking speech recognition
Proc. European Signal Processing Conference (EUSIPCO), Pages: 1299--1303, Poznan, Poland, Sep. 2007

2006-67  A. Sehr, M. Zeller, W. Kellermann
Distant-talking continuous speech recognition based on a novel reverberation model in the feature domain
Proc. INTERSPEECH 2006, Pages: 769--772, Pittsburgh, PA, USA, Sep. 2006

2006-39  A. Sehr, M. Zeller, W. Kellermann
Hands-free speech recognition using a reverberation model in the feature domain
Proc. European Signal Processing Conference (EUSIPCO), Florence, Italy, Sep. 2006

2006-23  A. Sehr, O. Gress, W. Kellermann
Synthetisches Multicondition-Training zur robusten Erkennung verhallter Sprache (in German)
Proc. ITG Fachtagung Sprachkommunikation, Kiel, Germany, Apr. 2006