Chair of Multimedia Communications and Signal Processing
Prof. Dr.-Ing. André Kaup

DICIT

EU-Projekt DICIT - Distant-talking Interfaces for Control of Interactive TV



DICIT was a European Union-funded project whose main objective was to integrate distant-talking voice interaction as a complementary modality to the remote control in interactive TV systems. To free the user from the constraints imposed by standard close-talking systems with very limited vocabularies, the DICIT system realizes hands-free, seamless, and intuitive control. This enables a natural user-system interaction and greatly eases information retrieval: multiple and possibly moving users can control the TV by voice, e.g., request information about an upcoming program and schedule its recording, without the need for any hand-held or head-mounted gear.
To ensure robust performance under the adverse conditions of a typical living-room scenario and to maximize speech recognition performance, the system requires real-time-capable acoustic signal processing techniques that compensate for impairments of the desired speech signals caused by acoustic echoes from the loudspeakers, local interferers, ambient noise, and reverberation.

Interactive TV Scenario
This requirement defined the role of the LMS Audio Group within the project, which was responsible for the development of the multichannel acoustic front-end (MCAF) as one of the key components for the DICIT prototypes.

The MCAF combines the following state-of-the-art techniques:

  • multichannel acoustic echo cancellation (MC-AEC; red arrow in the figure above),
  • beamforming (BF; light green arrow),
  • blind source separation (BSS; blue arrow),
  • multiple source localization (SLOC; dark green arrow), and
  • smart speech filtering (SSF) based on acoustic event detection and classification.

While FBK-irst delivered the SSF and the traditional correlation-based source localization components, the LMS Audio Group provided the technology for MC-AEC, beamforming, and BSS, and was also responsible for integrating the acoustic signal processing modules into the MCAF.

The figure above schematically illustrates the challenges of the DICIT interactive TV scenario and the corresponding solutions realized by the acoustic signal processing within the multichannel acoustic front-end:
While MC-AEC suppresses the loudspeaker echoes, the beamforming unit steers a beam of increased sensitivity toward the desired source (look direction γ). It thereby suppresses interferers, ambient noise, and reverberation originating from directions other than the look direction, which is provided by the SLOC component. BSS, as an alternative spatial processing to beamforming with multiple beams, enables the extraction of several simultaneously active sources and/or interferers. Since BSS yields the SLOC information itself, no a-priori knowledge of the source positions is needed. Finally, the SSF analyzes a continuous data stream which ideally carries only the commands uttered by the desired source(s) and is cleared of all signal impairments. The SSF is responsible for extracting these commands from the data stream and sending them to the automatic speech recognizer (ASR).
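To make the spatial-filtering idea concrete, the following minimal sketch (Python/NumPy) shows how a uniform linear array can be steered toward a look direction γ so that signals from that direction add up coherently while signals from other directions are attenuated. The array geometry, sampling rate, and the plain delay-and-sum design are illustrative assumptions, not the data-independent beamformers actually used in DICIT.

```python
import numpy as np

def delay_and_sum(mic_signals, fs, mic_spacing, look_direction_deg, c=343.0):
    """Steer a uniform linear array to `look_direction_deg` (0 deg = broadside)
    by delaying and averaging the microphone signals (frequency-domain
    fractional delays). `mic_signals` has shape (num_mics, num_samples)."""
    num_mics, num_samples = mic_signals.shape
    # Per-microphone propagation delays for a plane wave from the look direction
    mic_positions = np.arange(num_mics) * mic_spacing
    delays = mic_positions * np.sin(np.deg2rad(look_direction_deg)) / c  # seconds

    spectra = np.fft.rfft(mic_signals, axis=1)
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    # Advance each channel by its delay so signals from the look direction align
    steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    aligned = spectra * steering
    return np.fft.irfft(aligned.mean(axis=0), n=num_samples)

# Illustrative usage: 13-microphone array, 5 cm spacing, beam 30 degrees off broadside
if __name__ == "__main__":
    fs = 16000
    x = np.random.randn(13, fs)          # 1 s of dummy multichannel input
    y = delay_and_sum(x, fs, 0.05, 30.0)
```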

 

DICIT prototype system architecture and multichannel acoustic front-end realizations

The fully functional DICIT prototype as depicted in the diagram below consists of the following modules:

  • Signal acquisition and playback: comprises a 13-channel microphone array, a stereo loudspeaker system, and a set-top box platform providing access to on-air satellite signals
  • BF-/SLOC-based multichannel acoustic front-end: extracts the desired speech commands (see the description below the figure for more details)
  • Automatic speech recognition (ASR) and natural language understanding (NLU): converts speech to text before interpreting the textual contents
  • Dialogue manager: responsible for managing all user-system interactions and for interfacing to external data and devices
Block diagram of the DICIT prototype

BF-/SLOC-based multichannel acoustic front-end
The DICIT prototype is based on an acoustic front-end which efficiently combines stereo acoustic echo cancellation (SAEC), beamforming, traditional correlation-based source localization, and smart speech filtering. The front-end and its connection to the signal acquisition and playback stage are depicted in the left figure below.
While BF extracts the speech signal originating from the desired look direction with minimum distortion and suppresses unwanted noise and interference, AEC compensates for the acoustic coupling between loudspeakers and sensors. Since the scenario implies an almost unconstrained and possibly time-varying user position, a correspondingly adaptive BF structure was employed. Because applying SAEC to all 13 microphone signals is computationally too expensive, the SAEC was placed behind the BF structure. A set of five data-independent beamformers is computed in parallel; they cover the possible speaker positions and track moving users by switching between beams. Thereby, the AECs do not need to track time-varying beamformers. Instead of one SAEC behind each beamformer output, only one SAEC is computed, namely for the beam covering the source of interest. Assuming that beam switches occur infrequently, the necessary readaptation of the SAEC filter coefficients is acceptable. Reusing the AEC filter coefficients determined for previously selected beamformers further reduces the impact of occasional beam switches.
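As an illustration of this design choice, here is a minimal sketch of a stereo echo canceller that runs only on the currently selected beamformer output and caches its filter coefficients per beam, so that switching back to a beam resumes from the coefficients adapted for it earlier. The filter length, step size, and the plain NLMS update are assumptions for illustration, not the actual DICIT SAEC.

```python
import numpy as np

class BeamSwitchedStereoAEC:
    """Sketch of the 'one SAEC behind the selected beam' idea: a stereo NLMS
    echo canceller runs on the selected beamformer output only, and the filter
    coefficients are cached per beam so that a later switch back to a beam can
    resume from its previously adapted state (illustrative, not DICIT code)."""

    def __init__(self, num_beams, filter_len=1024, mu=0.5, eps=1e-6):
        self.filter_len, self.mu, self.eps = filter_len, mu, eps
        # One pair of adaptive filters (left/right loudspeaker channel) per beam
        self.coeffs = [np.zeros((2, filter_len)) for _ in range(num_beams)]
        self.active_beam = 0

    def switch_beam(self, new_beam):
        # Coefficients of the old beam stay in the cache; the new beam resumes
        # from whatever was adapted for it before (or from zeros).
        self.active_beam = new_beam

    def process_sample(self, beam_sample, ref_history):
        """beam_sample: current sample of the selected beamformer output.
        ref_history: array of shape (2, filter_len) holding the most recent
        loudspeaker samples (index 0 = newest) of the left and right channel."""
        w = self.coeffs[self.active_beam]
        echo_estimate = np.sum(w * ref_history)
        error = beam_sample - echo_estimate            # echo-compensated output
        norm = np.sum(ref_history ** 2) + self.eps
        w += self.mu * error * ref_history / norm      # NLMS coefficient update
        return error
```

A beam switch then only calls switch_beam(); since previously adapted coefficients are reused, the readaptation effort after occasional switches stays small, mirroring the reasoning above.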
The selection of the beamformer output to be passed to the SAEC is made by the source localization. As the SLOC has to operate on microphone signals which still contain acoustic echoes of the TV audio signals, a-priori knowledge of the loudspeaker positions is exploited to exclude the TV loudspeakers as sources of interest. Finally, the SSF module analyzes the output of the SAEC in order to detect speech segments from the user. For a robust system it is crucial that only the desired speech segments, and no nonstationary noise or echo residuals, are passed to the ASR; the corresponding decision is supported by the SLOC information.
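The following sketch illustrates a correlation-based localization step of this kind: a GCC-PHAT direction-of-arrival estimate for one microphone pair is discarded if it points at one of the known loudspeaker positions, and otherwise the closest fixed beam is selected. The microphone pair, guard angle, and helper names are assumptions for illustration, not the DICIT SLOC component.

```python
import numpy as np

def gcc_phat_doa(x1, x2, fs, mic_distance, c=343.0):
    """Estimate the direction of arrival (degrees, 0 = broadside) for one
    microphone pair via GCC-PHAT, a classical correlation-based localizer."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
    max_lag = int(np.ceil(mic_distance / c * fs))
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))   # lags -max..+max
    tdoa = (np.argmax(np.abs(cc)) - max_lag) / fs
    return np.rad2deg(np.arcsin(np.clip(tdoa * c / mic_distance, -1.0, 1.0)))

def select_beam(doa_deg, beam_directions_deg, loudspeaker_directions_deg, guard_deg=15.0):
    """Exclude estimates pointing at the (known) TV loudspeakers and pick the
    fixed beam closest to the remaining user direction; returns None if the
    estimate falls inside a loudspeaker sector."""
    if any(abs(doa_deg - ls) < guard_deg for ls in loudspeaker_directions_deg):
        return None   # dominated by loudspeaker echo; keep the previous beam
    return int(np.argmin([abs(doa_deg - b) for b in beam_directions_deg]))
```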

BSS-based multichannel acoustic front-end
As an alternative configuration, the BSS-based front-end introduces blind source separation as the spatial processing stage. The front-end is depicted in the right figure below.
Since BSS can be interpreted as a set of adaptive null-beamformers, it replaces the data-independent beamformers and the source localization of the BF-/SLOC-based approach. One major advantage of the BSS-based front-end is the reduced number of microphones: the envisaged BSS-based front-end needs only two sensors, which is of great importance with respect to overall system complexity, user acceptance, and cost. A second benefit is that, in contrast to the BF-/SLOC-based front-end, which can currently extract only one active user, BSS using two sensor signals is also able to extract two simultaneously speaking users. In any case, two data streams are delivered to the following SSF module, carrying the following signals:

  • If no user is active, two zero-valued signals arrive at the SSF component
  • If one user is active, that user's signal will appear in one SSF input and will be attenuated in the other SSF input
  • If two users are simultaneously active, each SSF input will be dominated by one user signal


The AEC is performed directly on the microphone inputs before the outputs are passed to the BSS. The SLOC module represents an additional source of information: it may supplement the BSS-inherent source localization and thus also help to improve the decisions made by the SSF. The SSF first processes the two input streams provided by the BSS in order to detect speech segments and to reject any non-speech event by means of an acoustic event classifier. Moreover, because the SSF here has to work on more than one input stream, two simultaneously active speakers will likely create two streams with valid speech segments. It must therefore be decided which speech signal to pass to the ASR and which one to reject; this decision could be based on speaker identification.
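A minimal sketch of the resulting stream handling for the three cases listed above follows. The energy threshold and the simple per-frame activity decision are placeholders for the actual acoustic event detection, and the tie-break for two active speakers stands in for the speaker identification mentioned in the text.

```python
import numpy as np

def frame_active(frame, threshold=1e-4):
    """Crude energy-based activity decision for one BSS output frame
    (a real SSF uses acoustic event detection/classification instead)."""
    return np.mean(frame ** 2) > threshold

def select_stream_for_asr(bss_out_1, bss_out_2):
    """Illustrates the three cases for the two BSS output streams.
    Returns the frame to forward to the ASR, or None if no user is active."""
    active_1, active_2 = frame_active(bss_out_1), frame_active(bss_out_2)
    if not active_1 and not active_2:
        return None                                   # no user active
    if active_1 != active_2:
        return bss_out_1 if active_1 else bss_out_2   # exactly one active user
    # Both users active: forward the dominant stream, reject the other
    # (placeholder for a speaker-identification-based decision)
    return bss_out_1 if np.mean(bss_out_1 ** 2) >= np.mean(bss_out_2 ** 2) else bss_out_2
```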

Left: Block diagram of the signal acquisition and playback stage and the BF-/SLOC-based acoustic front-end
Right: Block diagram of the BSS-based acoustic front-end

 

The DICIT prototype in action - Demo at LMS


In the video you can see a live demonstration of the final DICIT prototype, recorded in the Multimedia Room at LMS.
The clip shows the stereo loudspeaker system and the 13-channel microphone array arranged around the projection screen while the user gives commands to the system. To convey the acoustic difficulty of the scene, one of the array microphone signals is played back before the input of the automatic speech recognizer is presented: in the ASR input, the echo of the stereo loudspeaker system (and noise from directions other than the user's) is strongly attenuated, while the user's commands remain clearly audible. To make the conversation between the users intelligible for the viewer of this video, an audio track with amplified users' voices is used for the final part of the clip. (For such speech, which is directed not at the system but at other persons in the room, the speech recognizer can be disabled via remote control.)
Despite the critical acoustic environment, the natural phrasing of the user commands (e.g., "What can I do?", "The volume is still too low"), and the fact that the system was not trained on the voices of the involved users but is based on a corpus of American English, the system manages to fulfill all the users' requests. The superiority of voice control over a normal remote control becomes especially obvious when navigating a dialogue or scheduling a program (e.g., "I'd like to watch a talk show on Thursday", "Sport Today").

You can watch this video in high definition on YouTube


Publications:

2011-57: W. Kellermann, Y. Zheng, "Method and apparatus for blind source separation improving interference estimation in binaural Wiener filtering," EP 00 0002 211 563 B1, Aug. 2011
2010-84: A. Sehr, R. Maas, W. Kellermann, "Model-Based Dereverberation in the Logmelspec Domain for Robust Distant-Talking Speech Recognition," Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4298-4301, Dallas, USA, Mar. 2010
2010-80: W. Kellermann, Y. Zheng, "Method and apparatus for blind source separation improving interference estimation in binaural Wiener filtering," EP 00 0002 211 563 A1, Jul. 2010
2010-53: A. Sehr, W. Kellermann, "On the Statistical Properties of Reverberant Speech Feature Vector Sequences," Proc. International Workshop on Acoustic Echo and Noise Control (IWAENC), Tel Aviv, Israel, Aug. 2010
2010-41: A. Brutti, L. Cristoforetti, W. Kellermann, L. Marquardt, "WOZ acoustic data collection for interactive TV," Language Resources and Evaluation, Vol. 44, No. 3, pp. 205-219, Sep. 2010
2010-19: A. Sehr, R. Maas, W. Kellermann, "Reverberation Model-Based Decoding in the Logmelspec Domain for Robust Distant-Talking Speech Recognition," IEEE Transactions on Audio, Speech and Language Processing (IEEE TASLP), Vol. 18, No. 7, pp. 1676-1691, Sep. 2010
2009-67: L. Marquardt, P. Svaizer, E. Mabande, A. Brutti, C. Zieger, M. Omologo, W. Kellermann, "A natural acoustic front-end for Interactive TV in the EU-Project DICIT," Pacific Rim Conference on Communications, Computers and Signal Processing (PacRim), pp. 894-899, Victoria, Canada, Jan. 2009
2009-64: A. Sehr, W. Kellermann, "Strategies for modeling reverberant speech in the feature domain," Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pp. 3725-3728, Apr. 2009
2009-58: A. Sehr, M. Gardill, W. Kellermann, "Adapting HMMs of Distant-Talking ASR Systems Using Feature-Domain Reverberation Models," Proc. European Signal Processing Conference (EUSIPCO), pp. 540-543, Glasgow, Scotland, Aug. 2009
2009-27: W. Brandhuber, "On the Design of Unitary Filterbanks for the Construction of Orthonormal Wavelets," Vol. 24, Shaker Verlag, Aachen, Germany, Mar. 2009
2009-22: E. Mabande, A. Schad, W. Kellermann, "Design of Robust Superdirective Beamformers as a Convex Optimization Problem," IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 77-80, Taipei, Taiwan, Apr. 2009
2009-21: E. Mabande, A. Schad, W. Kellermann, "Robust Superdirectional Beamforming for Hands-Free Speech Capture in Cars," Proc. NAG/DAGA International Conference on Acoustics, pp. 438-441, Rotterdam, Netherlands, Mar. 2009
2008-58: A. Sehr, W. Kellermann, "A Simplified Decoding Method for a Robust Distant-Talking ASR Concept Based on Feature-Domain Dereverberation," Proc. Int. Workshop on Acoustic Echo and Noise Control (IWAENC), Seattle, Washington, USA, Sep. 2008
2008-57: A. Sehr, J. Y. C. Wen, W. Kellermann, P. A. Naylor, "A Combined Approach for Estimating a Feature-Domain Reverberation Model in Non-diffuse Environments," Proc. Int. Workshop on Acoustic Echo and Noise Control (IWAENC), Seattle, Washington, USA, Sep. 2008
2008-54: A. Sehr, W. Kellermann, "Model-Based Dereverberation of Speech in the Mel-Spectral Domain," Proc. Asilomar Conference on Signals, Systems, and Computers, Oct. 2008
2008-53: A. Sehr, W. Kellermann, "New Results for Feature-Domain Reverberation Modeling," Proc. Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), Trento, Italy, May 2008
2007-11: A. Sehr, W. Kellermann, "A new concept for feature-domain dereverberation for robust distant-talking ASR," Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pp. IV-369 - IV-372, Honolulu, HI, USA, Apr. 2007