Research Themes

Research Theme 1: Bridging the Gap between Recognition and Synthesis (RS)


An increasing number of current text-to-speech techniques are borrowed from speech recognition, particularly in areas such as the automatic segmentation of corpora for speech synthesis and the development of trajectory HMMs for speech synthesis. Indeed, the trajectory HMM is a good example of the research we have in mind: the standard HMM used in ASR was modified specifically for speech synthesis, and the modified form is now being reapplied to ASR. At the same time, there is increasing interest in links between the basic concepts underlying unit selection synthesis and the way in which speech and language are represented and processed in the human brain. There is mounting evidence that we do not produce speech by arranging neural commands for the movements of individual articulators in the right order, and that we do not parse acoustic input into phonemes only to assemble them into meaningful words and phrases. Rather, the evidence suggests that speech production and perception involve the activation of longer stretches of speech that are represented in the brain on a hierarchy of levels of abstraction. Thus, both speech synthesis and recognition research take an interest in the efficient storage and retrieval of units the size of words and phrases.

In SCALE we aim to develop these commonalities further to improve both recognition and synthesis, and to provide richer, more sophisticated models of the speech signal without sacrificing trainability and the ability to estimate model parameters from (potentially very small amounts of) data. We focus on trajectory models, reactive speech synthesis, and more sophisticated signal models. In order to test novel methods and algorithms and make them available for medium-term commercial applications, we will use state-of-the-art recognition and synthesis systems that the SCALE partners have built in-house and can therefore adapt to the results of our R&D efforts.
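
To make the static/dynamic-feature constraint behind trajectory models concrete, the following minimal Python sketch computes the maximum-likelihood static trajectory from state means and variances over static and delta features, the computation that trajectory HMMs build on. The toy dimensions, the simple first-order delta window and the diagonal covariances are our own illustrative assumptions, not part of the proposed systems:

    import numpy as np

    # Minimal sketch of ML speech-parameter generation under delta constraints.
    # Dimensions, window coefficients and covariances are illustrative only.
    T = 5                                   # number of frames (toy example)

    # Window matrix W maps static features c (T values) to [static; delta] features.
    W_static = np.eye(T)
    W_delta = np.zeros((T, T))
    for t in range(T):                      # first-order delta: 0.5 * (c[t+1] - c[t-1])
        if t > 0:
            W_delta[t, t - 1] = -0.5
        if t < T - 1:
            W_delta[t, t + 1] = 0.5
    W = np.vstack([W_static, W_delta])      # shape (2T, T)

    mu = np.concatenate([np.array([0., 0., 1., 1., 1.]), np.zeros(T)])  # toy static + delta means
    sigma_inv = np.eye(2 * T)               # toy inverse (diagonal) covariances

    # ML trajectory: c_hat = (W' S^-1 W)^-1 W' S^-1 mu
    A = W.T @ sigma_inv @ W
    b = W.T @ sigma_inv @ mu
    c_hat = np.linalg.solve(A, b)
    print(c_hat)                            # smoothed version of the static target means

Solving this small linear system yields trajectories that are smooth across state boundaries, which is the property that makes such models attractive for synthesis and, reapplied, for recognition.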

Specific topics are:
1. RS-1: Trajectory HMMs for Reactive Speech Synthesis;
2. RS-2: Towards Speaker Invariance: Compensating for Coarticulation;
3. RS-3: Hierarchical Trajectory Models for Speech Recognition;
4. RS-4: Speech Synthesis by Analysis.

Research Theme 2: Bridging the Gap between ASR and HSR (AHSR)


Current ASR systems have been shown to be somewhat robust to various distortions of the input signal relative to the training data, but even the most powerful systems fall dramatically short of human performance. It is generally agreed that the gap between human and machine performance can only be bridged by introducing those aspects of human processing that make the difference. We believe that human speech processing operates on several interacting hierarchical levels. In this research theme we aim to improve the performance of automatic speech processing by developing and testing a hierarchical architecture. At the same time, we intend to advance our understanding of human speech processing by analysing the improvements in automatic processing obtained by introducing concepts from human processing.

Both human and automatic speech recognition deal with the transformation of a speech signal into a sequence of lexical tokens (words), and both use concepts such as lexicon, search, word representations, context dependency and word activation. Computational models in both ASR and HSR must support adaptation to previously unheard patterns at the level of articulatory features, phonemes, words and phrases. It is well known that adults find it very difficult to learn distinctions between articulatory features and phonemes that do not occur in their native language(s), while they have little difficulty learning and processing new words and phrases.

While ASR for languages such as English, which has relatively simple morphological structure, can be approached
by assuming that there is a fixed lexicon that contains all possible words, such an approach is doomed
to fail with languages such as Dutch, German, Turkish and Arabic, which have a much more complex morphology.
For languages with a complex morphology it is necessary to add an extra level to the decoder, in which subword units are combined into words from the lexicon or into new words formed from the lexicon and the morphological rules of the language. Obviously, this additional component of the decoder needs to integrate hypotheses about subword units resulting from bottom-up signal processing with the detection of out-of-vocabulary words, which relies heavily on top-down verification.
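
As a rough illustration of how such a subword level could combine hypothesised units into known and novel words, consider the Python sketch below; the toy Dutch morphs, the miniature lexicon and the greedy composition rule are invented for illustration and do not represent the decoder architecture itself:

    # Illustrative morph-to-word composition for a morphologically rich language.
    # The lexicon and morph sequence are toy examples, not real resources.
    LEXICON = {"huisdeur", "sleutel"}                      # known full-form words

    def compose(morph_sequence):
        """Greedily join adjacent morph hypotheses into word candidates."""
        words, buffer = [], ""
        for morph in morph_sequence:
            candidate = buffer + morph
            if candidate in LEXICON:                       # known word: emit it
                words.append(candidate)
                buffer = ""
            else:
                buffer = candidate                         # keep building a new word
        if buffer:
            words.append(buffer)                           # possible out-of-vocabulary word
        return words

    print(compose(["huis", "deur", "sleutel", "tje"]))     # ['huisdeur', 'sleutel', 'tje']
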
The efficiency and robustness of speech communication are to a large extent due to the fact that speech signals are highly redundant. All speech sounds are characterised by a large number of articulatory and acoustic features; consequently, humans can recognise the sounds even if some of the features are distorted by coarticulation or background noise. However, little is known about the ways in which humans detect and process the relevant features, especially in adverse acoustic conditions. Yet it is reasonable to assume that the processes which allow us to separate speech from background signals are similar to the processes investigated in computational acoustic scene analysis. Therefore, we will develop rich and redundant acoustic representations of speech signals, including features such as harmonicity, onset time and location, in conjunction with adaptive machine learning techniques that discover which features are relevant under specific environmental conditions. The back-end decoder must, of course, be adapted so that it can process the novel features.
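
For illustration, the Python sketch below computes two such redundant descriptors per frame, an autocorrelation-based harmonicity measure and a log-energy onset measure; the frame sizes, pitch search range and the exact definitions are assumptions made for this example rather than the final feature set:

    import numpy as np

    # Toy per-frame harmonicity and onset descriptors; all parameter values are illustrative.
    def frame_features(signal, sr=16000, frame_len=400, hop=160):
        feats = []
        prev_energy = 1e-10
        for start in range(0, len(signal) - frame_len, hop):
            frame = signal[start:start + frame_len] * np.hanning(frame_len)
            energy = np.sum(frame ** 2) + 1e-10
            # Harmonicity: peak of the normalised autocorrelation in a plausible pitch range.
            ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
            lo, hi = sr // 400, sr // 80                   # roughly 80-400 Hz pitch search range
            harmonicity = np.max(ac[lo:hi]) / ac[0]
            # Onset strength: positive log-energy increase relative to the previous frame.
            onset = max(0.0, np.log(energy) - np.log(prev_energy))
            prev_energy = energy
            feats.append((harmonicity, onset))
        return np.array(feats)

    print(frame_features(np.random.randn(16000)).shape)    # (num_frames, 2)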

Although adults may not be able to learn to speak new languages without some remaining foreign accent, they are able to acquire representations of the acoustic and articulatory features of a new language at a level that is sufficient for effective communication. We shall investigate new approaches to representing the acoustic signal via context- and language-invariant sub-word representations. In bridging the gap between human and automatic processing we will focus on the computational processes that both humans and machines must perform. These representations will be derived from the acoustic stream without using a language model or the phonotactic and grammatical rules of a particular language, but may employ universal (language-independent) constraints on speech production. The goal will be achieved by defining and extracting a powerful new set of physiologically and psychophysically motivated, language- and task-independent acoustic descriptors (features) that reflect posterior probabilities of speech-specific (and language- and talker-independent) sub-word classes and features.
When trying to discover 'natural' units for representing speech, we expect to find a complex, hierarchically organised set of units, some the size of full syllables, others the size of sub-phonemic phenomena. Words and phrases can be composed of these basic units in many different ways, depending on speaker, speaking style, prosody, etc. While the highest levels of the hierarchical representation encode the invariant linguistic features of words and phrases, the lower levels encode speaker, style and context effects. To process such a complex hierarchical representation efficiently we will need some kind of associative memory, in which partial representations of the input signal at the lower levels activate representations of larger linguistic units, which can then be verified with limited computational effort. Interestingly, a representation that models speech in terms of multi-level units will affect speech synthesis and recognition alike.
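
A deliberately reduced Python sketch of this kind of associative lookup is given below; the sub-word inventory, the activation counts and the coverage-based verification threshold are invented purely to illustrate partial evidence activating larger units that are then cheaply verified:

    from collections import defaultdict

    # Toy associative memory: sub-word units index the larger units that contain them.
    INVENTORY = {"hello": ["hh", "eh", "low"], "yellow": ["yeh", "low"], "hollow": ["hh", "low"]}

    index = defaultdict(set)
    for word, units in INVENTORY.items():
        for unit in units:
            index[unit].add(word)

    def activate(observed_units, threshold=0.6):
        """Activate words sharing units with the input, then verify cheaply."""
        scores = defaultdict(int)
        for unit in observed_units:
            for word in index[unit]:
                scores[word] += 1
        # Verification step: keep words whose units are sufficiently covered by the input.
        return {w: s / len(INVENTORY[w]) for w, s in scores.items()
                if s / len(INVENTORY[w]) >= threshold}

    print(activate(["hh", "low"]))    # approximately {'hello': 0.67, 'hollow': 1.0}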

To reach the targets sketched in the previous paragraphs we will conduct the following research projects:
1. AHSR-1: Towards Open Vocabulary Speech Recognition;
2. AHSR-2: Data Association Multisource Acoustic Models;
3. AHSR-3: Sounds and Spoken Language;
4. AHSR-4: Associative Memories for Learning and Decoding Speech.

Research Theme 3: Bridging the Gap between Signal Processing and Learning (SPL)


The acoustic models of current ASR systems are based on trainable statistical models, usually HMMs. The model parameters are estimated using automatic machine learning techniques, such as the EM algorithm, whereas the signal processing component of most systems is essentially non-adaptive. We propose to investigate adaptive learning approaches to signal processing and to its interface with statistical acoustic models. By dividing the primary signal processing level into multiple, possibly heterogeneous parts, through parallel recording channels, different feature extraction methods, or specific signal-oriented modelling, learning could be introduced into signal processing. Learning would then be concerned with separating sources of variability (e.g. localization using multiple microphone systems), with the integration of signal properties from different features (feature and/or model combination), or with providing models that can capture information for which HMMs are not well suited (conditional random fields, multistream approaches).
Multiple microphone sensors provide a way to localize and enhance acoustic sources. We are particularly interested in algorithms for improved acoustic signal capture from uncalibrated microphone arrays, with the goal of better automatic recognition. This goal requires minimizing the detrimental effects of reverberation and background noise. Most current approaches to this challenge, such as superdirective beamforming, do not frame the problem in an adaptive learning setting. We plan to investigate two basic approaches to adaptive signal processing in this context. Firstly, it is well known that human speech, viewed in either the time or frequency domain, is a non-Gaussian signal. Conventional beamforming, however, treats all signals, both sources and interferences, as Gaussian. We plan to develop beamforming techniques, specifically tailored to the automatic recognition of far-field speech data, that explicitly exploit this non-Gaussian nature to distinguish desired speech sources from noise and other interference, such as reverberation. Secondly, most current approaches to microphone array processing assume some knowledge of the array geometry; we propose to investigate approaches based on ad-hoc collections of sensors with unknown positions, both relative and absolute.
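
As a point of reference for the first of these directions, the following Python sketch implements a bare-bones frequency-domain delay-and-sum beamformer and scores its output with excess kurtosis, one possible measure of non-Gaussianity; the steering delays, sampling rate and the choice of kurtosis as the criterion are illustrative assumptions, not the beamformer we will ultimately develop:

    import numpy as np

    # Sketch: steer an array by applying per-channel delays, then score the
    # output with excess kurtosis as a simple non-Gaussianity measure.
    def delay_and_sum(channels, delays_s, sr=16000):
        """channels: (num_mics, num_samples); delays_s: per-mic steering delays in seconds."""
        num_mics, n = channels.shape
        spectra = np.fft.rfft(channels, axis=1)
        freqs = np.fft.rfftfreq(n, d=1.0 / sr)
        # Phase shifts that time-align every channel towards the look direction.
        shifts = np.exp(-2j * np.pi * freqs[None, :] * np.asarray(delays_s)[:, None])
        return np.fft.irfft(np.mean(spectra * shifts, axis=0), n=n)

    def excess_kurtosis(x):
        x = x - np.mean(x)
        return np.mean(x ** 4) / (np.mean(x ** 2) ** 2 + 1e-12) - 3.0

    mics = np.random.randn(4, 16000)                 # stand-in for four far-field channels
    out = delay_and_sum(mics, delays_s=[0.0, 1e-4, 2e-4, 3e-4])
    print(out.shape, excess_kurtosis(out))           # real speech would typically score > 0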

With current technology, ASR systems using far-field sensors typically have an error rate two to three times as high as that obtained with a close-talking microphone. The final goal of our research on speech processing with multiple microphones will be to bridge the gap between the performance attainable with such far-field sensors and that attainable with close-talking microphones.

Different ASR systems usually have different strengths and deficiencies, leading to complementary results: they usually do not produce the same errors, or errors at the same positions. System combination can therefore, in principle, exploit the individual strengths of different systems. It has been shown that combining different systems with methods such as ROVER, Confusion Network Combination (CNC), Discriminative Model Combination (DMC), or cross-system speaker adaptation consistently leads to improved performance. In SCALE, we propose a systematic study of a variety of approaches to system combination. The aim is to find further approaches with improved automatic selection capabilities, in order to reduce the empirical optimization effort during system combination.
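
To make the combination idea concrete, the Python sketch below performs highly simplified ROVER-style voting over word hypotheses that are assumed to be already aligned slot by slot; real ROVER obtains this alignment by dynamic programming and can also weight votes by confidence scores, neither of which is shown here:

    from collections import Counter

    # Simplified ROVER-style combination: majority voting per aligned slot.
    def vote(aligned_hypotheses):
        """aligned_hypotheses: list of equal-length word lists, one per system."""
        combined = []
        for slot in zip(*aligned_hypotheses):
            counts = Counter(w for w in slot if w is not None)   # None marks a deletion
            combined.append(counts.most_common(1)[0][0] if counts else None)
        return [w for w in combined if w is not None]

    sys_a = ["the", "cat", "sat", None]
    sys_b = ["the", "cap", "sat", "down"]
    sys_c = ["a",   "cat", "sat", "down"]
    print(vote([sys_a, sys_b, sys_c]))    # ['the', 'cat', 'sat', 'down']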

The individual research projects falling within this research theme are:
1. SPL-1: Non-Gaussian Beamforming for Far-Field ASR;
2. SPL-2: Particle Filters for Robust Far-Field ASR;
3. SPL-3: Multi-Channel Modelling for Automatic Speech Recognition;
4. SPL-4: Investigations on Feature and System Combination Methods;
5. SPL-ER-1: Multiple Microphone Techniques for Dereverberation;
6. SPL-ER-2: Multi-Microphone Beamforming and Noise Reduction Using Auditory Processing.