A short introduction to speech recognition
[by Olivier Deroo]
Automatic speech recognition (ASR) is useful as a multimedia browsing tool: it allows us to easily search and index recorded audio and video data. Speech recognition is also useful as a form of input, especially when someone's hands or eyes are busy. It allows people working in active environments such as hospitals to use computers, and it allows people with disabilities such as blindness or palsy to use them as well. Finally, although everyone knows how to talk, not everyone knows how to type; with speech recognition, typing would no longer be a necessary skill for using a computer. If it could ever be combined successfully with natural language understanding, it would make computers accessible to people who don't want to learn the technical details of using them.
In 1994, IBM was the first company to commercialize a dictation system based on speech recognition, and speech recognition has since been integrated into many applications.
Many improvements have been made over the last 50 years, but computers are still not able to understand every single word pronounced by everyone. Speech recognition remains a very challenging problem.
There are quite a lot of difficulties. The main one is that two speakers uttering the same word will say it very differently from each other. This problem is known as inter-speaker variation (variation between speakers). In addition, the same person does not pronounce the same word identically on different occasions; this is known as intra-speaker variation. Even consecutive utterances of the same word by the same speaker will differ. A human would not be confused by this, but a computer might be. The waveform of a speech signal also depends on the recording conditions (noise, reverberation, etc.). Noise and channel distortions are very difficult to handle, especially when there is no a priori knowledge of the noise or the distortion.
A speech recognition system can be used in many different modes: speaker-dependent or speaker-independent, isolated or continuous speech, and small, medium, or large vocabulary.
Speaker-Dependent / Speaker-Independent Systems
A speaker-dependent system must be trained on a specific speaker in order to recognize accurately what has been said. To train such a system, the speaker is asked to record predefined words or sentences, which are analyzed, and the analysis results are stored. This mode is mainly used in dictation systems, where a single speaker uses the recognizer. By contrast, speaker-independent systems can be used by any speaker without any training procedure. These systems are used in applications where a training stage is not possible (telephony applications, typically). Accuracy in speaker-dependent mode is, unsurprisingly, better than in speaker-independent mode.
Isolated Word Recognition
This is the simplest speech recognition mode and the least demanding in terms of CPU. Each word is surrounded by silence, so word boundaries are well known: the system does not need to find the beginning and the end of each word in a sentence. The word is compared to a list of word models, and the model with the highest score is retained by the system. This kind of recognition is mainly used in telephony applications to replace traditional DTMF methods.
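To make the matching step concrete, here is a minimal Python sketch of template-based isolated word recognition using dynamic time warping (DTW), a classic technique for this mode; the vocabulary and feature sequences are purely illustrative, and real systems today use the statistical models discussed later.

import numpy as np

def dtw_distance(a, b):
    # Dynamic time warping distance between two feature sequences,
    # each a 2-D array of shape [frames, coefficients].
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            # Best predecessor: vertical, horizontal, or diagonal step.
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]

def recognize(utterance, templates):
    # Retain the word whose stored template is closest to the input
    # (lowest DTW distance, i.e., highest matching score).
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))

With templates such as {"yes": yes_features, "no": no_features} (each value a [frames, coefficients] array recorded in advance, hypothetical names here), recognize(input_features, templates) returns the closest vocabulary word.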
Continuous Speech Recognition
Continuous speech recognition is much more natural and user-friendly: the computer recognizes a whole sequence of words in a sentence. But this mode requires much more CPU and memory, and recognition accuracy is markedly lower than in isolated word mode. Why is continuous speech recognition more difficult than isolated word recognition?
Some possible explanations are:
- word boundaries are not marked by silence and must be found by the recognizer;
- adjacent words affect each other's pronunciation (coarticulation across word boundaries);
- there is more variation in stress and intonation (interaction between vocal tract and excitation).
Keyword Spotting
This mode was created to bridge the gap between isolated word and continuous speech recognition. Recognition systems based on keyword spotting are able to identify, in a sentence, a word or group of words corresponding to a particular command. Consider, for example, a virtual kiosk that gives customers directions to departments in a supermarket. There are many different ways of asking for this kind of information; one possibility is "Hello, can you please give me the way to the television department". The system should be able to extract the important word "television" from the sentence and give the associated information to the customer.
The size of the available vocabulary is another key point in speech recognition applications. Clearly, the larger the vocabulary, the more opportunities the system has to make errors. A good speech recognition system will therefore make it possible to adapt its vocabulary to the task it is currently assigned to (i.e., possibly enable dynamic adaptation of its vocabulary). Usually we classify the difficulty level according to Table 1, with a score from 1 to 10, where 1 is the simplest system (speaker-dependent, able to recognize isolated words from a small vocabulary of 10 words) and 10 corresponds to the most difficult task (speaker-independent continuous speech over a large vocabulary of, say, 10,000 words). State-of-the-art speech recognition systems with acceptable error rates are somewhere in between these two extremes.
Table 1: Classification of speech recognition mode difficulties.
The error rates commonly obtained on speaker-independent isolated word databases are around 1% for a 100-word vocabulary, 3% for 600 words, and 10% for 8,000 words [DER98]. For speaker-independent continuous speech recognition, error rates are around 15% with a trigram language model and a 65,000-word vocabulary [YOU97].
The Speech Recognition Process
The speech recognition process can be divided into several components, illustrated in Figure 4.
Fig. 4: The speech recognition process.
Note that the first block, which consists of the acoustic environment plus the transduction equipment (microphone, preamplifier, filtering, A/D converter), can have a strong effect on the generated speech representations. For instance, additive noise, room reverberation, microphone position, and type of microphone can all be associated with this part of the process.
The second block, the feature extraction subsystem, is intended to deal with these problems, as well as to derive acoustic representations that are both good at separating classes of speech sounds and effective at suppressing irrelevant sources of variation.
The next two blocks in Figure 4 illustrate the core acoustic pattern matching operations of speech recognition. In nearly all ASR systems, a representation of speech, such as a spectral or cepstral representation, is computed over successive intervals, e.g., 100 times per second. These representations or speech frames are then compared to the spectra or cepstra of frames that were used for training, using some measure of similarity or distance. Each of these comparisons can be viewed as a local match. The global match is a search for the best sequence of words (in the sense of the best match to the data), and is determined by integrating many local matches. The local match does not typically produce a single hard choice of the closest speech class, but rather a group of distances or probabilities corresponding to possible sounds. These are then used as part of a global search or decoding to find an approximation to the closest (or most probable) sequence of speech classes, or ideally to the most likely sequence of words. Another key function of this global decoding block is to compensate for temporal distortions that occur in normal speech. For instance, vowels are typically shortened in rapid speech, while some consonants may remain nearly the same length.
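As an illustration of this frame-based front end, the following sketch computes a short-term log power spectrum 100 times per second (25 ms analysis window, 10 ms hop at 16 kHz); these sampling and windowing values are conventional choices, not values prescribed by the text.

import numpy as np

def log_spectral_frames(signal, rate=16000, win_ms=25, hop_ms=10):
    # Short-term log power spectrum: one feature vector per frame,
    # computed 100 times per second (25 ms window, 10 ms hop).
    win = int(rate * win_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    window = np.hamming(win)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        frames.append(np.log(power + 1e-10))  # small floor avoids log(0)
    return np.array(frames)  # shape: [n_frames, win // 2 + 1]

Each row of the returned array is one speech frame, i.e., one feature vector entering the local match; in practice mel filtering and a cosine transform would follow to produce cepstral features, but the frame-based structure is the same.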
The recognition process is based on statistical models, Hidden Markov Models (HMMs) [RAB89,RAB93], which are now widely used in speech recognition. A hidden Markov model is typically defined (and represented) as a stochastic finite state automaton (SFSA), built up from a finite set of possible states, each state being associated with a specific probability distribution (or probability density function, in the continuous case).
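The global decoding described above is usually carried out with the Viterbi algorithm, which finds the most likely state sequence through such an automaton. A minimal sketch, assuming log-domain transition and emission scores and a model that starts in state 0:

import numpy as np

def viterbi(log_trans, log_emit):
    # Most likely state sequence of an HMM.
    # log_trans: [S, S] log transition probabilities between states.
    # log_emit:  [T, S] log emission score of each frame in each state.
    T, S = log_emit.shape
    delta = np.full((T, S), -np.inf)  # best log score ending in each state
    back = np.zeros((T, S), dtype=int)  # predecessor states for backtrace
    delta[0, 0] = log_emit[0, 0]  # assumed start in state 0
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_trans[:, s]
            back[t, s] = np.argmax(scores)
            delta[t, s] = scores[back[t, s]] + log_emit[t, s]
    # Backtrack from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]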
Several authors [RIC91,BOU94] have shown that the outputs of
artificial neural networks (ANNs) used in classification mode can be interpreted
as estimates of posterior probabilities of output classes conditioned on the
input. It has thus been proposed to combine ANNs and HMMs into what is now
referred to as hybrid HMM/ANN speech recognition systems.
ANN estimation of probabilities does not require detailed assumptions about the form of the statistical distribution to be modeled, resulting in more accurate acoustic models.
For the ANN estimator, multiple inputs can be used, taken from a range of speech frames, and the network will learn something about the correlation between the acoustic inputs. This is in contrast with more conventional approaches, which assume that successive acoustic vectors are uncorrelated (an assumption that is clearly wrong).
ANNs can easily accommodate discriminant training, that is: at training time, speech frames which characterize a given acoustic unit are used to train the corresponding HMM to recognize these frames, and to train the other HMMs to reject them. Of course, as currently done in standard HMM/ANN hybrids, discrimination is only local (at the frame level). Still, this discriminant training option is clearly closer to how humans recognize speech.
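In such hybrids, the network's posterior estimates are typically converted into scaled likelihoods by dividing by the class priors before being used as HMM emission scores [RIC91,BOU94]. A minimal sketch of that conversion (the epsilon floor is just a numerical safeguard):

import numpy as np

def scaled_log_likelihoods(posteriors, priors):
    # posteriors: [frames, classes] ANN outputs estimating P(q | x).
    # priors:     [classes] relative class frequencies from training data.
    # By Bayes' rule, P(q | x) / P(q) = p(x | q) / p(x): dividing the
    # posterior by the prior gives a likelihood scaled by a
    # state-independent factor, which the Viterbi search can use directly.
    eps = 1e-10
    return np.log(posteriors + eps) - np.log(priors + eps)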
Current Research in Speech Recognition
Over the last decade there have been many research efforts to improve speech recognition systems. The most common ones can be classified into the following areas: robustness against noise, improved language models, multilinguality, and data fusion and multi-stream processing.
Robustness Against Noise
Many research laboratories have shown an increasing interest in robust speech recognition, where robustness refers to the need to maintain good recognition accuracy even when the quality of the input speech is degraded. As spoken language technologies are transferred more and more to real-life applications, the need for greater robustness against noisy environments is becoming increasingly apparent. The performance degradation in noisy real-world environments is probably the most significant factor limiting the take-up of ASR technology. Noise considerably degrades the performance of speech recognition systems, even for quite easy tasks such as recognizing a sequence of digits in a car environment. A typical degradation of performance on this task can be observed in Table 2.
Table 2: Word Error Rate on the Aurora 2 database.
With short-term (frame-based) frequency analysis, even when only a single frequency component is corrupted (e.g., by selective additive noise), the whole feature vector provided by the feature extraction phase in Figure 4 is generally corrupted, and the performance of the recognizer is typically severely impaired.
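This effect is easy to reproduce: a cepstral feature vector is a linear transform of the whole log spectrum, so corrupting a single spectral bin perturbs every coefficient. A toy illustration (the spectrum here is random stand-in data):

import numpy as np

rng = np.random.default_rng(0)
log_spectrum = rng.normal(size=64)  # one frame's log power spectrum (toy data)

corrupted = log_spectrum.copy()
corrupted[10] += 5.0                # narrowband noise hits a single bin

# The real cepstrum is a linear transform of the whole log spectrum,
# so the single corrupted bin changes every cepstral coefficient.
clean_cepstrum = np.fft.irfft(log_spectrum)
noisy_cepstrum = np.fft.irfft(corrupted)
print(np.count_nonzero(np.abs(clean_cepstrum - noisy_cepstrum) > 1e-12))
# prints 126: all output coefficients differ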
Improved Language Models
Other research aims to improve language models, which are also a key component of speech recognition systems. The language model is the component which incorporates the syntactic constraints of the language. Most state-of-the-art large vocabulary speech recognition systems make use of statistical language models, which are easily integrated with the other system components. Most probabilistic language models are based on the empirical paradigm that a good estimate of the probability of a linguistic event can be obtained by observing this event on a large enough text corpus. The most commonly used models are n-grams, where the probability of a sentence is estimated from the conditional probabilities of each word or word class given the n-1 preceding words or word classes. Such models are particularly interesting since they are both robust and efficient, but they are limited to modeling only local linguistic structure. Bigram and trigram language models are widely used in speech recognition systems (dictation systems, for example).
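As a concrete (if toy) illustration, here is a maximum-likelihood bigram estimator; real systems add smoothing so that unseen word pairs do not receive zero probability, which this sketch omits.

from collections import Counter

def train_bigram(sentences):
    # Maximum-likelihood bigram model: P(w2 | w1) estimated from counts
    # on a text corpus, with <s> marking sentence starts.
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

p = train_bigram(["the cat sat", "the dog sat", "the cat ran"])
print(p("the", "cat"))  # 2/3: "cat" follows "the" in two of three cases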
Data Fusion and Multi-Stream Processing
Many researchers have shown that improvements can be obtained by combining multiple speech recognition systems or by combining the data extracted from multiple recognition processes. Sustained incremental improvements, based on applying statistical techniques to ever larger amounts of (differently annotated) data, can be expected in the coming years. It may also be interesting to describe the speech signal in terms of several information streams, each stream resulting from a particular way of analyzing the speech signal [DUP97]. For example, models aimed at capturing syllable-level temporal structure could be used in parallel with classical phoneme-based models. Another potential application of this approach is the dynamic merging of asynchronous temporal sequences (possibly with different frame rates), such as visual and acoustic inputs.
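One simple fusion recipe, assuming the two streams are frame-synchronous (the harder asynchronous case mentioned above needs more machinery), is a weighted log-linear combination of per-frame class posteriors; the weight here is a tuning parameter, not a value from the text.

import numpy as np

def combine_streams(log_post_a, log_post_b, weight=0.5):
    # log_post_a, log_post_b: [frames, classes] log posteriors from two
    # streams (e.g., acoustic and visual), assumed frame-synchronous.
    combined = weight * log_post_a + (1.0 - weight) * log_post_b
    # Renormalize so each frame's posteriors again sum to one.
    combined -= np.logaddexp.reduce(combined, axis=1, keepdims=True)
    return combined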
Multilingual Speech Recognition
Addressing multilinguality is very important in speech recognition: a system able to recognize multiple languages is much easier to bring to market than a system that addresses only one language. Language identification consists in detecting which language is spoken, which makes it possible to select the right acoustic and language models. Many research laboratories have tried to build systems that address this problem, with some success (both the Center for Spoken Language Understanding, Oregon, and our laboratory are able to recognize the language in a 10-second speech chunk with an accuracy of about 80%). Another alternative could be to use language-independent acoustic models, but this is still at the research stage.
References
[BOI00] R. Boite, H. Bourlard, T. Dutoit, J. Hancq, H. Leich, 2000. Traitement de la parole. Presses polytechniques et universitaires romandes, Lausanne, Switzerland, ISBN 2-88074-388-5, 488 pp.