Technology
Voice recognition technology is complex and, until this decade,
was too expensive for most computer applications.
The cornerstone of voice recognition technology is an electronic
device called an analog-to-digital converter (ADC). The ADC
converts the analog speech waveform into a series of digital
signals. These signals are stored as a template that carries the
time-based characteristics of each spoken word. Unknown words are
recognized by comparing them with the templates of words held in
storage.
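The comparison step can be sketched as follows. This is a minimal illustration, not a real recognizer: the templates, signal values, and function names are all invented for the example, and each "signal" is reduced to a short fixed-length list of digitized amplitudes compared by Euclidean distance.

```python
import math

# Hypothetical stored templates: each word maps to a fixed-length
# series of digital signal values (the output of the ADC).
templates = {
    "yes": [0.1, 0.9, 0.4, 0.2],
    "no":  [0.8, 0.1, 0.7, 0.3],
}

def distance(a, b):
    """Euclidean distance between two equal-length digital signals."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(unknown):
    """Return the stored word whose template is closest to the input."""
    return min(templates, key=lambda w: distance(templates[w], unknown))

print(recognize([0.2, 0.8, 0.5, 0.2]))  # closest to the "yes" template
```

Real systems compare feature vectors extracted from the waveform rather than raw samples, and use elastic matching to cope with variable speaking rates, but the nearest-template principle is the same.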
The pattern-matching technique requires a lexicon of templates
built from phonetic characterizations of the words. Knowledge of
grammar places constraints on permissible word sequences and
affects the probability of words occurring in series. The
characteristic features of the spoken word are identified first;
the pattern-matching algorithm then searches all grammatically
possible word sequences for the one with the highest probability
of generating the observed features.
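The search described above can be sketched with a toy example. All probabilities, words, and function names here are invented for illustration: a bigram grammar scores each word pair, an acoustic score rates how well each spoken segment matches each candidate word, and the search keeps the sequence with the highest combined probability.

```python
from itertools import product

# Illustrative bigram grammar: probability of the second word
# following the first ("<s>" marks the start of the utterance).
bigram = {
    ("<s>", "call"): 0.6, ("<s>", "all"): 0.4,
    ("call", "home"): 0.7, ("call", "Rome"): 0.3,
    ("all", "home"): 0.2,  ("all", "Rome"): 0.8,
}

# Acoustic scores: probability that each spoken segment matches each word.
acoustic = [
    {"call": 0.5, "all": 0.5},   # first segment
    {"home": 0.6, "Rome": 0.4},  # second segment
]

def best_sequence():
    """Score every grammatically possible sequence; return the most probable."""
    best, best_p = None, 0.0
    for words in product(*[seg.keys() for seg in acoustic]):
        p, prev = 1.0, "<s>"
        for seg, w in zip(acoustic, words):
            p *= bigram.get((prev, w), 0.0) * seg[w]
            prev = w
        if p > best_p:
            best, best_p = words, p
    return best, best_p

print(best_sequence())  # ('call', 'home') wins: grammar breaks the acoustic tie
```

Note how the grammar decides the outcome: the acoustics alone cannot distinguish "call" from "all", but "call home" is the more probable word series.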
Hidden Markov Models (HMMs) are the most widely used statistical
models for speech recognition. An HMM associates two sets of
probabilities with each state: transition probabilities, which
govern moving from one state to the next, and emission
probabilities, which give the likelihood of observing particular
acoustic features in that state. Recognition searches for the
state sequence most likely to have produced the observed speech.
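A minimal sketch of this search, assuming a toy two-state HMM with invented probabilities and observation symbols ("a", "b") standing in for acoustic features, is the Viterbi algorithm: at each step it keeps, for every state, the most probable path ending there.

```python
# Toy HMM: two states, each with transition and emission probabilities.
states = ["s1", "s2"]
start = {"s1": 0.8, "s2": 0.2}
trans = {("s1", "s1"): 0.6, ("s1", "s2"): 0.4,
         ("s2", "s1"): 0.3, ("s2", "s2"): 0.7}
emit = {("s1", "a"): 0.7, ("s1", "b"): 0.3,
        ("s2", "a"): 0.2, ("s2", "b"): 0.8}

def viterbi(observations):
    """Return the most probable state path for a sequence of symbols."""
    # prob[s] = probability of the best path ending in state s
    prob = {s: start[s] * emit[(s, observations[0])] for s in states}
    paths = {s: [s] for s in states}
    for obs in observations[1:]:
        new_prob, new_paths = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prob[p] * trans[(p, s)])
            new_prob[s] = prob[best_prev] * trans[(best_prev, s)] * emit[(s, obs)]
            new_paths[s] = paths[best_prev] + [s]
        prob, paths = new_prob, new_paths
    return paths[max(states, key=lambda s: prob[s])]

print(viterbi(["a", "b", "b"]))  # ['s1', 's2', 's2']
```

In a real recognizer the states model sub-word units and the emissions are distributions over acoustic feature vectors, but the dynamic-programming search is the same.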
Speech is made up of discrete units. For the purposes of the
analog-to-digital conversion the unit can be a word, a syllable
or a phoneme. A phoneme is a single characteristic sound of a
language, e.g. the 'c' sound in the word 'cat'. An utterance is a
series of phonemes with stress, duration and intonation. The
design choice of the base speech unit characterizes the speech
recognition system. An English-speaking adult might have a
vocabulary of as many as 100,000 words, 20,000 syllables, 2,500
diphones and, depending on accent, 40 to 50 phonemes (Pelton,
1993, p. 93). Each stored unit of speech includes details of the
characteristics that differentiate it from the others. The amount
of storage required and the processing time are functions of the
number of speech units.
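The scale of that trade-off can be made concrete with rough arithmetic. The unit counts are those quoted above from Pelton (1993); the bytes-per-template figure is purely an assumption for illustration.

```python
# Unit counts quoted above (Pelton, 1993); template size is assumed.
unit_counts = {"words": 100_000, "syllables": 20_000,
               "diphones": 2_500, "phonemes": 50}
BYTES_PER_TEMPLATE = 2_000  # assumed size of one stored template

def storage_mb(n_units, bytes_per_template=BYTES_PER_TEMPLATE):
    """Approximate template storage in megabytes for n_units speech units."""
    return n_units * bytes_per_template / 1e6

for unit, n in unit_counts.items():
    print(f"{unit:10s}: {storage_mb(n):8.1f} MB")
```

Under these assumptions a word-based vocabulary needs thousands of times the storage of a phoneme inventory, which is why large-vocabulary systems favor sub-word units.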
A speech recognition system must cope with differences in accent,
dialect, age, gender, emotional state, rate of speech and
environmental noise. The different systems are classified
according to the methodologies used to meet these challenges:
speaker-dependent systems with discrete or continuous speech
recognition, and speaker-independent systems with discrete or
continuous speech recognition.

The different voice recognition systems are also characterized by
the training they require. Speaker-independent systems require no
training; typically they have limited vocabularies and are
dedicated to a single task. Speaker-dependent systems require
training for each user: word-based recognition systems must be
trained on every word in the vocabulary, while phoneme-based
systems are trained on a large number of representative
sentences. High-performance systems take contextual word meanings
into account, require training and incorporate a sub-word speech
unit.
Bibliography
Doe, Hope L. (1998). Evaluating the Effects of Automatic Speech
Recognition Word Accuracy. Master's thesis, Virginia Polytechnic
Institute and State University, Blacksburg, VA, USA.
Pelton, Gordon (1993). Voice Processing. McGraw-Hill, San
Francisco, CA, USA. ISBN 0-07-049309-X.
Russell, Stuart, and Peter Norvig (1995). Artificial
Intelligence: A Modern Approach. Prentice-Hall, Upper Saddle
River, NJ, USA. ISBN 0-13-103085-2.
Schmandt, Christopher (1994). Voice Communications with
Computers. International Thomson Publishing, New York, NY, USA.
ISBN 0-442-23935-1.