Technology

Voice recognition technology is complex and, until this decade, was too expensive for most computer applications. The cornerstone of voice recognition technology is an electronic device called an analog-to-digital converter (ADC). The ADC converts the analog speech waveform into a series of digital signals. These signals are stored as a template that carries the time-based characteristics of each spoken word. An unknown word is recognized by comparing it with the word templates held in storage.
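
The comparison of an unknown word against stored templates can be sketched as follows. The text does not name a matching method, so this sketch assumes dynamic time warping, a common way to compare sequences spoken at different speeds. The function names and the template values are hypothetical and chosen only for illustration.

```python
from math import inf

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    # cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed alignment steps.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(unknown, templates):
    """Return the word whose stored template is closest to the unknown sequence."""
    return min(templates, key=lambda word: dtw_distance(unknown, templates[word]))

# Hypothetical templates: made-up digitized values, not real speech data.
templates = {
    "yes": [1, 3, 4, 3, 1],
    "no": [5, 5, 2, 1, 1],
}
unknown = [1, 1, 3, 4, 4, 3, 1]  # the "yes" pattern spoken more slowly
print(recognize(unknown, templates))
```

The warping step lets a slowly spoken word still match its template, which a sample-by-sample comparison would not allow.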

The pattern matching technique requires a lexicon of templates built from phonetic characterizations of the words. Knowledge of grammar places constraints on the sequences of words and affects the probability of words occurring in series. First, the characteristic features of the spoken word are identified. Then the pattern matching algorithm searches all the grammatically possible word sequences for the word with the highest probability of generating those features. Hidden Markov Models (HMMs) are the most widely used statistical models for speech recognition. An HMM represents each entry in the vocabulary as a sequence of states, and each state carries two sets of probabilities. The first set gives the probability of continuing on to the next state, and the second set gives the probability that the state generated the observed features.
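
The idea of picking the word most likely to have generated the observed features can be sketched with the forward algorithm, a standard way to score an observation sequence against an HMM. This is a minimal sketch under stated assumptions: the two word models, their probabilities and the two-symbol feature alphabet are hypothetical, and the grammar constraints described above are left out.

```python
def forward_likelihood(obs, start, trans, emit):
    """Probability that an HMM generates the observation sequence."""
    n = len(start)
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
            for s in range(n)
        ]
    return sum(alpha)

# Hypothetical two-state, left-to-right models: (start, transition, emission).
# trans[p][s] is the probability of moving from state p to state s.
# emit[s][o] is the probability that state s produces feature symbol o.
models = {
    "word_a": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]]),
    "word_b": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.1, 0.9], [0.8, 0.2]]),
}

obs = [0, 0, 1, 1]  # made-up sequence of quantized feature symbols
scores = {word: forward_likelihood(obs, *model) for word, model in models.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

Each model is scored independently and the highest-scoring word is chosen. A practical recognizer would also weight each candidate by the probability that the grammar assigns to it.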

Speech is made up of discrete units. For the purposes of analog-to-digital conversion, the unit can be a word, a syllable or a phoneme. A phoneme is a single characteristic sound of a language, e.g. the 'c' sound in the word 'cat'. An utterance is a series of phonemes with stress, duration and intonation. The design choice of the base speech unit characterizes the speech recognition system. An English-speaking adult might have a vocabulary of as many as 100,000 words, 20,000 syllables, 2,500 diphonemes and, depending on accent, 40 to 50 phonemes (Pelton, 1993, p. 93). Each stored unit of speech includes details of the characteristics that differentiate it from the others. The amount of storage required and the processing time are functions of the number of speech units.
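
The effect of the base unit on storage can be illustrated with a rough calculation. The unit counts are those quoted from Pelton above, using the upper figure of 50 phonemes. The template size is a hypothetical placeholder, and the sketch assumes every template is the same size, which is a simplification, since a word template would normally be longer than a phoneme template.

```python
# Unit counts quoted in the text (Pelton, 1993, p. 93).
UNIT_COUNTS = {
    "word": 100_000,
    "syllable": 20_000,
    "diphoneme": 2_500,
    "phoneme": 50,
}

# Hypothetical template size in bytes, for illustration only.
BYTES_PER_TEMPLATE = 1_000

def template_storage_bytes(unit, bytes_per_template=BYTES_PER_TEMPLATE):
    """Storage needed to hold one template for every unit of the given kind."""
    return UNIT_COUNTS[unit] * bytes_per_template

for unit in UNIT_COUNTS:
    print(unit, template_storage_bytes(unit))
```

Under these assumptions the number of templates to store and search grows in direct proportion to the number of units, which is why the choice of base unit shapes the whole system.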

A speech recognition system must cope with differences in accent, dialect, age, gender, emotional state and rate of speech, as well as with environmental noise. Systems are classified according to the methods they use to attain these goals. There are speaker-dependent systems with discrete or continuous speech recognition and speaker-independent systems with discrete or continuous speech recognition.

Voice recognition systems are also characterized by the training they require. Speaker-independent systems require no training. Typically, these systems have limited vocabularies and are dedicated to a specific task. Speaker-dependent systems require training for each user. Word-based recognition systems require training on every word in the vocabulary. Phoneme-based systems require training on a large number of representative sentences. High-performance systems take contextual word meanings into account, require training and incorporate a sub-word speech unit.


Bibliography

Doe Hope L., Evaluating the Effects of Automatic Speech Recognition Word Accuracy, 1998, Master's Thesis, Virginia Polytechnic Institute and State University, Blacksburg, Va., USA.

Pelton Gordon, Voice Processing, 1993, ISBN 0-07-049309-X, McGraw-Hill, San Francisco, USA.

Russell Stuart and Norvig Peter, Artificial Intelligence: A Modern Approach, 1995, ISBN 0-13-103085-2, Prentice-Hall Inc., Upper Saddle River, New Jersey, USA.

Schmandt Christopher, Voice Communications with Computers, 1994, ISBN 0-442-23935-1, International Thomson Publishing, New York, New York, USA.
