[Back] [Contents Page] [Next]


Chapter 1 Introduction



1.1 Introduction

In its most general form, speech recognition is the conversion of an acoustic waveform into the text equivalent of the information being conveyed by the spoken word. The ultimate goal in most research into speech recognition systems has been to develop a machine that has the ability to understand continuous speech with an unlimited vocabulary. This goal remains, as yet, unfulfilled although major developments have been made in the field of speech recognition.

There are many advantages to the use of speech which can be gained with the development of effective recognition systems which can be used for communication between man and machines. When speech communication is used problems can usually be solved faster, the hands and eyes are left free, the user is able to move about the room and still remain in contact with the machine and communication can take place over the telephone without the need for additional equipment such as modems.

Man-machine communication has been dominated by typing which is an accurate but relatively slow method of communication. Writing is a faster form of communication for most people (i.e. untrained typists) but has a major disadvantage for machine recognition in that a machine would have to be able to recognize a large variation in form for most letters.

Speech has potential as the fastest form of man-machine communication. It is the most natural form of human communication and for the majority is learned from early childhood. It is also faster than both typing and handwriting although it shares a major disadvantage with handwriting recognition. There exists such a wide variation in the way different letters and words can be acceptably pronounced that the machine would have to be able to recognize all these variations to achieve the ultimate goal in speech recognition.


1.2 Overview Of Research In Speech Recognition

Initially it was believed that speech could be recognized using the information contained within the acoustic signal. It was realized that simple matching of acoustic patterns was limited since the same word, spoken on different occasions, differed in duration and intensity even when spoken by the same person. When the same word is spoken by different speakers the speech signal will also differ in frequency content. In the speech recognition systems that followed, spectral analysis was carried out on the speech patterns. Early attempts at speech recognition also failed to realize that for someone listening to speech they require linguistic knowledge to understand what is being said. There are many sources of linguistic knowledge and, in the 1970s, systems were being developed which used these to improve performance. An example is the HEARSAY system used for the recognition of connected speech [44]. One implementation was to recognize spoken moves in a game of chess. This system used the syntax and vocabulary of a language and the semantics of a task to recognize utterances in that language. In the case of the chess task the system could use a list of all the possible moves at the current time together with their likelihood to aid recognition.

The HEARSAY and HEARSAY II systems were the result of a 5 year program of research and development sponsored by the Advanced Research Projects Agency (ARPA) which was begun with the aim of tackling the problems encountered in recognizing continuous speech [24]. Another system resulting from this program was the SPEECHLIS speech understanding system which used syntactic, semantic, pragmatic and lexical knowledge interaction for continuous speech understanding [62]. Other systems emerging from this program were the DRAGON system which used a Markov approach [2] and HARPY which merged parts of the DRAGON and HEARSAY systems and was the most successful system. The HARPY system best met the specifications of the ARPA program [24]. The problem with these systems was that they required large computational power at a time when this was expensive [32].

Since the mid 1970s research in the field of speech recognition has increased dramatically. Many mathematical based algorithms have been developed and a number of speech recognition systems are now commercially available. The greatest success has been achieved using pattern recognition approaches to speech recognition by machine. This approach is based on the idea that if a system has "seen" a pattern then it can recognize it. These systems are trained to recognize patterns and require an adequate training set and a suitable pattern recognition model.
Template matching techniques match the input pattern to a set of templates and the category to which the input pattern belongs is determined using a similarity measure between the template and the input pattern. A major advance was made with the introduction of dynamic programming algorithms introduced initially in 1970. Dynamic programming (DP) is a mathematical concept and the theory behind it is based on the principle of optimality which is a simple property of multi-stage decision processes and can be stated as follows:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions constitute an optimal policy with regard to the state resulting from the first decision [52].

DP algorithms can be used to eliminate timing differences between two speech patterns by warping the time axis of one speech pattern so that the two are optimally aligned with each other. Recognition of a speech pattern can be achieved by warping the input speech pattern to reference patterns from each category to which the input pattern could belong. A decision is then made as to which category the input speech pattern belongs to by assigning it to the category of the reference pattern with which it achieves maximum coincidence. This method is also known as dynamic time warping (DTW) due to the warping function performed on the speech pattern time axis.

The power of DTW for speech recognition was demonstrated by the work of Sakoe and Chiba which was initially published in the early 1970s. Itakura modified their work and achieved 97.3% recognition rate on a database of 200 words spoken in isolation by one male speaker [22]. Sakoe and Chiba made minor modifications to their initial work including comparing symmetric and asymmetric forms of the algorithm and introducing a slope constraint on the warping function [49]. They consequently achieved 99.8% speaker dependent recognition on a database of the 10 Japanese digit words spoken in isolation and 99.2% on a database of 50 Japanese geographical names.

Sakoe extended the research to connected word recognition by employing a two level DP matching algorithm. Pattern matching was carried out by breaking it down into a word level matching step and a phrase level matching step [51]. No prior segmentation was carried out but rather decisions on the word boundaries were made in parallel with word category decisions. The same databases as Sakoe and Chiba used [49] were employed and a recognition rate of 99.6% was achieved.

The DP algorithms described assume that the initial and final frames of input speech pattern and the reference speech pattern are in exact time alignment. Rabiner, Rosenberg and Levinson modified the DTW algorithm to include the existence of uncertainty in the registration of the initial and final frames [42]. The application of a slope constraint to the warping function was also investigated in this work. A database of 10 words spoken in isolation by 100 speakers was used to test the speech recognition system. It was found that the removal of endpoint constraint leads to performance improvement but introduction of a slope constraint does not necessarily lead to improvement of the performance. Factors were given which could affect the performance of this method and further investigation was therefore recommended.

Rabiner and Schmidt applied the unconstrained DTW algorithm to connected word recognition where the the words were digits spoken over dialed up telephone lines [43]. Isolated word templates were employed and the system could be used as either a speaker dependent or speaker independent system. Recognition rates of between 98% to 99% were achieved on the individual digits and between 91% to 95% for strings of digits.

The same task was investigated by Myers and Rabiner, this time using a level building DTW algorithm [35], [36]. The level building DTW algorithm matches the input pattern string to stored digit reference templates. At each level, or point in the string, the accumulated distance, best candidate digits and back tracking pointers are fed back to start the recognition of the next digit. The string which has the minimum accumulated distance at the end is chosen as the best match. A recognition performance of 99.1% was achieved for individual digits and 95.4% for strings of digits.

An algorithm based on stochastic models is hidden Markov modeling (HMM) in which a generative model of each class is built. Therefore, in the HMM approach, the reference pattern is represented by a model as opposed to the pattern itself. An HMM consists of two parts; a finite state Markov chain and a finite set of probability distributions [41]. The finite state machine comprises of a number of states and the probabilities of transition from one state to another. The output probabilities consist of the probabilities that, for a particular state transition, a certain output will be observed. An observer will see the outputs produced but not the states and transitions as these are "hidden". Recognition of the input pattern is achieved by determining which model has the greatest probability of generating the output sequence observed. One method of finding the most likely state model is the viterbi algorithm which is a stochastic version of dynamic programming [52]. Some research has looked at speech recognition systems which utilize both HMM and DP algorithms. An example is Fissore et al. who used both methods in a large vocabulary isolated word recognition system [11].

At the time this work was begun, results with HMMs trained along a maximum likelihood criterion showed their discriminant properties were poor [3]. This made them unsuitable for vocabularies containing highly confusable sets of words such as the English alphabet which is the vocabulary of interest in this thesis. HMMs trained using the maximum mutual information criterion were providing better discrimination but were more complicated. Many new developments in HMMs have taken place since then and at the moment HMMs are the most widely used speech recognition tool in state-of-the-art systems. HMMs still have the disadvantage that using contextual information with them is difficult and requires large storage capacity. This problem is being tackled in current ongoing work using a hybrid HMM/Connectionist approach to continuous speech recognition [6][46].

In the 1980s multi-layer perceptrons (MLP) were recognized as being a potentially suitable speech recognition tool. An MLP, also known as a neural network, is a general pattern recognition mechanism which can learn to discriminate between different categories of speech signals by "seeing" examples during training. The development, theory and training of neural networks is covered in more detail in chapter 2 but a brief review of their use for speech recognition is given here.

Neural networks are comprised of a network of interconnected, basic processing units which perform like simple models of biological neurons. Great hope has been placed on neural networks to perform tasks such as speech recognition since the human brain has already achieved the ability to perform this task. Understanding of the nervous system and biological neurons is incomplete and it may be the case that neural networks are too simple to achieve the level of speech recognition of humans [25]. Current neural network architectures do offer good discriminative powers, the ability to deal with non-explicit knowledge and contextual information can easily be taken into account [3]. A large body of work has been carried out applying neural networks to both connected and isolated word recognition using many different approaches.

In order to take into account contextual information, the time delay neural network (TDNN) approach has been adopted in some research. An example of this work is Waibel et, al. who used TDNNs for the recognition of the phonemes "B", "D"and "G" [58], [59]. A 4 layer neural network trained with the back propagation algorithm was employed with the network inputs being 16 normalized Mel scale Cepstral coefficients at each time frame. At every interval of time, 15 frames were input to the neural network, that is, the current frame for that time period and the frames from the previous 14 time intervals. After training, recognition is then carried out by passing the input speech pattern over the input neurons. A large vocabulary of Japanese words, 5240 words altogether, were recorded and from these the phonemes to be recognized were extracted. An average recognition rate of 98.5% was attained on the test data by the TDNN whereas 93.7% was achieved by an HMM given the same recognition task. Therefore, the TDNN achieved much better performance than the HMM for this task.

Waibel, Sawai and Shikano used TDNNs in a speech recognition system where smaller networks were trained to learn phonemic subcategories [60]. These subcategory networks were then used to "grow" larger networks without any degradation in performance. Several techniques for growing these networks without degrading performance, incurring large training times or requiring more data, were proposed. Two subnetworks were trained, one to recognize the phonemes "B", "D" and "G", achieving 98.3%, and the other to recognize the phonemes "P", "T" and "K", achieving 98.7% performance. The neural network created from these subcomponent networks to recognize "B", "D", "G", "P", "T" and "K" achieved 98.6%. A network trained incrementally to recognize all the consonants achieved recognition rates of around 96%. These results were compared to those of a standard HMM which achieved 83.6% performance and an improved HMM which achieved 92.7% showing that the neural network performed significantly better.

Hampshire and Waibel used a TDNN for the recognition of the phonemes "B", "D" and "G" employing a number of enhancements including a new objective function [15]. This function was found to be a substantial enhancement proving to be less prone to overlearning and providing better generalization on the training data.

An alternative neural network approach is the recurrent neural network also referred to as a dynamic neural network [40]. In this method the contextual information does not need to be used explicitly at the input to the neural network. Instead, a feedback delay loop is added to each neuron in the hidden and output layers. Contextual information can be used explicitly by using a network like that of a TDNN and having feedback loops on the hidden and output layer neurons. Another alternative architecture for recurrent neural networks is to feedback the delayed hidden neuron output values to the input layer [3]. A modification to the error back propagation algorithm is required to allow the looped signals to reach a steady state [48], [40].

Other types of neural networks have proved attractive alternatives for speech recognition. Radial basis function networks are an example of such networks with the advantage that they may be trained much faster than standard back propagation networks. Renals and Rohwer compared the performance of radial basis function networks, standard back propagation networks and an HMM on the recognition of isolated phonemes [45]. Both neural network approaches outperformed the HMM. The radial basis function was trained in 4 minutes as opposed to the back propagation network which took 3 hours but the back propagation network achieved better performance and is much more robust than the radial basis function network.

The hidden control neural network (HCNN) uses a source model of the data to improve performance [28]. It effectively measures the distance in pattern space between the speech pattern to be recognized and a reference pattern generated by the class model.

Neural networks can also be used to implement the algorithms used in conventional speech recognition systems [20]. During unsupervised training for self organizing networks using Kohonen's algorithm, continuous valued input vectors are presented to the network sequentially and no desired output vectors are specified [25]. The weights will organize such that topologically close nodes become sensitive to similar inputs forming feature maps which associate different clusters with different output nodes [31]. This acts like a neural network implementation of a k-nearest neighbour classifier. A viterbi net is a neural network implementation of a viterbi decoder used in HMM recognizers [20].

Neural networks have also been applied to the problem of continuous speech recognition. Harrison and Fallside used a neural network approach for the recognition of phonemes in continuous speech using a two part system [16]. The first part was a 3 layer back propagation neural network which classified the spectral speech data into speech subunits. The second part consisted of a sequence classifier formed from neuron like components which recognizes phonemes from sequences of subunits. The database used consisted of 162 training words and 90 test words spoken by one speaker. The task was to recognize a subset of 14 phonemes of British English and an overall performance of 87% was achieved.

Another example of a neural network approach to continuous speech recognition is the speaker independent system developed by Franzini, Witbrock and Lee [12]. The task was to recognize strings of digits and the system comprised a 4 layer recurrent neural network and an HMM. The neural network produced hypotheses of the components of the utterance to be recognized. These were used as inputs to the HMM which made the final decision. A performance of 97% was achieved on the individual digit words and 91% for the complete sentences.

Some research has looked into the development of systems which incorporate neural networks and the more traditional algorithms used in speech recognition systems. Huang and Kuh used a neural network based system for the recognition of isolated words from a medium sized vocabulary. This system also included the use of feature maps which were trained using a k-means algorithm. These feature maps transformed the input data into binary valued vectors which were then used as inputs to a single layer back propagation neural network. Using more than one feature map partitions the database of possible word categories into two subdatabases. The performance for correctly partitioning the input to the correct subdatabase was 98% and the overall word recognition rate achieved was 93%.

Botros, Siddiqui and Deiro looked at another type of speech recognition system which combined HMMs and neural networks for isolated word recognition. An HMM is used to extract the states of the speech signal, then these and the spectral data extracted from the input speech data are introduced to a neural network. The neural network is trained for every word to be recognized by the system. It predicts the feature vector which will be produced during the next time frame and the difference between the predicted feature vector and the actual feature vector is calculated. This error value is summed over all the frames and then the network which produces the lowest error indicates the word category to which the input pattern belongs. This method was shown to perform significantly better than an HMM only system.

An alternative back propagation network architecture to the TDNN is an MLP which uses a time alignment algorithm as an initial step before the input pattern is presented to the input of the neural network for recognition. Several algorithms exist using either linear time alignment [61] or non-linear time alignment algorithms, such as dynamic time warping [51], [18] and trace segmentation [10]. The TDNN is the approach adopted for the research carried out as described in this thesis where various non-linear time alignment algorithms are compared. The time alignment algorithms are described in more detail in Section 3.3.


1.3 Recognition Of The English Alphabet

The recognition of the spoken English Alphabet by machine has long known to be a difficult and challenging task in Speech Recognition [4]. This is due to the existence of several subsets of easily confusable letters. These subsets are; the E set {B, C, D, E, G, P, T, V}, the set {Q, U}, the set {M, N} and the A set {A, K, J}. To perform the recognition of the letters belonging to these sets, the speech recognition system must use all the discriminatory information contained within the speech signal.

A speech recognition system capable of recognizing the letters of the English alphabet would be useful in many applications. In situations where name information must be given by the user, such as automated directory listing retrieval, spelling of the the name is very often needed to clarify what it is since there are such a limitless number of possible names. Spelling of a word or sentence can be used as "back up" in connected word recognition systems where the system has failed to recognize that portion of the speech.

As mentioned in the previous section, many speech recognition systems use statistical methods of pattern matching. Work has been done using statistical methods for the recognition of the letters of the English alphabet spoken in isolation.

Soong uses a segment-based-network approach to speaker dependent recognition [54]. For each letter, a training token is segmented into subword tokens which consist of 4 units; stationary sounds, fast transitional sounds, slow transitional sounds and consonant clusters. This segmentation is done by hand and is used to initialize the network word reference. The remaining training tokens are aligned to the reference token using a dynamic programming procedure. The aligned training segments are then used to refine the network. An English alphabet database consisting of 4 speakers repeating each letter 5-7 times during training and 10 times during testing was used. An error rate of 7.7% over all the speakers was achieved with this subword-segment-based network, i.e., a performance of 92.3%.

Stern and Vasry applied a maximum posteriori probability (MAP) estimation technique on a speaker by speaker basis as new utterances are presented [55]. The MAP estimation is used to update the mean feature vectors. Classification is carried out using a decision tree structure to make decisions and classify each utterance into its category. An English alphabet database consisting of 20 speakers each repeating the 26 letters of the alphabet 4 times was used. Training was carried out 20 times. Each time 19 speakers were used for training and the 20th was used to test the system changing the test speaker every time. Spectral and temporal analyses were performed on each utterance to provide 60 measurements describing various features of the speech signal. Several different implementations of the MAP learning algorithm were considered. Over the whole alphabet error rates of as low as 6.2% were achieved, that is a performance of 93.8%.

Huang and Gray investigated the use of vector quantization (VQ) based algorithms [19]. The best results were obtained with the conditional histogram VQ algorithm where a performance of 91.59% was achieved on the test data. The database was used in speaker independent mode and consisted of the letters of the English alphabet spoken by 16 speakers over 9 recording sessions. The recordings from the first session were used for training and each speaker provided 10 spoken examples of each letter. The remaining 8 sessions were used for testing and at each session the speakers provided 2 examples of each letter. Linear predictive coding (LPC) analysis was carried out on the data and it was presented to the speech recognition system in this form.

Dai et, al. applied a hidden Markov model (HMM) approach to the problem of recognition of the letters of the English alphabet [9]. A conventional HMM approach was studied and an HMM-TMM (hidden Markov model-time ordering Markov model) approach where time ordering instead of correlation is used to describe the relation between consecutive speech vectors. Experiments were carried out using the British Telecommunication Research Laboratories (BTRL) English alphabet database (see Section 3.1) in the speaker independent form. For the standard HMM a recognition performance of 86.9% was achieved using dynamic spectrum vectors, that is, difference spectrum vectors derived from the static spectrum vectors. A performance of 89.3% was achieved using the HMM-TMM method again with dynamic spectrum vectors. Both of these results were obtained for recognition of the whole alphabet. Recognition of the letters belonging to the E set alone was investigated and in this case the HMM method achieved 81.1% performance and the HMM-TMM method achieved 84.7% performance.

The work described previously on recognition of the letters of the English alphabet illustrates the use of non-neural network methods of pattern recognition. Work has been done using neural networks to recognize the English alphabet. Burr used a 3 layer fully connected neural network and compared its performance with that of a DTW algorithm and achieved 85% performance for both methods [4]. A level offset method which biases input values was used to decrease the learning time and weight pruning was applied to reduce the number of connections in the network (see Section 2.2). Burr also studied the effect of varying the number of hidden units in the neural network and found that there was a critical number above which no improvement in recognition occurs and a slight decrease can even be observed. The database used consisted of 104 speakers saying each letter of the alphabet 4 times. Of the 4 utterances, 3 were used for training the network and the 4th was used to test it. Before being presented to the inputs of the neural network, the speech signal was bandpass filtered, sampled at 10kHz and quantized to 12 bits. It was then pre-emphasised before an end pointing algorithm was applied to detect the utterance boundaries. Spectral values were calculated using a 128 point Hamming windowed fast Fourier transform (FFT) and then 8 spectral averages were taken at each of the 20 time frames.

Cole, et, al. used a system which utilizes neural networks at several different stages for English alphabet recognition [5]. In this system the speech signal was digitized then processed to produce a set of representations: the number of zero crossings in a 10 msec window, the peak to peak amplitude of the original waveform and its low pass filtered form (700Hz), a 128 and a 256 point discrete Fourier transform (DFT) on a 10msec Hamming window and the spectral difference. A neural network was then used to locate the pitch periods in the low pass filtered waveform. At the next stage, the speech was segmented into contiguous intervals and assigned to one of 4 categories; closure or background noise, sonorant interval, fricative and stop. A rule based segmenter and a neural network segmenter were compared. For each utterance 617 features were computed and classification was then carried out by a neural network. The database used consisted of 150 speakers saying the letters randomly. With the rule based segmenter, speaker independent recognition performance of 95.3% was achieved and 96% for multiple speaker recognition. With the neural network segmenter, 93.8% performance was achieved for speaker independent recognition and 94.8% for multiple speaker recognition.

Woodland applied a neural network to the recognition of the letters of the English alphabet using the BTRL database (see Section 3.1) [61]. A 3 layer fully connected neural network was trained using the standard back propagation algorithm. Several networks of different sized hidden layers were trained over 1500 training sweeps for both multiple speaker recognition and speaker independent recognition. For multiple speaker recognition. The best performance achieved was 99.3% on the training set and 91% on the test set. For speaker independent recognition, 99.9% performance was achieved in the training data and 88.3% on the test data.


1.4 Structure Of Thesis

The work described in the remainder of this thesis researches the use of neural networks for speech recognition. A simple time alignment algorithm to be used for the preprocessing of speech signals is investigated. A reduced architecture network called a scaly architecture neural network is examined and its performance on the task of speech recognition assessed. The scaly architecture is optimized to find the configuration which performs best on this task and compared to the performance achieved by an equivalent fully connected neural network.

Chapter 2 provides a basic introduction to neural networks and an overview of their development, theory and training. Reduced connectivity neural networks and their advantages are outlined and the scaly neural network architecture is described in detail. Training of neural networks is discussed and the error back propagation algorithm is described in detail.

In chapter 3, the problems of presenting raw speech data to a neural network are discussed and the required preprocessing of the data to avoid these problems is outlined. The use of time alignment algorithms at the input to a neural network is discussed and the algorithms are described in detail.

The research carried out to investigate the different time alignment algorithms and compare their performance is described in chapter 4. The scaly architecture neural network is investigated in depth in chapters 5 and 6. The results obtained by varying different parameters of the architecture are presented and an optimal structure proposed.

An overview of the results obtained by the research is presented in chapter 7 and the conclusions that can be drawn from them are listed.


[Back] [Contents Page] [Next]