
Chapter 7 Conclusions and Future Work



In the course of the work covered in this thesis, several aspects of neural networks were investigated. The goal was to optimize the performance of a neural network in the task of recognizing spoken letters from the English alphabet. To this end, the following features of the neural network speech recognition system were investigated: the time alignment algorithm for preprocessing of the speech signal, the architecture of the neural network, the activation function employed within the neural network, the learning rate and the output node target values.

The time alignment algorithm and the architecture of the network both have the potential to be reduced in complexity, lowering computational costs and giving a faster, more efficient system. A simplified time alignment algorithm called trace segmentation was investigated in detail and compared to dynamic time warping, which has previously been the most popular algorithm employed. A reduced architecture neural network was examined and its performance compared to that of a fully connected network performing the same task.


7.1 Time Alignment Algorithm

The simplest algorithms available are linear algorithms, which simply remove or duplicate feature vectors within the speech pattern to obtain the required length. Linear algorithms do not take into account the importance of a feature vector before removing it from the speech pattern, which is detrimental to the correct recognition of the speech pattern. A feature which makes that speech pattern distinct from another may be removed and then cannot be used in the recognition process.
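A linear algorithm of this kind can be sketched as follows. This is a minimal illustration, assuming the speech pattern is stored as a NumPy array with one feature vector per row; the function name `linear_align` is hypothetical and is not taken from the software used in this thesis.

```python
import numpy as np

def linear_align(pattern, target_len):
    """Resize a (frames x features) pattern to target_len frames.

    Frames are picked at evenly spaced positions, so frames are dropped
    when the pattern is too long and repeated when it is too short,
    regardless of how much information each frame carries.
    """
    pattern = np.asarray(pattern)
    n_frames = pattern.shape[0]
    # Integer arithmetic keeps the index selection exact.
    idx = (np.arange(target_len) * n_frames) // target_len
    return pattern[idx]

pattern = np.arange(10).reshape(5, 2)   # 5 frames, 2 features each
shorter = linear_align(pattern, 3)      # frames 0, 1 and 3 are kept
longer = linear_align(pattern, 10)      # every frame appears twice
```

The index selection depends only on position, which is exactly why a distinctive frame can be lost.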

Non-linear algorithms are more complex than linear ones but they do take into account the importance of feature vectors when manipulating them to change the length of the speech pattern. Previously, the preferred time warping algorithm has been dynamic time warping (DTW) and it has been widely employed in speech recognition systems. This algorithm warps the time axis of the speech pattern to be recognized to the time axis of a reference speech pattern until they maximally coincide. This algorithm is very computationally intensive and it requires a previously identified reference pattern to be used in its operation.
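The cost of DTW can be seen in a minimal sketch of the textbook dynamic programming recursion. It assumes Euclidean distance between feature vectors and the simplest symmetric step pattern; the exact variant and path constraints used in this thesis are not restated here, and the name `dtw_distance` is hypothetical.

```python
import numpy as np

def dtw_distance(pattern, reference):
    """Accumulated distance of the best warping path between two patterns.

    Both arguments are (frames x features) arrays. Every frame of the
    pattern is compared with every frame of the reference, so the work
    grows with the product of the two lengths.
    """
    pattern = np.asarray(pattern, dtype=float)
    reference = np.asarray(reference, dtype=float)
    n, m = len(pattern), len(reference)
    # Local distance between each pattern frame and each reference frame.
    local = np.linalg.norm(pattern[:, None, :] - reference[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_previous = min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
            cost[i, j] = local[i - 1, j - 1] + best_previous
    return cost[n, m]

reference = np.array([[0.0], [1.0], [2.0]])
stretched = np.array([[0.0], [0.0], [1.0], [2.0], [2.0]])
distance = dtw_distance(stretched, reference)   # timing differences are absorbed
```

The nested loop and the `reference` argument correspond to the two drawbacks noted above: the computational load and the need for a previously identified reference pattern.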

There exists a much simpler non-linear algorithm, known as trace segmentation (TS), which does not require the use of a reference pattern. The TS algorithm is based upon the assumption that, despite timing differences, for speech patterns of the same category fluctuations in frequency with time will occur in the same sequence but over different lengths of time. Where there is no change in frequency over a period of time there will be a large number of sampled feature vectors with approximately the same values. The TS algorithm reduces the length of the speech pattern by removing some of these samples. Increasing the length of a speech pattern can be achieved linearly by repeating feature vectors at regular intervals throughout the pattern. According to the assumption upon which the TS algorithm is based, this should not have any consequence for the recognition of the speech pattern.
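The idea can be sketched using a common formulation of trace segmentation, in which frames are selected at equal increments of accumulated change along the pattern. This is an assumption made for illustration and may differ in detail from the implementation used in this thesis; the name `trace_segmentation` is hypothetical.

```python
import numpy as np

def trace_segmentation(pattern, target_len):
    """Reduce a (frames x features) pattern to target_len frames.

    The trace is the accumulated distance between consecutive frames.
    Frames are picked at equal steps along the trace, so stationary
    stretches, which add nothing to the trace, are thinned out.
    """
    pattern = np.asarray(pattern, dtype=float)
    steps = np.linalg.norm(np.diff(pattern, axis=0), axis=1)
    trace = np.concatenate(([0.0], np.cumsum(steps)))
    marks = np.linspace(0.0, trace[-1], target_len)
    # First frame whose accumulated trace length reaches each mark.
    idx = np.searchsorted(trace, marks, side="left")
    idx = np.minimum(idx, len(pattern) - 1)
    return pattern[idx]

# Two long stationary stretches joined by a short transition.
pattern = np.array([[0.0], [0.0], [0.0], [0.0], [1.0], [2.0], [2.0], [2.0], [2.0]])
reduced = trace_segmentation(pattern, 3)
```

No reference pattern appears anywhere in the computation, and the work is linear in the number of frames.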

The TS algorithm achieves results comparable to the DTW algorithm when tested upon small subsets of letters from the English alphabet. When applied to the task of recognizing the whole English alphabet, the network receiving data preprocessed by the TS algorithm actually gives better results than the network receiving data preprocessed by the DTW algorithm. This is a very significant finding because it suggests that the TS algorithm can be employed in speech recognition systems with substantial savings in computational costs over the DTW algorithm. These savings are made without any loss in performance and in some cases with better performance.


7.2 Neural Network Architecture

Neural networks have proven to be useful tools for pattern recognition and are therefore suitable for discriminating between different categories of spectral patterns such as those obtained for speech signals. The most common form of a neural network is a fully connected architecture network. Every node in the input layer is connected to every node in the hidden layer and every node in the hidden layer is connected to every node in the output layer. The complexity of the networks means that their training and operation take a relatively long time. In a hardware situation the problem comes with scaling and the number of components required to implement a network.

A reduced architecture known as a scaly neural network was investigated in detail to determine its ability in the recognition of spoken letters from the English alphabet. In the scaly architecture network, input frames are divided up into zones which overlap each other. Each zone is associated with a single frame in the hidden layer and every frame in that zone is connected to that hidden layer frame. Several features of the scaly architecture can be varied and these features were investigated to determine which configuration gave optimal results for the task in question.

The first feature of a scaly network that was investigated was the size of the input layer. It was found that the performance of the network improved initially as the input layer size was increased. Once the size of the input layer reached 30 frames, the gain in performance from further increases in input layer size became very small. Beyond an input layer size of 35 frames any improvement in performance was insignificant, and in some cases the performance started to fall as the input layer size was increased. It was determined that an input layer size of 35 frames gave optimal performance with minimal network complexity.

The size of the input zone, the size of the input zone overlap and the size of the hidden layer were then investigated to evaluate their effect on the performance of the network. It was established that a network with as small an input zone as possible and as large an overlap as possible for that input zone size gave the best performance on the test data set. This was also the configuration that gave as large a hidden layer size as was possible for that size of input layer in a scaly neural network. The best architecture for the scaly network with an input layer size of 35 frames was therefore found to be an architecture with an input zone size of 2 frames, an overlap of 1 frame and a hidden layer size of 34 frames. This is for speaker independent recognition where the performance on the test data is the most relevant result.
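The relationship between input layer size, input zone size, overlap and hidden layer size can be checked with a short sketch. It works at the level of frames only, since the number of nodes within each frame is not restated in this chapter, and the function names are hypothetical.

```python
import numpy as np

def scaly_hidden_size(input_frames, zone, overlap):
    """Number of hidden frames when each input zone feeds one hidden frame."""
    stride = zone - overlap          # how far each successive zone advances
    if stride <= 0:
        raise ValueError("overlap must be smaller than the zone size")
    return (input_frames - zone) // stride + 1

def scaly_mask(input_frames, zone, overlap):
    """Boolean (hidden x input) matrix marking which frames are connected."""
    stride = zone - overlap
    hidden = scaly_hidden_size(input_frames, zone, overlap)
    mask = np.zeros((hidden, input_frames), dtype=bool)
    for h in range(hidden):
        mask[h, h * stride : h * stride + zone] = True
    return mask

hidden_frames = scaly_hidden_size(35, zone=2, overlap=1)
mask = scaly_mask(35, zone=2, overlap=1)
frame_connections = int(mask.sum())                 # 2 input frames per hidden frame
fully_connected = mask.shape[0] * mask.shape[1]     # same layer sizes, no zoning
```

With 35 input frames, a zone of 2 frames and an overlap of 1 frame, the sketch gives the 34 hidden frames quoted above, and the mask makes the saving over a fully connected input-to-hidden layer of the same size explicit.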

Once an optimal basic scaly architecture neural network had been determined for recognizing the letters of the English alphabet, a few simulations were run to examine the effect of adding extra connections. The idea was to retain the basic scaly architecture but extend it with extra connections and determine whether any noticeable improvement in performance was obtained. The scaly architecture is extended by introducing zones in the hidden layer as well as the input layer. Instead of every input frame in an input zone being connected to one associated frame in the hidden layer, it is connected to every frame in the associated hidden zone.

The scaly networks that have been looked at so far in this thesis have a hidden zone size of one frame. The basic optimal scaly architecture was implemented with the size of the hidden zone gradually increased in steps of one frame, and the performance on recognizing the whole alphabet was obtained. The sigmoid01 activation function and the output node target values 0 and 1 were used, as these were previously found to produce better results. The hidden zones were overlapped by the number of frames in a hidden zone less one. Figure 7.1 shows the performance obtained as the size of the hidden zone is increased, starting with a hidden zone size of one frame and increasing up to a size of 6 frames.

It can be seen from Figure 7.1 that there is a very slight increase in the performance of the network on the training data as the hidden zone size is increased. This is not the case for the test data where the performance varies very little as the hidden zone size is increased. Looking at the values for the performance achieved in Table 7.1 it can be seen that there is actually a slight decrease in the performance, approximately 1%, as the hidden zone increases. This may be a result of the network learning the training set too successfully and being less able to generalize to unseen data such as that in the test set. The more connections in the network the better it learns the training set resulting in better performance on that data but poorer generalization.

The total number of weights in each of these networks is listed in Table 7.1. It can be seen that using a hidden zone size of 6 increases the total number of weights in the network by 31%. The number of weights in the network is increased by nearly a third for a very small increase in performance on the training data and a decrease in performance on the test data.


Table 7.1 Performance, Error And Number Of Weights For Networks With Increasing Size Of Hidden Zone

[Performance, Error And Number Of Weights For Networks With Increasing Size]



[Network Performance With Size Of Hidden Zone]

Figure 7.1 : Network Performance With Size Of Hidden Zone



[Network Error With Size Of Hidden Zone]

Figure 7.2 : Network Error With Size Of Hidden Zone



7.3 Other Features Of The Neural Network

Other variable features of the neural network were investigated to determine the most suitable for a network carrying out the task of recognizing spoken letters from the English alphabet. These features were the activation function used within the neurons in the network and the learning rate employed in the back propagation algorithm.

Three possible activation functions were available in the neural network simulation software used: the sigmoid function over the interval 0 to 1, the sigmoid function over the interval -1 to 1 and the sigmoid function over the interval -1.71 to 1.71. The sigmoid function over the interval -1.71 to 1.71 gave poor results and was therefore judged unsuitable for the task in question. The other two sigmoid functions gave comparable results, with the sigmoid function over the interval -1 to 1 giving the best results overall. However, the sigmoid function over the interval 0 to 1 gave the best results on the test data, which are the most significant results in a speaker independent recognition system. It should be kept in mind that the differing results achieved with the three activation functions may be the consequence of differences in convergence rate and the fact that the networks are still converging when training is halted.
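The three output ranges can be illustrated with a single rescaled logistic function. The exact functional forms used by the simulation software are not given in this chapter, so the sketch below is an assumption that reproduces only the stated intervals; the name `scaled_sigmoid` is hypothetical.

```python
import numpy as np

def scaled_sigmoid(x, low, high):
    """Logistic curve rescaled so that its output lies between low and high."""
    x = np.asarray(x, dtype=float)
    return low + (high - low) / (1.0 + np.exp(-x))

x = np.array([-50.0, 0.0, 50.0])
sigmoid01 = scaled_sigmoid(x, 0.0, 1.0)        # interval 0 to 1
sigmoid11 = scaled_sigmoid(x, -1.0, 1.0)       # interval -1 to 1
sigmoid171 = scaled_sigmoid(x, -1.71, 1.71)    # interval -1.71 to 1.71
```

Because the wider intervals also steepen the curve around zero, the same learning rate produces different effective step sizes, which is consistent with the remark above about differences in convergence rate.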

No significant pattern could be found in the results obtained for the three learning rates investigated: 0.01, 0.005 and 0.008. This meant that a learning rate which gave consistently better results could not be determined. Overall the best results were obtained when the learning rate was 0.01, so this value was employed in all subsequent simulations.


7.4 Output Node Target Values

It was suggested that better performance may be gained, especially for speaker independent recognition, by utilizing target values on the output nodes other than those usually employed. This means using values such as 0.1 and 0.9 or 0.01 and 0.99 instead of the more usual 0 and 1. The theory is that, since the network is not trying to attain the perfect values of 0 and 1 but lesser values, overlearning is less likely to occur. This will mean that the network is more capable of dealing with speaker independent recognition where it has to recognize examples of speech from speakers it has never "heard" before. This is because it has not become too specialized in recognizing those speech samples that it has already experienced.
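The alternative target schemes differ only in the two values written into the target vector. A minimal sketch, assuming one output node per letter and integer class labels (the name `make_targets` is hypothetical):

```python
import numpy as np

def make_targets(labels, n_classes, low=0.0, high=1.0):
    """One target vector per example: high on the correct node, low elsewhere."""
    labels = np.asarray(labels)
    targets = np.full((len(labels), n_classes), low, dtype=float)
    targets[np.arange(len(labels)), labels] = high
    return targets

labels = [0, 2]                                        # e.g. first and third letter
usual = make_targets(labels, 3)                        # targets 0 and 1
softened = make_targets(labels, 3, low=0.1, high=0.9)  # targets 0.1 and 0.9
```

With a sigmoid bounded by 0 and 1, the targets 0 and 1 can only be approached asymptotically, whereas 0.1 and 0.9 are reachable with finite weights; this is the motivation for the suggestion described above.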

It was established that values such as 0.1 and 0.9, which are relatively far from 0 and 1, gave the poorest results, and they were not used in any further simulations. Very little difference was found between the results obtained with output node target values such as 0.01 and 0.99 and those obtained with the more usually employed values. No justification was found in the results for the use of output node target values other than the absolute values 0 and 1.


7.5 An Optimal Speaker Independent Speech Recognition System

The findings in this thesis suggest an optimal system for the speaker independent recognition of spoken letters from the English alphabet using a reduced architecture neural network. This system consists of a preprocessing portion using the trace segmentation time alignment algorithm and a scaly architecture neural network with 35 frames in the input layer, 34 frames in the hidden layer, an input zone of 2 frames and an input zone overlap of one frame. Up to this point, results had only been obtained on such a system when recognizing the E-set from the English alphabet. To test the findings, the system was tested on the recognition of the full English alphabet. Using the sigmoid01 activation function and the output node target values 0 and 1, a performance of 95.85% was achieved on the training data and a performance of 76.46% was achieved on the test data. Woodland achieved performance rates between 80% and 87% in speaker dependent recognition with a fully connected network and no weight limiting. The scaly network still performs significantly worse than the fully connected network, but its performance has been improved by using the findings in this thesis to select an optimized architecture. Woodland employs methods of weight limiting and preprocessing of the speech data to have zero mean and a variance of one third. Much better performance is obtained for the networks presented with this preprocessed data. Better results may be possible with the scaly network if it is presented with data preprocessed in a similar manner, and scaling of the input data will result in faster convergence.
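Scaling data to zero mean and a variance of one third can be sketched as below. Whether Woodland normalizes each feature separately or the data set as a whole is not stated here; the sketch assumes per-feature normalization over the training set, and the name `normalize` is hypothetical.

```python
import numpy as np

def normalize(data, target_var=1.0 / 3.0):
    """Shift each column to zero mean and scale it to the target variance."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std = np.where(std == 0.0, 1.0, std)   # leave constant columns at zero
    return (data - mean) / std * np.sqrt(target_var)

rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=4.0, size=(200, 3))   # stand-in for feature data
scaled = normalize(raw)
```

A variance of one third is the variance of a uniform distribution over the interval -1 to 1, so inputs scaled this way mostly fall within that interval.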

Utilizing a larger set of training data may decrease the large discrepancy between the performance on the test data and the performance on the training data. This difference most likely indicates that the network has become too specialized in recognizing the contents of the training data set. A larger set of training data should decrease the probability of overtraining the network and lead to better generalization.

This network is optimized for the task of recognizing the letters of the English alphabet. For other vocabularies a similar study would have to be performed to determine the optimal features of a scaly network applied to the recognition of that vocabulary. The neural network architecture in this thesis was investigated with the task of recognizing letters spoken in isolation in mind. This may be of use in applications such as word verification. For example, when someone is using directory services on the phone and clarification of a name is required, a neural network could be used to recognize the letters spelling the name. For this reason the use of the scaly neural network for other vocabularies and for continuous speech recognition is considered beyond the scope of this thesis.


7.6 Future Work

The trace segmentation (TS) time alignment algorithm performs well in a speech recognition system where the task is to recognize isolated letters from the English alphabet. Performance of this speech recognition system is comparable with a system using the more widely known and used dynamic time warping algorithm (DTW). In some cases, better results are achieved with the TS system.

The TS algorithm is much less complex and computationally intensive than the DTW algorithm so it would be worthwhile to investigate its performance on other examples of speech. In this study, only letters of the English alphabet were investigated but the performance of the TS algorithm on whole words would prove informative.

The DTW algorithm has been developed further so that it can be utilized in the implementation of connected speech recognition systems. It would be advantageous to develop a version of the much less complex TS algorithm which could cope with connected speech.

The scaly architecture provides a suitable network to deal with the recognition of letters from the English alphabet in speaker dependent recognition. It proves less capable at recognizing new samples of speech, as would be required for a speaker independent recognition system. Reduced architecture systems are more practical for real life applications, so good performance from them is important. In software implementations such a system is less costly computationally. If a hardware implementation is required, a reduced architecture network proves less costly since it requires far fewer components and is more easily scalable.

The scaly architecture did give promising results for speaker independent recognition but they fall significantly short of those achieved by a fully connected network. One avenue to explore for enhancing the performance of the scaly network is the use of improved error back propagation algorithms. Much work has been done on these optimized algorithms, such as the work by Ngolediage et al. [38]. Several of these algorithms are described in the paper by Gallant [14].