

Chapter 4 Investigation Of Time Alignment Algorithms



4.1 Introduction

As described in chapter 3, speech pattern time axis variation is a major problem of speech recognition systems. Two algorithms available for dealing with this, dynamic time warping (DTW) and trace segmentation (TS), are explained in detail.

Both methods are examined and used in conjunction with a scaly and a fully connected neural network for speaker dependent recognition of letters from the English alphabet database. Their performance and relative merits are then compared to determine which method is the most suitable for the task. The performance of the fully connected and the scaly networks is also compared to determine whether the reduced architecture incurs any drop in performance and, if so, how large a drop.

The use of different transfer functions within the processing units of the neural network is examined and several versions of the sigmoid function are compared. Variation of the learning rate in the error back propagation algorithm is also investigated by comparing the performance of the neural network when three different values of learning rate are used.


4.2 The English Alphabet Database

The source of the data for training and testing the neural network is a database from the British Telecommunications Research Laboratories (BTRL). It consists of each of the letters of the English alphabet being spoken three times by 104 speakers.

The database was recorded at BTRL using British Telecom employees as the speakers. The speakers consist of 53 males and 57 females of various ages. The data is provided in two formats: a speaker independent training and test form and a multiple speaker training and test form.

In multiple speaker form, the first two utterances of each letter spoken by every speaker are used to train the network and the remaining utterance is used to test the network. This means that the network is tested using utterances spoken by voices from which it has already seen examples. In this form, the speech database contains 5302 utterances which can be used for training the network and 2674 utterances which are available for testing.

In speaker independent form, 52 speakers are designated training talkers and 52 are designated testers. The three examples of each letter spoken by each of the trainers are used to train the network and the three examples of each letter spoken by the testers are used to test the trained network. The database is designed such that there is a balance of men and women in the trainer and tester sets. This form tests the network's performance on utterances spoken by voices it has never heard before. This results in 3999 utterances in the database to use for training the network and 3977 which are available for testing.

The speech signals are recorded in a low noise environment with high quality recording equipment. The signals are sampled at 20 kHz using a 16-bit A/D converter. The speakers are prompted in a random order to say each letter and after each prompt two seconds of data are recorded to disk.

Accurate endpointing of speech data is a difficult task but reasonable results can be achieved in isolated word recognition where the input data is surrounded by silence. Some of the typical algorithms used to perform endpointing look at the energy of the incoming signal. Other algorithms examine the number of zero-crossings, that is, the number of times that the audio signal changes from positive to negative and vice versa. When the energy of the signal or the number of zero-crossings reaches a certain level it is reasonable to assume that a speech signal is present. In this case, an automatic endpointing routine gives putative endpoints for each utterance. These are then checked by a human operator and adjusted if required. "Bad" utterances, that is, those where the wrong letter is spoken or the recording is clipped, are discarded at this stage.
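The energy and zero-crossing measures described above can be sketched in Python as follows. This is only an illustration: the frame length, both thresholds and the function names are assumptions made for the example and are not taken from the endpointing routine used at BTRL.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def frame_energy(frame):
    """Sum of squared sample values in the frame."""
    return sum(s * s for s in frame)


def putative_endpoints(signal, frame_len=200, energy_thresh=1.0, zc_thresh=50):
    """Return (start, end) sample indices of the detected speech, or None.

    A frame counts as speech when its energy or its zero-crossing count
    exceeds the threshold. Frame length and thresholds are assumed values.
    """
    speech_frames = []
    for i in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[i:i + frame_len]
        if frame_energy(frame) > energy_thresh or zero_crossings(frame) > zc_thresh:
            speech_frames.append(i)
    if not speech_frames:
        return None
    return speech_frames[0], speech_frames[-1] + frame_len
```

In practice the thresholds would be tuned to the background noise level, and, as in the procedure described above, the resulting endpoints would still be checked by hand.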


4.3 Mel Frequency Cepstral Coefficients

The digitized sound signal contains a lot of irrelevant information and requires a lot of storage space. To simplify the subsequent processing of the signal, useful features must be extracted and the data compressed. The power spectrum of the speech signal is the most commonly used encoding. The human ear performs something very similar to a Fourier Transform on incoming sound signals before passing the information on to the brain for analysis [23].

Before transforming the speech signal into its power spectrum it must be divided up into overlapping blocks. Here, a Hamming window is used, which is a raised cosine where,


[Equation]


Mel Frequency Cepstral Coefficients (MFCCs) are used to encode the speech signal. Mel scale frequencies are distributed linearly in the low range but logarithmically in the high range, which corresponds to the physiological characteristics of the human ear [56]. Cepstral analysis calculates the inverse Fourier transform of the logarithm of the power spectrum of the speech signal.

For each utterance, the cepstral coefficients are calculated for frames of 512 samples, with successive frames overlapping by 256 samples. Each frame is pre-emphasized and then windowed by the Hamming window to minimize the endpoint effects of chopping a 512 sample section out of the speech signal. A Fast Fourier Transform (FFT) is used to calculate the discrete magnitude spectrum. The energy values in 26 overlapping Mel spaced frequency bands are calculated. This results in each frame being represented by 8 MFCCs, a data reduction from 512 samples to 8 coefficients.
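The framing, pre-emphasis and windowing steps can be sketched as below. The window uses the conventional Hamming coefficients 0.54 and 0.46, and the pre-emphasis coefficient of 0.97 is a commonly used value; both are assumptions, since the exact constants used to prepare the database are not given here. The function names are illustrative only.

```python
import math


def hamming(n_points):
    """Conventional Hamming (raised cosine) window of n_points samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def pre_emphasize(frame, coeff=0.97):
    """First-order pre-emphasis; the 0.97 coefficient is an assumed value."""
    return [frame[0]] + [frame[i] - coeff * frame[i - 1]
                         for i in range(1, len(frame))]


def windowed_frames(signal, frame_len=512, overlap=256):
    """Split signal into overlapping frames, pre-emphasize and window each."""
    step = frame_len - overlap
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = pre_emphasize(signal[start:start + frame_len])
        frames.append([s * w for s, w in zip(frame, window)])
    return frames
```

With 512 sample frames and a 256 sample overlap, a new frame starts every 256 samples, so a 1024 sample signal yields three frames. Each windowed frame would then be passed to the FFT and Mel filter bank stages.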

The data is used in its raw form as presented in the database. No further preprocessing is performed on the data before the time alignment methods are employed.

4.4 The Neural Network Simulator

The neural network simulator used is called Galatea and is part of the ESPRIT II 2059 Pygmalion project. Galatea consists of a set of C libraries which provide efficient versions of all the most common neural network algorithms used for speech and image recognition. The version used contains algorithm modules for Hopfield networks, Kanerva Associative Memory, Linear Associative Memory, Adaptive Resonance Theory and the algorithm of interest in this thesis, Gradient Back Propagation. At the time this version was released algorithm modules were being written for Linear Vector Quantization and Kohonen Topological Maps.

The back propagation simulation module offers a choice of three different non-linear activation functions. These are :

sigmoid01 : f(x) = 1/(1 + exp(-x)) (shown in figure 4.1)
sigmoid11 : f(x) = (1 - exp(-x)) / (1 + exp(-x)) (shown in figure 4.2)
standard : f(x) = 1.71 * tanh(0.666x) (shown in figure 4.3)
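For reference, the three functions listed above can be written directly in Python. The formulas are those given in the list; the code itself is a sketch and is not taken from the Galatea libraries.

```python
import math


def sigmoid01(x):
    """Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid11(x):
    """Symmetric sigmoid, output in (-1, 1); equal to tanh(x / 2)."""
    return (1.0 - math.exp(-x)) / (1.0 + math.exp(-x))


def standard(x):
    """Scaled hyperbolic tangent, output in (-1.71, 1.71)."""
    return 1.71 * math.tanh(0.666 * x)
```

Note that sigmoid11 is simply sigmoid01 rescaled to be symmetric about zero, since sigmoid11(x) = 2 * sigmoid01(x) - 1.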



Figure 4.1 : The Sigmoid01 Function




Figure 4.2 : The Sigmoid11 Function




Figure 4.3 : The Standard Sigmoid Function





4.5 Experimental Procedure

The data is used in the multiple speaker training and test form which, as described in section 4.2, involves using the first two utterances spoken by each speaker for training and the third utterance for testing.

Training the networks to recognize the complete English alphabet is extremely time consuming, so initially the networks are trained to recognize smaller subsets of letters, {A, B} and then {A, B, C}. The performance of the network when different activation functions and learning rates are used is compared. The networks can then be trained to recognize the complete English alphabet and the results obtained compared to those for the subsets of letters to see if they provided a good indication of performance on the more difficult task.

The mean length of an utterance over the entire data set is calculated and found to be 35 frames. With both methods of time alignment all utterances are warped to this mean length. For the dynamic time warping algorithm an utterance from each category of letter to be recognized is obtained to act as the reference pattern vector against which all other utterances of that category are warped.

The architecture of both the scaly and fully connected networks therefore requires an input layer of 280 input neurons to accommodate the 35 input feature vectors, each of which contains 8 coefficients. For the scaly architecture, the number of frames in a zone is taken as 10 with an overlap of 5 frames, so that the number of frames required in the hidden layer is 6. The hidden layer therefore consists of 6 frames each of 8 neurons, giving 48 neurons in total. The number of output classes is 26 so 26 neurons are required in the output layer. This scaly architecture is illustrated in figure 2.3.

The fully connected network used for comparison of performance has the same layer sizes as the scaly network (280-48-26) but the output of every node in the input layer is connected to the input of every node in the hidden layer and the output of every node in the hidden layer is connected to the input of every node in the output layer. The fully connected architecture is illustrated in figure 4.4.
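The layer sizes quoted above follow from simple arithmetic, which the short sketch below reproduces. The function names are illustrative, and the weight count covers only the fully connected case with biases excluded.

```python
def scaly_zones(n_frames=35, zone_len=10, overlap=5):
    """Return the (start, end) input-frame span of each hidden-layer frame."""
    step = zone_len - overlap
    return [(s, s + zone_len) for s in range(0, n_frames - zone_len + 1, step)]


def layer_sizes(n_frames=35, n_coeffs=8, n_classes=26):
    """Input, hidden and output layer sizes for the networks in this chapter."""
    hidden_frames = len(scaly_zones(n_frames))
    return n_frames * n_coeffs, hidden_frames * n_coeffs, n_classes


def fully_connected_weights(sizes):
    """Number of weights (biases excluded) between successive layers."""
    return sum(a * b for a, b in zip(sizes, sizes[1:]))
```

With the default values the six zones cover input frames 0-10, 5-15, 10-20, 15-25, 20-30 and 25-35, giving the 280-48-26 layout, and the fully connected network has 280 x 48 + 48 x 26 = 14688 weights.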



Figure 4.4: Fully Connected Neural Network Architecture


As described in section 4.4, the neural network simulator used offers a choice of three non-linear activation functions. The first is the sigmoid01 function which, it can be seen from figure 4.1, has an upper limit of 1 and a lower limit of 0 so the desired outputs of the network will be 0 and 1. The desired outputs are generated such that a +1 is at the output of the neuron of the correct category and a 0 on all the other output neurons. For example, the desired outputs for the letter A are :-

+1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The second non-linear activation function is the sigmoid11 function which, it can be seen from figure 4.2, has an upper limit of 1 and a lower limit of -1 so the desired outputs are generated such that a +1 is at the output of the neuron of the correct category and a -1 on all the other output neurons. For example, the desired outputs for the letter A are :-

+1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

The third choice of non-linear activation function offered is the standard sigmoid function which, it can be seen from figure 4.3, has an upper limit of +1.71 and lower limit of -1.71. The desired outputs are generated such that a +1.71 is at the output of the neuron of the correct category and a -1.71 on all other output neurons. For example, the desired outputs for the letter A are :-

+1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71 -1.71
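The three target encodings differ only in the values used for the correct and incorrect categories, which a small helper makes explicit. The names below are hypothetical and the code is a sketch, not part of the simulator.

```python
import string

# (target for the correct class, target for every other class)
TARGET_LEVELS = {
    "sigmoid01": (1.0, 0.0),
    "sigmoid11": (1.0, -1.0),
    "standard": (1.71, -1.71),
}


def desired_outputs(letter, activation, letters=string.ascii_uppercase):
    """Build the desired output vector for one letter."""
    high, low = TARGET_LEVELS[activation]
    return [high if candidate == letter else low for candidate in letters]
```

For the subset tasks the same helper can be called with letters="AB" or letters="ABC", so that the output vector has one entry per category being recognized.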

The networks are trained using each of the three values for the learning rate e (0.005, 0.008 and 0.01) for the activation function sigmoid01 over 40 training sweeps to recognize the letters A and B. In each training sweep the complete training set is presented to the network. With each presentation of a training pair to the network the weights are adjusted until the change in the weights between iterations is less than a preset tolerance value (0.1 here) or the number of iterations becomes equal to a maximum value (1000 iterations here). If the difference in weights between iterations is less than the tolerance after only one iteration then this indicates that the convergence criterion for the network is not being met and the network is effectively no longer learning.

The performance and error on the training set are recorded after every sweep and then the test set is presented to the network and its performance and error on these samples recorded. Training is then extended to 300 sweeps so that the performance of the network on the training data is attaining reasonably high levels. Performance and error are recorded after every set of 5 training sweeps. The same procedure is then followed for the functions sigmoid11 and standard.

The same procedure just described for training various networks to recognize A and B is repeated, this time training the network to recognize A, B and C with training being extended to the same number of sweeps.

The same procedure is again repeated this time training the network to recognize all the letters of the English Alphabet with training being extended to 1000 sweeps.


4.6 Simulation Results

4.6.1 Training on 'A' and 'B'

Tables 4.1 and 4.2 show the performance and error after 300 training sweeps when a scaly architecture net is employed. Table 4.1 shows the results when the network is being presented with data from the training set, i.e. samples of speech which it has already seen before. Table 4.2 shows the results when the network is being presented with samples of speech from the test set which it has never seen before. Tables 4.3 and 4.4 show the results when a fully connected architecture is used with Table 4.3 showing the results on the training set and Table 4.4 results from the test set.

The sigmoid11 function with output values -1 and +1 is the best performer achieving 100% performance on the training data for both network architectures, all time alignment algorithms and all values of e within the 300 training sweeps. It also results in the best performance on the test data achieving 97.1% for the scaly network (with the dynamic time warping algorithm, slope constraint of 1, and e = 0.01) and 96.14% for the fully connected network (with the trace segmentation algorithm and e = 0.01). Overall the performances achieved when the fully connected network is used are about equivalent to those achieved using the scaly network on the training data set. When the scaly architecture is employed, performance on the test data set is overall better or equivalent to that for the fully connected network.

When the standard sigmoid function is used with the fully connected network, the simulator reports that the update units are unbound. The neural network simulator allows a convergence criterion to be set for the neural network; the default values are a tolerance of 0.1 and a maximum of 1000 iterations. During training the weights are updated and the new state of the network compared to the old state. If the difference is greater than the tolerance the weights are updated again. If after 1000 iterations the difference between the old state and the new state is still greater than the tolerance then the unit is considered unstable and the message "Update units: Unbound" is given.
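The convergence test just described amounts to a bounded iteration, sketched below. This is not the Galatea implementation: the function name, the use of the largest absolute change as the measure of difference, and the representation of the state as a flat list are all assumptions made for illustration.

```python
def update_until_stable(state, update, tolerance=0.1, max_iterations=1000):
    """Repeat update(state) until the largest change falls to the tolerance.

    Returns (final_state, iterations_used, bound). bound is False when the
    iteration limit is reached without meeting the tolerance, which is the
    situation reported as unbound in the text. The max-absolute-change
    measure is an assumption.
    """
    for iteration in range(1, max_iterations + 1):
        new_state = update(state)
        difference = max(abs(n - o) for n, o in zip(new_state, state))
        state = new_state
        if difference <= tolerance:
            return state, iteration, True
    return state, max_iterations, False
```

An update that shrinks the state on each pass is reported as bound after a few iterations, whereas one that keeps changing the state by more than the tolerance exhausts the iteration limit and is reported as unbound.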


Table 4.1 Error And Performance Of The Scaly Network In Recognising The Letters 'A' And 'B' From The Data Training Set After 300 Training Sweeps




Table 4.2 Error And Performance Of The Scaly Network In Recognising The Letters 'A' And 'B' From The Data Test Set After 300 Training Sweeps




Table 4.3 Error And Performance Of The Fully Connected Network In Recognising The Letters 'A' And 'B' From The Data Training Set After 300 Training Sweeps




Table 4.4 Error And Performance Of The Fully Connected Network In Recognising The Letters 'A' And 'B' From The Data Test Set After 300 Training Sweeps




Both time alignment algorithms perform about equally on the training data set with any differences being less than 1%. On the test set there is a more marked difference between the performances achieved. With the scaly architecture network, the best performance on the test set is achieved by the DTW with slope=1. The TS algorithm achieves the best performances with the fully connected network.

The error value is the least mean squared error between the actual outputs and the desired outputs of the network. In comparing the error achieved with the different network configurations the results follow the same pattern as those observed for performance. In most cases the criterion producing the best performances will also produce the lowest or very close to the lowest error values.
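The two quantities reported in the tables can be illustrated as follows. The exact normalization of the error used by the simulator is not stated, so the averaging over all patterns and output neurons, and the rule that the largest output decides the recognized category, are assumptions.

```python
def mean_squared_error(outputs, targets):
    """Mean of squared differences over all patterns and output neurons."""
    total, count = 0.0, 0
    for out_vec, tgt_vec in zip(outputs, targets):
        for o, t in zip(out_vec, tgt_vec):
            total += (o - t) ** 2
            count += 1
    return total / count


def performance(outputs, targets):
    """Percentage of patterns whose largest output is on the correct neuron.

    The winner-take-all decision rule is an assumed convention.
    """
    correct = sum(
        1 for out_vec, tgt_vec in zip(outputs, targets)
        if out_vec.index(max(out_vec)) == tgt_vec.index(max(tgt_vec))
    )
    return 100.0 * correct / len(outputs)
```

Under these definitions a network can score a high performance while still having a comparatively large error, because performance depends only on which output is largest whereas the error measures how close every output is to its desired value.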

The change in performance and error of the neural network as the number of training sweeps increases is of interest. Performance and error versus the number of training sweeps are plotted for all of the networks trained, and several examples of these plots can be found in Appendix A (Figures A.1, A.2, A.3 and A.4). These plots show the standard convergence behaviour found to be the case for all of the networks trained.


4.6.2 Training on 'A', 'B' and 'C'

Tables 4.5 and 4.6 show the performance and error after 300 training sweeps on a scaly neural network for the training and test sets respectively. Tables 4.7 and 4.8 show the results when a fully connected neural network is employed.

All sigmoid functions perform well on the training set, achieving performances greater than 97% overall and in most cases (greater than two-thirds) a performance of greater than 99% is attained. The sigmoid01 and sigmoid11 functions perform best with the scaly architecture on the test set, both achieving the highest performance of 96.13%. However, the sigmoid01 function achieves a lower error, which is desirable since it means the actual outputs on the sigmoid01 network are closer to the desired outputs. With the fully connected architecture the sigmoid11 function performs best on the test set, achieving 96.45% performance and an error of 0.00529.


Table 4.5 Error And Performance Of The Scaly Network In Recognising The Letters 'A' , 'B' And 'C' From The Data Training Set After 300 Training Sweeps




Table 4.6 Error And Performance Of The Scaly Network In Recognising The Letters 'A', 'B' And 'C' From The Data Test Set After 300 Training Sweeps




Table 4.7 Error And Performance Of The Fully Connected Network In Recognising The Letters 'A', 'B' And 'C' From The Data Training Set After 300 Training Sweeps




Table 4.8 Error And Performance Of The Fully Connected Network In Recognising The Letters 'A', 'B' And 'C' From The Data Test Set After 300 Training Sweeps




As with the network trained to recognize the letters A and B, when a fully connected network is used with the standard sigmoid function the message is given that the update units are unbound, which means the network has not met the convergence criterion and is no longer learning.

With this recognition task there is still not much difference in the performance of the time alignment algorithms on the training set. The differences are again more obvious on the test set and the same pattern is seen as in section 4.6.1: the DTW algorithm performs best with the scaly architecture and the TS algorithm performs best with the fully connected architecture.

The speed of learning and the final values of performance and error are not as good as for the network trained to recognize just the letters 'A' and 'B'. This is to be expected due to the extra learning incurred with the addition of the letter 'C' to the recognition task and the added difficulty of the task.

Again, performance and error versus the number of training sweeps are plotted for all of the networks trained, and several examples of these plots can be found in Appendix A (Figures A.5, A.6, A.7, A.8, A.9 and A.10). These plots show the standard convergence behaviour found to be the case for all of the networks trained.


4.6.3 Training On The Whole Alphabet

Tables 4.9 and 4.10 show the performance and error after 1000 training sweeps on a scaly network for the training and test sets respectively. Tables 4.11 and 4.12 show the results when a fully connected neural network is employed.

When the network is trained to recognize the whole English alphabet the trace segmentation (TS) algorithm gives the best results overall. When the scaly architecture is employed the TS algorithm achieves performances several percent greater than the best attained using the DTW algorithm. With the fully connected network the TS algorithm still performs better than the DTW algorithm although the difference in performances achieved between the two is much less, in most cases less than 1%. This is true for both the training data set and the test data set.


Table 4.9 Error And Performance Of The Scaly Network In Recognising The Full English Alphabet From The Data Training Set After 1000 Training Sweeps




Table 4.10 Error And Performance Of The Scaly Network In Recognising The Full English Alphabet From The Data Test Set After 1000 Training Sweeps




Table 4.11 Error And Performance Of The Fully Connected Network In Recognising The Full English Alphabet From The Data Training Set After 1000 Training Sweeps




Table 4.12 Error And Performance Of The Fully Connected Network In Recognising The Full English Alphabet From The Data Test Set After 1000 Training Sweeps




There is a marked difference in network performance between the scaly network and the fully connected network. The fully connected network achieves performances of above 90% on the training data and above 80% on the test data. The scaly network only achieves a performance of 67.67% at best on the training data and 64.25% on the test data.

As for the previous results in sections 4.6.1 and 4.6.2, no pattern can be seen to indicate which value of the learning rate e gives better results in all cases. Different sigmoid functions give better results depending on the network being utilized. For the scaly network the sigmoid01 function gives the best performance whereas for the fully connected network the sigmoid11 function gives the best results.

Performance and error versus the number of training sweeps are plotted for the networks trained and these plots can be found in Appendix A (Figures A.11, A.12, A.13 and A.14). These plots show the convergence behaviour for the networks trained to recognize the whole alphabet.


4.7 Discussion

4.7.1. Effect Of Architecture On Performance Rates

For small sets of letters, high performance rates and low error levels are achieved by the scaly architecture. Performance does however fall significantly for the task of recognizing all 26 letters of the alphabet. The network's ability to learn already decreases with the addition of one extra letter, between learning to recognize 'A' and 'B' and learning to recognize 'A', 'B' and 'C'. The decrease in ability to learn for the whole alphabet is therefore to be expected. As the number of training sweeps is increased further, the performance for this task should improve but the cost in learning will be greatly increased. There is a greater drop in performance with the scaly architecture than with the fully connected architecture. The fully connected network converges far more smoothly than the scaly network, for which performance and error fluctuate as the network is trained. At one point the performance of the scaly network reaches a peak and then begins to fall as the network is trained further. Beyond 800 training sweeps performance starts to rise slowly again and the error falls. Only one type of scaly architecture is implemented here but many possible variations can be realized by changing the number of neurons in a zone and the overlap between zones. Variation of the scaly architecture and the resultant effect on the performance of the neural network is investigated in detail in Chapter 5.

The fully connected architecture may be achieving lower performance rates in some cases for the smaller sets of letters because of the smaller amount of data available for training. This is further suggested by the fact that the fully connected network performs better than the scaly network on the training data but in many cases has lower performance on the test data. When the network becomes so specialized in recognizing the training data, it generalizes more poorly and achieves lower performance rates on examples of speech it has never seen before.


4.7.2. Effect Of The Activation Function

In each set of network simulations, the sigmoid01 function and the sigmoid11 function both outperform the standard sigmoid function. They never result in the network reaching a state where it no longer appears to be learning, as occurs when the standard sigmoid is employed with a fully connected architecture. For the smaller subsets of letters the sigmoid11 function achieves the best performances in most cases. This is not the case for the much harder task of learning to recognize the whole alphabet. As mentioned in section 4.6.3 there is a definite pattern. The sigmoid11 function gives the best performances with the fully connected architecture and the sigmoid01 function gives the best results for the scaly architecture.

Networks that use different sigmoid functions are known to learn at different rates. The non-symmetric sigmoid01 function does not learn as efficiently as symmetric functions such as the sigmoid11 [56]. The sigmoid01 function gives a non-zero mean activation, which means the network initially has to spend time pushing its biases into a meaningful range. Differences in performance achieved by the activation functions may be due to the fact that the networks are converging at different rates and have not fully converged when training is ended.

For both the sigmoid11 and sigmoid01 functions, with small sets of letters, the learning curves are fairly smooth with rapid early learning and little fluctuation up and down. The standard sigmoid learning curves fluctuate up and down, which results in slow early learning, lower performance rates and higher error levels compared with the other two functions. When trained to recognize the whole alphabet, it can be seen that the learning curves for the network are no longer smooth but fluctuate up and down. It would appear therefore that the increased number of categories to be recognized leads to greater fluctuations.


4.7.3. Effect Of Learning Rate

There does not appear to be any discernible pattern to the performance and error achieved when different values of the learning rate e are used. Overall, e=0.01 gives the highest performance rates coupled with low error levels but there are occasions when the other learning rates give equal or greater performance and equal or lower error levels. There is often only a small difference between the performances and errors achieved using the different learning rates. It is advisable to find which other methods and parameters work best for the task of recognizing the letters of the alphabet and then optimize that network's performance by varying the learning rate.


4.7.4. The Time Alignment Method

For the task of learning to recognize the smaller subset of letters, the dynamic time warping algorithm achieves the best results overall with its slope set to 1. When examined more closely it can be seen that the trace segmentation algorithm does achieve performances equivalent or very close to those of the dynamic time warping. In most cases the difference in performance is less than 1%. When the networks are trained for the much harder task of recognizing the whole English alphabet the trace segmentation outperforms the dynamic time warping algorithm. This is a very desirable result since the trace segmentation algorithm is much simpler than the dynamic time warping algorithm. This means that it offers a saving in computational cost over the dynamic time warping as was demonstrated in chapter 2 (section 2.2) without a significant drop in performance.
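To illustrate why trace segmentation is computationally light, the sketch below shows one common formulation: the utterance is treated as a path through feature space and is resampled at points equally spaced along that path. It is an assumption that this matches the variant used in this work; the function name and the use of linear interpolation are illustrative choices.

```python
import math


def trace_segmentation(frames, n_out=35):
    """Resample a sequence of feature vectors to n_out vectors.

    The output vectors are equally spaced along the path ("trace") that the
    utterance follows through feature space, using linear interpolation.
    This is an assumed formulation, not necessarily the one used here.
    """
    # Cumulative path length up to each input frame.
    cumulative = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        cumulative.append(cumulative[-1] + math.dist(prev, cur))
    total = cumulative[-1]
    if total == 0.0 or n_out == 1:
        return [list(frames[0]) for _ in range(n_out)]

    resampled, j = [], 0
    for k in range(n_out):
        target = total * k / (n_out - 1)
        # Advance to the input segment that contains the target length.
        while j < len(frames) - 2 and cumulative[j + 1] < target:
            j += 1
        span = cumulative[j + 1] - cumulative[j]
        frac = 0.0 if span == 0.0 else (target - cumulative[j]) / span
        resampled.append([a + frac * (b - a)
                          for a, b in zip(frames[j], frames[j + 1])])
    return resampled
```

The work done is a single pass over the input frames followed by a single pass over the output frames, and no reference pattern is needed. Stationary stretches of the utterance contribute little path length and are therefore compressed.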

The trace segmentation algorithm also offers the advantage of not requiring a reference pattern vector. There is an inherent danger in the choosing of a reference pattern since it should be ensured that this is not a peculiar example of the particular utterance. Input utterances should be warped against a reference which represents a near average example of the features of the letter in question. This is not well addressed in the procedure described in this chapter and better results may have been achieved by more consideration of this issue. In a practical implementation all input utterances to be recognized would have to be warped against a reference utterance for all possible categories that the utterance might belong to. This adds a much greater degree of complexity to the system, which would not be the case in a trace segmentation based system.
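The dependence of dynamic time warping on a reference pattern can be seen in the following sketch of a basic, unconstrained DTW. It omits the slope constraint used in the experiments, and the function name and the rule of taking one utterance frame per reference frame are assumptions made for illustration.

```python
import math


def dtw_warp(utterance, reference):
    """Warp utterance onto the time axis of reference with basic DTW.

    Returns one utterance frame per reference frame. No slope constraint
    is applied, so this is a simplification of the algorithm in the text.
    """
    n, m = len(utterance), len(reference)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning the first i utterance
    # frames with the first j reference frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(utterance[i - 1], reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack along the cheapest path, noting which utterance frame is
    # matched to each reference frame (the earliest match is kept).
    mapping = {}
    i, j = n, m
    while i > 0 and j > 0:
        mapping[j - 1] = i - 1
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: cost[s[0]][s[1]])
    return [utterance[mapping[k]] for k in range(m)]
```

The cost table has one entry for every pair of utterance and reference frames, and in a practical recognizer the whole computation would be repeated once per candidate category, which is the source of the extra complexity noted above.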

As mentioned in section 3.2.1, the same database was used to test a neural network being used in conjunction with a linear time alignment algorithm by Woodland [61]. The best performance achieved in recognizing the whole alphabet when the database was used in multiple speaker configuration was 91.0% on the test set. This result was achieved when a maximum weight value limit of 1.5 was employed. Without using a weight limit a performance of 84.4% was achieved with 25 hidden nodes and 87.3% with 50 hidden nodes. The best performance achieved on the test set in the work described in this chapter was 87.77% with the trace segmentation algorithm and a fully connected neural network architecture. The neural network simulation software being used did not allow the implementation of a maximum weight limit, so the results obtained should be compared to those of Woodland with no maximum weight limit; on that basis comparable performance is achieved.


4.8 Conclusions

The scaly type architecture neural network has been shown to be suitable for the recognition of isolated words in the form of letters of the English alphabet achieving high recognition rates for small vocabularies. When it came to the harder task of recognizing the whole English alphabet the scaly architecture could not compete with the fully connected architecture. Many more permutations of the scaly architecture than were looked at here are available and further investigation of the scaly architecture is carried out in chapter 5.

The trace segmentation was shown to be a good choice for use as a time alignment algorithm with the neural network. It offers high computational saving over the dynamic time warping algorithm with very little or no drop in performance. In some cases it actually outperforms the dynamic time warping algorithm.

As mentioned previously no clear pattern was identified to indicate which value of the learning rate e gives the best results so in chapter 5 the other parameters will be optimized first and then the best learning rate determined.

The sigmoid functions sigmoid11 and sigmoid01 both perform better than the standard sigmoid function in all cases. Overall, the sigmoid11 function gives the best performances but there are occasions where the sigmoid01 function achieves close to or better performances so no conclusion can be drawn as to which is the best choice.

The subsets of letters {A & B} and {A, B & C} did not provide clear indications as to which parameters would perform best on the much harder task of recognizing the whole English alphabet. There may have been too much of a jump in the difficulty of the task. In chapter 5 a different subset of letters is used to see if they provide a better indication.

Recognition of the letters was carried out in multiple speaker mode as explained in section 4.2. For the work described in chapter 5, the data will be used in speaker independent mode since this is much more useful in real world situations. In this mode the test data presented to the network has been spoken by speakers never before heard by the network. This is a much more practical implementation for future use.

