Chapter 3 Preprocessing Of The Speech Data3.1 IntroductionAs mentioned in section 2.3, two of the major problems in speech recognition systems have been due to the fluctuations in the speech pattern time axis and spectral pattern variation [51]. Speech is greatly affected by differences in the speaker such as age and sex as well their physical and psychological condition. The pitch and speed of the speaker will be altered by the way they feel at the time. If they are agitated, angry or short of time they will most likely speak at a faster rate and higher pitch than when they are calm and relaxed [23].Speaking rate variation results in non-linear fluctuations in the speech pattern time axis. The length of the input pattern to the neural network in question is constrained by the number of input neurons to the neural network since this type of network architecture cannot be varied once trained. The input pattern vectors must be modified to fit the neural network while still retaining all their discriminating features. Various algorithms are available to perform this time alignment of the input pattern to the neural network and the performance of the neural network is dependent upon the performance of the time alignment algorithm used. In this chapter, the various types of time alignment algorithms are described and their operation is outlined in detail. 3.2 Time Alignment AlgorithmsThe simplest time alignment algorithms are linear. However, they take no account of the importance of the feature vectors within the pattern vector when deleting or duplicating them to shorten or lengthen the pattern vector if required. Important features, therefore, may be lost in the process.On the other hand, non-linear time alignment algorithms are more complicated and involve higher computational expenditure. Their advantage lies in the fact that they recognize important features and attempt to retain these features in the time aligned pattern vector. Two non-linear time alignment algorithms are dynamic time warping (DTW) and trace segmentation (TS). 3.2.1 Linear Time Alignment Linear time alignment algorithms are the simplest algorithms to implement and they can be used for both expansion or compression of the speech pattern vector. There are various ways of implementing linear algorithms but all use the basic method of deleting feature vectors to shorten the speech pattern and duplicating feature vectors to length the speech pattern. An example is to duplicate or delete vectors at regular intervals along the pattern vector until the speech pattern is the correct size. An example of a linear algorithm used in conjunction with a neural network is that of Woodland [61]. Woodland achieved recognition rates of 91% for multiple speaker recognition and 88.3% for speaker independent recognition. These tests were carried out on the BT English alphabet database used in this thesis. 3.2.2 Dynamic Time Warping Dynamic time warping (DTW) for time alignment was introduced by Sakoe and Chiba in 1978 [49] when it was used in conjunction with dynamic programming techniques for the recognition of isolated words. DTW has been used in conjunction with a neural network by Sakoe et al in 1989 [51] for the recognition of isolated words in the form of Japanese digits. The DTW algorithm removes timing differences between speech patterns by warping the time axis of one speech pattern until it maximally coincides with the other. All pattern vectors are warped against a reference pattern vector of the same category which has the same number of feature vectors as there are frames in the input layer of the neural network. After the relevant feature extraction has taken place, speech patterns can be represented as a sequence of feature vectors, Let A be the reference speech pattern and B be the pattern vector to be aligned against A. Figure 3.1 shows A and B developed against the i and j axes. ![]() Consider a warping function F between the input pattern time j and the reference pattern time i, where A measure of the difference between the two feature vectors ai and bjis the distance When the warping function is applied to B this distance becomes where:       b'j is the jth element of B after the warping function has been applied. The weighted summation of these distances on the warping function is ![]()       w(i) is a nonnegative weighting coefficient. E reaches a minimal value when the warping function is determined to optimally align the two pattern vectors. The minimum residual distance between A and B is the distance still remaining between them after minimizing the timing differences between them. The time normalized difference is defined as follows: ![]() Certain restrictions are applied to the warping function to ensure that it approximates the properties of actual time axis fluctuation. This means it should preserve all the significant linguistic features present in the speech pattern being warped. Such properties are monotonicity and continuity [47]. These can be realized by imposing the following conditions on the warping function E . Monotonic conditions Continuity conditions Boundary conditions are imposed as follows; An adjustment window is implemented such that where r is a positive integer. The adjustment window condition is imposed since the time axis fluctuation does not yield excessive timing differences, therefore the algorithm must do likewise. The final constraint imposed is the slope constraint condition. The results of this condition is that if b'j(i) moves forward in one direction, m times consecutively, then it must step n times in the diagonal direction before it can step any further in that direction. This ensures a realistic relation between A and B by ensuring that relatively short segments of one are not mapped to relatively long segments of the other. The intensity of slope constraint is measured as follows; The warping function slope is more rigidly restricted by increasing P but if it is too severe then time normalization is not effective. The denominator of the time normalized distance equation can be defined as: ![]() ![]() Minimization can be achieved by applying dynamic programming principles. There are two typical weighting coefficient definitions which allow this simplification, one for symmetric time warping and one for asymmetric. In symmetric time warping the summation of distances is carried out along a temporarily defined time axis l = i + j . The previous discussion has described asymmetric time warping where the summation is carried out along the i axis warping B to be of the same size as A. In asymmetric time warping the weighting coefficient is defined by the following When the warping function attempts to step in the direction of the j axis the weighting coefficient reduces to 0, since therefore, and when the warping function steps in the direction of the i axis or the diagonal, then, then Applying Dynamic programming principles to the simplified time normalization equation gives the following algorithm for calculating the minimal value of the summation: The dynamic programming equation is The dynamic programming equation for P = 0 is ![]() ![]() For P = 1 the dynamic programming equation will be ![]() For P=2 the dynamic programming equation will be ![]() The entailing result is that, for the initial condition, the first feature vector of B is taken as the first feature vector of the warped pattern vector b1'. Subsequent feature vectors for the warped pattern vector are chosen such that the nth feature vector is that feature vector from the input pattern vector B closest to the nth feature vector of the reference pattern vector. The asymmetric dynamic time warping algorithm only provides compression of speech patterns. This means that a linear algorithm must be used with any speech patterns that need to be expanded. This is acceptable since no feature vectors are being deleted when a linear algorithm is used in this form and there is no danger of losing important features of the speech pattern. 3.2.3 Trace Segmentation Trace segmentation (TS) was introduced by Kuhn et al in 1981 and was used on its own and in conjunction with dynamic programming (DP) methods for isolated word recognition [27]. Two databases were used to train and test the speech recognition systems. The TS algorithm on its own performed worse than DTW on its own yielding error rates of around 10%. When the TS algorithm was used as a preprocessing step to DP it performed better than the DP alone. This method was also found to offer savings in computational expenditure of a factor of 10 or more over the DP algorithm used on its own. TS is based on the assumption that despite timing differences, for speech signals of the same category, fluctuations in the frequency spectrum with time will occur in the same sequence but over different lengths of time. Assuming speech patterns of the same form as A and B in the previous section e.g. and also assuming that each feature vector contains N features which can be represented as a point in N-dimensional space, the speech pattern can therefore be seen as a trace of points in N-dimensional space. Where there is no change in frequency there will be a high density of points and where the frequency changes are rapid the points will be widely spaced. The number of feature vectors in a trace can be reduced by removing those which occur during the stationary portions of the speech pattern. Kuhn et al achieved this by summing the Euclidian distances between successive feature vectors of a pattern vector to give the total length Dof the trace [27]. If F feature vectors are required in the time aligned pattern vector then D is divided into F-1 segments of length L where The most suitable vectors from the pattern vector are selected as follows. The first feature vector of the input pattern vector is taken as the first feature vector of the time aligned pattern vector. Successive feature vectors of the time aligned pattern vector are then chosen such that the Euclidian distance between each of them is as close to L as possible. The final time aligned pattern vector should therefore consist of F feature vectors with Euclidian distances between them of approximately L. Lienard and Soong used the TS algorithm for the recognition of the P-set from the English alphabet which consists of the letters "P', "B", "T", "D", "V" and "Z" spoken by 4 speakers and obtained promising results [30]. Nadeu et al also applied TS to isolated word recognition of the ten Catalan digit words spoken by one speaker [37]. It was found that the recognition did not significantly degrade when TS was applied to the speech signal prior to recognition unless the number of frames removed from the speech signal by the algorithm was very high. Gauvin and Mariani applied the TS algorithm to connected speech recognition comparing it to a linear time alignment algorithm and another non-linear time alignment algorithm, comparing fixed length and variable length versions of the algorithms [13]. The variable length trace segmentation algorithm gave the best recognition results overall The TS algorithm was used in conjunction with a neural network by Demichelis et al [10] for the recognition of isolated digits. They achieved 86.4% performance and compared this to an hidden Markov model (HMM) approach where a 95% recognition rate was achieved. TS is used in conjunction with a neural network for the research described in chapters 4 and 5. In chapter 4 the DTW time alignment algorithm and the TS algorithm are compared when used in conjunction with neural networks. As with the asymmetric DTW, trace segmentation algorithm only provides compression of speech patterns. This means that a linear algorithm must be used with any speech patterns that need to be expanded. As mentioned previously, this is acceptable since no feature vectors are being deleted when a linear algorithm is used in this form and there is no danger of losing important features of the speech pattern. 3.3 SummaryThe concept of time alignment algorithms for preprocessing the speech pattern before it is presented to a neural network was introduced in this chapter. The two types of algorithm available, linear and non-linear, were described. Possible implementations of a linear algorithm were outlined and two non-linear algorithms were described in detail. The first was the widely used dynamic time warping algorithm which has been used both with and without neural networks for speech recognition and the less well known trace segmentation algorithm. Non-linear algorithms take into account the importance of feature vectors when adding or discarding them to change the length of speech patterns. This means they offer less danger of losing important features of the speech signal. For this reason, non-linear algorithms are the main interest of this thesis. The trace segmentation algorithm offers great computational savings over the dynamic time warping algorithm so in chapter 4 the performances of the two are compared to find out if there is any loss in performance incurred by the TS algorithm. |