RESNA 27th International Annual Conference

Technology & Disability: Research, Design, Practice & Policy

June 18 to June 22, 2004
Orlando, Florida


Tone Modeling for Thai Dysarthric Speech Recognition

Nuttakorn Thubthong1 and Prakasith Kayasith2
1Department of Physics,
Chulalongkorn University,
Bangkok, 10330, THAILAND.
2Department of Information Technology,
Sirindhorn International Institute of Technology,
THAILAND.
E-mail: nuttakorn.t@chula.ac.th and p.kayasith@nectec.or.th

Abstract

Tone information in tone languages such as Thai has been shown in many studies to improve the performance of speech recognition systems. In this paper, we propose a new approach that exploits tone features for Thai dysarthric speech recognition. Experiments with three datasets (digits, adverbs, and verbs) are designed to study the effect of tone features on a dysarthric speech recognition system. The experiments show that taking advantage of tone features enhances recognition performance.

Keywords:

Dysarthric Speech, Thai, Tone Models, Speech Recognition  

1. Introduction

Dysarthric speakers usually lose control of their speech articulator system because of neuromotor disorders. The symptoms include poor control of individual articulator movements and poor coordination between articulators. The overall effects on speech are distortions of repetition, respiration, phonation, and prosody; excess resonance (hypernasality); and periods of extraneous silence and non-speech sounds [3]. However, the study of Rupal Patel [4] suggests that people who cannot produce a variety of clear phonemes may still have enough control over their prosodic features, i.e., in our case, tone features. Previous work [2] has already shown that a small-vocabulary speech recognizer, trained individually for children with cerebral palsy and dysarthria, can enhance the quality of communication between the children and unfamiliar conversation partners.

Research on speech recognition for Thai dysarthric speakers is a new area. Thai is a tone language: its phonetic structure is primarily based on the monosyllable, carrying one of five lexical tones: mid, low, falling, high, and rising. Given this special feature of Thai, we are interested in incorporating tone information into the speech recognition system to improve recognition performance for dysarthric speakers.

For practical use of the system, we start our work with a group of children with cerebral palsy. The common characteristics of this study group's Thai speech are distortion or omission of initial and final consonants, together with involuntary insertion of breaths and silences.

This paper presents a method of speaker-dependent Thai word recognition for dysarthric speakers. We propose two main points: (i) using a small set of frames to provide a fast response time for real-time control of assistive devices, and (ii) improving the recognition rate using a tone model.

The organization of the paper is as follows. Section 2 describes our baseline system used in the experiment. Section 3 explains our proposed tone modeling. The experiments and results are given in Section 4. We then conclude our work in Section 5.

2. Baseline System

For a decade, phoneme-based units have been dominant in modeling speech acoustics, since the number of units is small. However, a phoneme spans an extremely short time interval and is therefore not suitable for integrating spectral and temporal dependencies. The focus has thus shifted to larger acoustic contexts such as the syllable, which is an attractive recognition unit for several reasons [5]. We therefore believe that a syllable-based speech recognition system is more robust and more suitable for Thai than a phoneme-based one.

Since not all syllables are of equal duration, we extracted spectral features from a fixed number of frames at time points between 5% and 95% of the syllable duration, with equal step size. In this work, we used 15 frames. For each frame, 12th-order RASTA coefficients [6] computed within a 25-ms Hamming-windowed frame were used, so a syllable is represented by 180 feature parameters. Since a neural network learns more efficiently if its inputs are symmetric around 0, all feature parameters are normalized to lie between -1.0 and 1.0.
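As a minimal sketch of this front end (assuming per-frame RASTA coefficients have already been computed; the min-max scaling shown for the [-1.0, 1.0] normalization is an assumption, since the original equation is not reproduced here):

    import numpy as np

    def syllable_features(rasta_frames, n_points=15, n_coeffs=12):
        """Sample per-frame RASTA coefficients at n_points time points
        spaced evenly between 5% and 95% of the syllable duration,
        giving 15 x 12 = 180 parameters per syllable."""
        rasta_frames = np.asarray(rasta_frames)          # (n_frames, >=12)
        percents = np.linspace(5.0, 95.0, n_points)
        idx = np.round((len(rasta_frames) - 1) * percents / 100.0).astype(int)
        return rasta_frames[idx, :n_coeffs].flatten()    # (180,)

    def minmax_scale(X):
        """Assumed normalization: scale each feature dimension of the
        training matrix X to lie between -1.0 and 1.0."""
        lo, hi = X.min(axis=0), X.max(axis=0)
        return 2.0 * (X - lo) / (hi - lo + 1e-12) - 1.0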

3. Tone Modeling

Since tone information from the rhyme portion (vowel + coda) has been shown to provide a better recognition rate than that from the whole syllable [8], our tone features are extracted from the rhyme portion. The Average Magnitude Difference Function (AMDF) algorithm [7] is applied for F0 extraction with a 60-ms frame size and a 12-ms frame shift. Differences in the excursion size of F0 movements, related to differences in voice range between speakers, are normalized by converting raw F0 values to an ERB-rate scale [9]. Because F0 is a physiologically determined characteristic and is regarded as speaker dependent, z-score normalization is then applied using the mean and standard deviation computed from the raw F0 values of all utterances of each speaker. Since not all syllables are of equal duration, the F0 contour of each syllable is equalized for duration on a percentage scale. The normalized F0 data are then fitted with a third-order polynomial (y = a0 + a1*x + a2*x^2 + a3*x^3). To evaluate changes in the F0 height and slope of each syllable, a time-aligned F0 profile is used: the F0 heights are calculated at five time points from 0% to 100% of the syllable with an equal step size of 25%, and the slopes at these five points are computed from the polynomial coefficients. The five F0 heights and five slopes together give a tone feature vector of dimension 10.
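A minimal sketch of this tone-feature computation (assuming the F0 contour has already been extracted by AMDF and normalized; NumPy's polynomial fit stands in for whatever fitting routine was actually used):

    import numpy as np

    def tone_features(f0_contour):
        """Fit y = a0 + a1*x + a2*x^2 + a3*x^3 to a duration-equalized F0
        contour, then sample F0 height and slope at 0, 25, 50, 75, and
        100% of the syllable: five heights plus five slopes = 10 features."""
        x = np.linspace(0.0, 1.0, len(f0_contour))               # percentage scale
        a = np.polynomial.polynomial.polyfit(x, f0_contour, 3)   # a0, a1, a2, a3
        t = np.array([0.0, 0.25, 0.50, 0.75, 1.0])
        heights = a[0] + a[1]*t + a[2]*t**2 + a[3]*t**3
        slopes = a[1] + 2*a[2]*t + 3*a[3]*t**2                   # dy/dx
        return np.concatenate([heights, slopes])                 # (10,)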

4. Experiments

4.1 Corpus

We built three datasets with a total of 36 isolated words. The first set contains digits, while the other two contain adverbs and verbs used with Thai assistive devices. These words were pre-selected from the set of Thai commands commonly used to control assistive devices, such as "move", "left", "right", "stop", and so on. The datasets were collected from four normal Thai speakers (two male and two female), ranging in age from 6 to 8 years, and four dysarthric speakers (two male and two female), ranging in age from 7 to 13 years. Each speaker read all three datasets in five trials, and recording was done in a regular environment. All speech data were digitized by a 16-bit A/D converter at a 16 kHz sampling rate.

4.2 Experimental setting

We conducted several experiments on all three datasets. Every experiment used a three-layer feedforward neural network with input, hidden, and output layers. The number of input units depended on the number of baseline features plus tone features. The numbers of hidden units were 50, 50, and 100 for the digit, adverb, and verb datasets, respectively, and the numbers of output units were 10, 10, and 16. All feature parameters were normalized to lie between -1.0 and 1.0. The network was trained with the standard back-propagation method, with initial weights set to random values between -1.0 and 1.0. A five-fold cross-validation approach was used: we chose one set as the test set while the other sets served as the training set, and repeated the experiment five times with different test sets.
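As an illustration of one such run, here is a sketch (not the original setup: scikit-learn's MLPClassifier stands in for the authors' back-propagation network, the logistic activation is an assumption, and the arrays are random placeholders for the real 180 baseline + 10 tone features of the adverb dataset):

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import cross_val_score

    # Placeholder data: 200 tokens of 10 adverb classes, 190 features each,
    # already normalized to [-1.0, 1.0].
    X = np.random.uniform(-1.0, 1.0, size=(200, 190))
    y = np.repeat(np.arange(10), 20)

    net = MLPClassifier(hidden_layer_sizes=(50,),  # 50 hidden units (adverbs)
                        activation="logistic",     # assumed sigmoid units
                        solver="sgd",              # plain back-propagation
                        max_iter=2000)
    scores = cross_val_score(net, X, y, cv=5)      # five-fold cross-validation
    print("mean accuracy:", scores.mean())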

4.3 Results

The experiments consist of two separate parts: (i) experiments to find a suitable number of frames for syllable-based speech recognition, and (ii) experiments comparing the performance of the baseline features and the features incorporating tone, for both normal and dysarthric speakers. The first experiment was performed only on the adverb dataset, with five sets of frame features, i.e., those with 3, 6, 9, 12, and 15 frames. For each frame set, we selected frames at time points between 5% and 95% of the syllable duration with equal step size; for example, the 6-frame set was chosen at 5, 23, 41, 59, 77, and 95% of duration. Each frame contributes 12 RASTA coefficients, giving 36, 72, 108, 144, and 180 coefficients for the 3-, 6-, 9-, 12-, and 15-frame sets, respectively. The results are shown in Table 1. For normal speakers, the differences among frame sets are small; for dysarthric speakers, however, the 15-frame set provides the best result. Therefore, we used the 15-frame configuration in the second experiment.

Table 1. Comparison of recognition performance (%) for different numbers of frames.

No. of frames       3       6       9      12      15
Normal          94.00   99.00   99.50   94.00   99.00
Dysarthric      67.00   82.50   86.00   85.00   87.50

Comparisons of recognition rates and error reduction rates for normal and dysarthric speakers are shown in Tables 2 and 3, respectively. Using the baseline features only, the accuracies for the normal speakers are almost 100%, because the system is trained for each person and with a small vocabulary. The performance, however, drops by 10% to 20% for the dysarthric speakers. After incorporating tone information, better results were obtained for both groups, especially for the dysarthric group (about 10% to 40% error reduction rate).
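The error reduction rate is computed over error rates rather than accuracies (our reading, consistent with the tabulated values): %ERR = (E_baseline - E_tone) / E_baseline × 100. For speaker F1 on digits, for example, the baseline error is 100 - 96 = 4% and the error with tone features is 100 - 98 = 2%, giving %ERR = (4 - 2) / 4 × 100 = 50%, as shown in Table 2.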

Table 2. Comparison of recognition performance (%) with baseline features alone and with tone features included, for normal speakers. %ERR is the error reduction rate.

 

        |        DIGIT          |         ADV           |         VERB
        | Baseline +TONE  %ERR  | Baseline +TONE  %ERR  | Baseline +TONE  %ERR
M1      |  100.00 100.00    -   |  100.00 100.00    -   |   98.75 100.00    -
M2      |  100.00 100.00    -   |  100.00 100.00    -   |   97.50  98.75    -
F1      |   96.00  98.00  50.00 |   98.00 100.00 100.00 |   97.50  98.75  50.00
F2      |  100.00 100.00    -   |   98.00  98.00    -   |   95.00  95.00    -
Avg     |   99.00  99.50  50.00 |   99.00  99.50 100.00 |   97.19  98.13  50.00

Table 3. Comparison of recognition performance (%) with baseline features alone and with tone features included, for dysarthric speakers. %ERR is the error reduction rate.

 

        |        DIGIT          |         ADV           |         VERB
        | Baseline +TONE  %ERR  | Baseline +TONE  %ERR  | Baseline +TONE  %ERR
M1      |   76.00  80.00  16.67 |   80.00  84.00  20.00 |   68.75  71.25   8.00
M2      |   92.00  94.00  25.00 |   96.00  98.00  50.00 |   92.50  93.75  16.67
F1      |   98.00 100.00 100.00 |   94.00  94.00   0.00 |   88.75  92.50  33.33
F2      |   80.00  86.00  30.00 |   80.00  80.00   0.00 |      -      -     -
Avg     |   86.50  90.00  42.90 |   87.50  89.00  12.00 |   83.33  85.83  15.00

5. Conclusion

We have proposed a new approach that exploits tone features for Thai dysarthric speech recognition in order to enhance accuracy. From the experiments, we found that the 15-frame configuration performs best, as it captures sizeable parts of each syllable. We also demonstrated that incorporating the tone model improves recognition performance for dysarthric speakers. In the future, we plan to extend this work to a large-vocabulary dataset, and to investigate other prosodic information such as duration and stress modeling.

References

  1. Luksaneeyanawin, S. (1998). "Intonation in Thai", in Intonation Systems: A Survey of Twenty Languages, edited by D. Hirst and A. Di Cristo, pp. 376-394.
  2. Kostov, A., Chen, X., and Beliveau, C. (1997). "Hidden Markov modeling in on-line dysarthric speech recognition", Advancement of Assistive Technology, pp. 195-199.
  3. Yorkston, K.M. (1988). Clinical Management of Dysarthric Speakers, Austin, TX: PRO-ED, Inc.
  4. Patel, R. (1999). "Identifying Information-Bearing Prosodic Features in Severely Dysarthric Speech", Draft Ph.D. Thesis Proposal, Department of Speech-Language Pathology, University of Toronto.
  5. Thubthong, N. and Kijsirikul, B. (1999). "A syllable-based connected Thai digit speech recognition using neural network and duration modeling", in Proc. IEEE Int. Symposium on Intelligent Signal Processing and Communication Systems, pp. 785-788.
  6. Hermansky, H. and Morgan, N. (1994). "RASTA processing of speech", IEEE Transactions on Speech and Audio Processing, 2(4): 578-589.
  7. Ross, M.J., Shaffer, H.L., Cohen, A., Freudberg, R., and Manley, H.J. (1974). "Average magnitude difference function pitch extractor", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-22, pp. 353-362.
  8. Thubthong, N. and Kijsirikul, B. (2002). "An empirical study for constructing Thai tone models", in Proc. 5th Symposium on Natural Language Processing and Oriental COCOSDA Workshop, pp. 179-186.
  9. Hermes, D.J. and van Gestel, J.C. (1991). "The frequency scale of speech intonation", Journal of the Acoustical Society of America, vol. 90, no. 1, pp. 97-102.