RESNA Annual Conference - 2019

Audio-Visual Keyword Spotting for Access Technology in Children with Cerebral Palsy and Speech Impairment

Silvia Orlandi1, Jiaqi Huang*1,2, Josh McGillivray*1,3, Fanny Hotzé1, Leslie Mumford1, Tom Chau1,2

1Holland Bloorview Kids Rehabilitation Hospital, 2University of Toronto, 3McMaster University, *These authors contributed equally

INTRODUCTION
Cerebral palsy (CP) is the most common cause of physical disability in children with a prevalence estimated between 2 and 3.5 per 1000 live births [1]. CP is a range of non‐progressive movement and posture disorders resulting from an injury to the developing brain occurring before, during or up to two years after birth [2].

Children with CP and complex communication needs (CCN) may have difficulty manipulating common interface devices due to limited voluntary muscle control and reduced muscle tone [3]. One way to help these children interact meaningfully with their environment is to provide them with an access technology (AT) that improves their communication ability and promotes meaningful participation and engagement. An AT is a form of assistive technology whose key purpose is to translate a user's intention into a useful control signal for a user interface and, eventually, into a functional activity. Conventional motor-based pathways (e.g., mechanical switches) have limited viability for children with CP, who can have difficulty mustering the force required to activate a switch, limited ability to release a switch once activated, and a tendency to activate a switch multiple times [3]. Orofacial gestures (e.g., tongue protrusion; mouth, lip and eyebrow movements) and keyword spotting (for verbal children) are promising access pathways that could circumvent these issues [4-7]. Although there has been outstanding progress in speech recognition techniques during the past thirty years, standard speech recognition methods show a decrease in accuracy of up to 82% for adult speakers with speech impairment, and an even larger decrease is observed in pediatric users [8]. Moreover, automatic lip reading has not progressed as an access technology because of the difficulty of collecting data and developing a personalized method for children. For individuals with limited motor control who use speech as their primary form of communication, a set of spoken keywords used to build the AT control can be an excellent option. Currently, there are very few commercially available articulatory speech recognition programs to help people with CCN. One example, the articulograph device, not only costs over $5,000 but also weighs over 65 kg. Moreover, it requires facial probes for data collection, which can result in user discomfort [9].

Recent studies have applied lip movement analysis to provide robust and accurate speech recognition methods that can facilitate the use of speech-controlled communication devices by dysarthric speakers, who have a neurological articulation impairment [10,11]. Visual speech recognition (VSR) and automatic lip reading strive to achieve utterance recognition by analyzing a speaker's mouth movements in video recordings, without any acoustic input. Howell et al. described a model to recognize lip movements in people with dysarthria, reaching an accuracy of up to 76% on a 1000-word recognition task for a single speaker [10]. Salama et al. combined audio and video inputs for multimodal speech recognition in people with speech impairment, improving recognition accuracy by 7.91% in speaker-dependent experiments compared to pure audio inputs [11]. To address the needs of speakers who lack motor control over the speech articulators and whose audio speech signals offer more confusion than aid, researchers have started to investigate the feasibility of visual-only speech recognition systems [10,12]. Some studies pointed out that speaker-dependent visual speech recognition can achieve a level comparable to acoustic speech recognition, while speaker-independent lip reading, even when improved by incorporating temporal data, has only reached 50% accuracy [10,12]. A personalized calibration for individual speakers (i.e., a speaker-dependent approach) could improve performance, considering that previous methods were independent of the speaker's physiology and of what the speaker does with their mouth when speaking [13].

The purpose of this study is to develop a customized AT using speech and lip movement recognition techniques as an alternative communication pathway. To attain this objective, an audio-visual speech recognition (AVSR) algorithm was implemented to recognize specific keywords (e.g., "next", "go") that enable AT control for computer interaction. The AT was tested with two children with CP and CCN.


METHODS
This was a user case study that took place at Holland Bloorview Kids Rehabilitation Hospital. The project was approved by the Research Ethics Board of the Bloorview Research Institute. Informed written consent was obtained for all participants.

Participants
Two participants with spastic quadriplegic or dyskinetic CP (GMFCS level IV or V) and CCN, aged 7 (participant 1, P1) and 12 (participant 2, P2) years respectively, were recruited to develop a custom keyword recognition AT. Both participants were able to communicate using their own words: P1 presented with moderate-to-severe speech impairment and P2 with mild speech impairment. Audio and video signals were collected from each participant for technology development. The target words (keywords) were selected with the support of parents, occupational therapists, and speech-language pathologists.

Equipment and procedure

A unidirectional headset microphone (Sennheiser ME 3-ew) and a video camera (Sony Handycam DCR-SR88) were used for data collection. Audio and video recordings lasted around 10 minutes and were captured during a single session. Samples were collected during computer interaction using two specific keywords. Each participant was asked to pronounce the appropriate (and previously chosen) keyword corresponding to their intention. For example, P1 used the word "next" to scan through options and the word "go" to select an option. At the same time, one of the researchers pressed a switch to simulate the action for the participant (this was explained to the participant beforehand). Audio and video recordings were manually synchronized by the staff (e.g., by clapping or using light and clicking sounds) at the beginning of each session. All videos were recorded from a frontal view under controlled lighting conditions.


Data analysis
Preliminary video screening was performed to check video quality (i.e., mouth movements properly recorded, correct filming angle, no occlusions). All audio files were automatically segmented to distinguish the participants' speech from other voices and environmental noise. Two researchers labeled the audio recordings using a software interface developed ad hoc to identify target words (keywords) and other speech and sounds. Acoustical features, namely 12 Mel-frequency cepstral coefficients (MFCCs), were extracted from each audio sample and used to train a Hidden Markov Model (HMM) to distinguish the keywords. The video files were processed in parallel with the audio recordings. Videos were first processed to crop out occlusions (e.g., other faces, hands). Then, a software interface was developed and used for audio and video synchronization. Keyword timestamps from the audio recordings were used as a reference when labeling the videos: the software used the audio labels to capture the mouth movements corresponding to the keywords. After the identification of audio and video keyword samples, an existing face tracker (IntraFace) was applied to detect the lip region and track its movements [14]. The algorithm detected the pixel coordinates of each facial landmark. Facial features (93 motion-based and 49 geometric-based features) were then extracted. The features were defined following the current state of the art and included the total area of the teeth region, the ratio of width over height for the inner and outer lip contours, gray level, optical flow, and total mouth area [10,11,15]. The best features were selected using the F-score method to achieve optimal lip-reading classification accuracy and, in turn, improved control of ATs. The method automatically selected the 4 features with the most significant F-scores.
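As a rough illustration of the feature pipeline described above, geometric lip features can be derived from tracked landmark coordinates and then ranked by ANOVA F-score. This is a hedged sketch, not the study's code: the landmark layout, the toy elliptical contour, and the synthetic feature matrix are all hypothetical, and scikit-learn's `SelectKBest` with `f_classif` stands in for the F-score selection step.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

def lip_geometry(landmarks):
    """Compute simple geometric lip features from a (20, 2) array of
    outer-lip landmark coordinates. The layout is hypothetical: indices
    0 and 10 are the mouth corners, 5 and 15 the top/bottom midpoints."""
    width = np.linalg.norm(landmarks[0] - landmarks[10])
    height = np.linalg.norm(landmarks[5] - landmarks[15])
    # Shoelace formula for the area enclosed by the lip contour.
    x, y = landmarks[:, 0], landmarks[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))
    return np.array([width, height, width / height, area])

# Toy contour: an ellipse standing in for a tracked outer lip.
theta = np.linspace(0, 2 * np.pi, 20, endpoint=False)
contour = np.column_stack([3.0 * np.cos(theta), 1.5 * np.sin(theta)])
feats = lip_geometry(contour)
print(feats)  # width 6.0, height 3.0, ratio 2.0, area ~13.9

# F-score (ANOVA) selection on a synthetic feature matrix:
# 150 samples per keyword, only the first two columns informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
y = np.repeat([0, 1], 150)           # two keywords, e.g. "next" vs "go"
X[y == 1, :2] += 1.0                 # make features 0 and 1 discriminative
selector = SelectKBest(f_classif, k=4).fit(X, y)
print(selector.get_support(indices=True))
```

With informative columns this strongly separated, the two discriminative features are reliably among the four selected, mirroring how the F-score method keeps only the most class-separating features.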
Different sets of geometric features (e.g., lip height and width) and kinematic features (e.g., velocity and acceleration) were compared by evaluating the performance of an HMM classifier, trained as a speaker-dependent classifier to detect the keywords (e.g., "next" and "go") from visual features, for each feature set and each participant. Finally, a third HMM classifier was trained using the best audio and visual features: audio and video features were combined in the same feature vector for each word sample and each participant. A 5-fold cross-validation repeated 10 times was used to evaluate and compare the performance of all three classifiers in terms of accuracy, specificity, and sensitivity.
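The evaluation protocol above (5-fold cross-validation repeated 10 times, scored by accuracy, sensitivity, and specificity) can be sketched as follows. This is a hypothetical illustration: the data are synthetic, and a logistic regression stands in for the study's HMM classifiers, since the protocol, not the model, is the point of the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RepeatedStratifiedKFold

# Hypothetical data: one row of pre-extracted features per keyword sample.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = np.repeat([0, 1], 150)             # keyword 1 vs keyword 2
X[y == 1] += 2.0                       # well-separated classes, for the sketch

# 5-fold cross-validation repeated 10 times, as in the study's protocol.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
acc, sens, spec = [], [], []
for train_idx, test_idx in cv.split(X, y):
    # A logistic regression stands in for the study's HMM classifier.
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    tn, fp, fn, tp = confusion_matrix(y[test_idx], pred).ravel()
    acc.append((tp + tn) / (tp + tn + fp + fn))
    sens.append(tp / (tp + fn))        # true positive rate
    spec.append(tn / (tn + fp))        # true negative rate

print(f"accuracy {np.mean(acc):.2f}, sensitivity {np.mean(sens):.2f}, "
      f"specificity {np.mean(spec):.2f}")
```

Averaging over all 50 folds (5 splits x 10 repeats) gives more stable estimates than a single split, which matters with only 150 samples per keyword.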

RESULTS
Table 2.  Classification performance for participant 2 (P2) using audio, visual and audio-visual features

                  Audio classification   Visual classification   Audio-visual classification
Accuracy (%)      99                     74                      99
Sensitivity (%)   99                     77                      99
Specificity (%)   99                     72                      99

A total of 41 sessions were recorded, and 14 videos (6 for P1 and 8 for P2) were eligible for video processing. Two keywords were collected for each participant: "next" and "go" for P1, and "right" and "pick" for P2. For each keyword, 150 samples were identified.

Tables 1 and 2 show the classification performance for P1 and P2, respectively. Different performances were achieved in the cases of mild and moderate-to-severe speech impairment: the audio classification method was more reliable for P2, who had mild speech impairment, whereas the visual classification method was more reliable for P1, who had moderate-to-severe speech impairment.

The comparison of visual classification performance using geometric and kinematic features (alone and in combination) is reported in Table 3.

DISCUSSION
Our results on the recognition of 2 keywords show that AVSR can support speech recognition methods and can be used to develop a novel communication AT for children with CP and CCN. This is the first time lip reading has been applied with children with CP: previous studies extracted visual features from adult participants, and no data on classification performance in pediatric populations have been published. For this reason, we could not compare our results with those of other studies.

Table 3.  Video classification performance using geometric and kinematic features for both participants (P1 and P2)

                  Geometric features   Kinematic features   Geometric and kinematic features
P1 accuracy (%)
P2 accuracy (%)
To the best of our knowledge, this is the first study to merge audio and visual classification methods to build an AT for pediatric use across different levels of speech impairment. Results show that AVSR improves classification performance in the case of moderate-to-severe speech impairment.

The combined audio and visual features classifier showed increased accuracy for P1 compared to both the audio-only and the video-only classifiers. This means that, in the case of more severe speech impairment, audio-visual recognition would identify the keyword the child is saying more accurately, allowing the child to control the AT more easily. The best visual classification performance was obtained using only geometric features, as preliminary findings showed that the movement-based features were not accurate (Table 3). This is probably because the children had poor facial muscle control.

The ultimate AT prototype will use the audio classifier based on the 12 acoustical features to distinguish target words from non-target words. Our preliminary results showed accuracy of up to 80% for P1 and up to 94% for P2 in distinguishing the two keywords from the miscellaneous class (e.g., environmental noise, other voices, other sounds and words produced by the child). The prototype will implement two classifiers: the first to separate target from non-target words, and the AVSR to improve the recognition of the keywords within the set of target words.
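The planned two-stage decision could be wired together along the following lines. This is a hypothetical sketch: the two trained classifiers are stubbed out as plain functions, and none of the names come from the study.

```python
def cascade_decision(audio_feats, visual_feats, is_target, classify_keyword):
    """Two-stage AT control decision: a first classifier rejects
    non-target input (environmental noise, other voices); only target
    words are passed on to the audio-visual keyword classifier."""
    if not is_target(audio_feats):
        return None                    # not a keyword attempt: no action
    return classify_keyword(audio_feats, visual_feats)

# Toy stand-in classifiers, for illustration only.
is_target = lambda a: a[0] > 0.5
classify_keyword = lambda a, v: "next" if v[0] < 1.0 else "go"

print(cascade_decision([1.0], [0.5], is_target, classify_keyword))   # next
print(cascade_decision([0.1], [0.5], is_target, classify_keyword))   # None
```

Gating the expensive AVSR stage behind a cheap target/non-target check also means spurious sounds never trigger an AT action at all, rather than being misread as a keyword.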

Some limitations of this study need to be pointed out. First, our results are limited to a binary classification (only two words were identified for each child), and only 150 samples per keyword were used to test classifier performance. Moreover, further evaluations are required to demonstrate the feasibility of implementing AVSR in an AT suitable for multi-environment and long-term use. Face tracker limitations also present a significant challenge for the use of AVSR with children with CP and CCN: even robust face trackers are very sensitive to position variation, and facial landmark recognition can fail if the face is not in a frontal view. Children with CP and CCN can have difficulty controlling their head movements due to poor muscle control.

Further analysis of our data is needed to build a classifier capable of distinguishing target from non-target lip movements. The identification of a customized threshold to distinguish lip movements from a baseline (a neutral face, considered as the absence of lip movements) will allow us to increase the sensitivity of our classifier. Additionally, future research could focus on the identification of a larger set of keywords.
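One minimal way to realize such a customized threshold is to derive it from lip-opening measurements recorded on the neutral face. The three-sigma rule, the pixel values, and all names below are assumptions for illustration, not the study's method.

```python
import numpy as np

def movement_threshold(baseline_openings, k=3.0):
    """Personalized threshold from neutral-face lip-opening values:
    mean plus k standard deviations of the baseline."""
    return np.mean(baseline_openings) + k * np.std(baseline_openings)

def has_lip_movement(frame_openings, threshold):
    """A clip counts as containing lip movement if any frame's lip
    opening exceeds the personalized threshold."""
    return bool(np.max(frame_openings) > threshold)

# Hypothetical values (pixels): neutral opening ~5, speaking opens wider.
baseline = [5.0, 5.2, 4.9, 5.1, 5.0]
thr = movement_threshold(baseline)          # ~5.35 for these numbers
print(has_lip_movement([5.1, 9.8, 12.3, 7.4], thr))    # True
print(has_lip_movement([5.0, 5.1, 4.95, 5.05], thr))   # False
```

Because the threshold is computed from each child's own neutral-face baseline, it adapts to individual differences in resting lip posture, which is the point of a personalized calibration.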

Despite these limitations and the need for further studies, our findings will be useful for developing a novel AT for communication to support children with severe speech impairment.

CONCLUSION
Access technologies allow individuals who use alternative communication pathways to perform computer-based activities, promoting communication skill development and improved independence. Children with CP and CCN are limited in their AT usage, and a cost-efficient, portable, and robust speech recognition program is in demand to improve the lives of people with CCN. Audio-visual speech recognition has been shown to improve classification performance in the presence of moderate-to-severe speech impairment. In the near future, children with CP and severe speech impairment could use a personalized AT controlled through their speech using audio-visual recognition techniques.

REFERENCES
[1] Colver A, Fairhurst C, Pharoah P. Cerebral palsy. Lancet. 2014 383(9924): 1240-9.

[2] Koman L, Smith B, Shilt J. Cerebral palsy. Lancet. 2004 363(9421): 1617-31.

[3] Chau T, Memarian N, Leung B, Treherne D, Hobbs D, Worthington-Eyre B, Lamont A, Pla-Mobarak M. Home-Based Computer Vision Access Technologies for Individuals with Severe Motor Impairments. Handbook of Ambient Assisted Living. 2012 11: 581-597.

[4] Leung B, Chau T. A multiple camera tongue switch for a child with severe spastic quadriplegic cerebral palsy. Disability and Rehabilitation: Assistive Technology. 2010 5(1): 58-68.

[5] Memarian N, Venetsanopoulos AN, Chau T. Client-centred development of an infrared thermal access switch for a young adult with severe spastic quadriplegic cerebral palsy. Disability and Rehabilitation: Assistive Technology. 2011 6(2): 179-87.

[6] Alves N, Chau T. The design and testing of a novel mechanomyogram-driven switch controlled by small eyebrow movements. Journal of neuroengineering and rehabilitation. 2010 7(1): 22.

[7] Chan J, Falk TH, Teachman G, Morin-McKee J, Chau T. Evaluation of a non-invasive vocal cord vibration switch as an alternative access pathway for an individual with hypotonic cerebral palsy–a case study. Disability and Rehabilitation: Assistive Technology. 2010 5(1): 69-78.

[8] Rudzicz F. Using articulatory likelihoods in the recognition of dysarthric speech. Speech Communication. 2012 54(3): 430-44.

[9] Kohlberg GD, Gal YA, Lalwani AK. Development of a low-cost, noninvasive, portable visual speech recognition program. Annals of otology, rhinology & laryngology. 2016 125(9): 752-7.

[10] Howell D, Cox S, Theobald B. Visual units and confusion modelling for automatic lip-reading. Image and Vision Computing. 2016 51: 1-2.

[11] Salama ES, El-Khoribi RA, Shoman ME. Audio-visual speech recognition for people with speech disorders. International Journal of Computer Applications. 2014 96(2).

[12] Lan Y, Harvey R, Theobald B, Ong EJ, Bowden R. Comparing visual features for lipreading. In International Conference on Auditory-Visual Speech Processing 2009 (pp. 102-106).

[13] Cox SJ, Harvey RW, Lan Y, Newman JL, Theobald BJ. The challenge of multispeaker lip-reading. In AVSP 2008 (pp. 179-184).

[14] Xiong X, De la Torre F. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE conference on computer vision and pattern recognition 2013 (pp. 532-539).

[15] Velasco MA, Clemotte A, Raya R, Ceres R, Rocon E. A Novel Head Cursor Facilitation Technique for Cerebral Palsy: Functional and Clinical Implications. Interacting with Computers. 2017 29(5): 755-66.

ACKNOWLEDGEMENTS
This research was conducted with the support of the Ontario Brain Institute, funded in part by the Government of Ontario. We also acknowledge funding support from a Project Grant awarded by the Research Foundation, Cerebral Palsy Alliance.