Karaoke Trainer

Swarna Solanki and Omar Awan


A device was designed to simulate music therapy sessions, a specialized form of speech therapy in which song is used to motivate speech.  The device has been tailored to the needs of a specific client, who has severe mental retardation and can only meet with his therapist a few times a week.  Major aspects of the device include a microcontroller that controls an MP3 playback module, a voice detection system to confirm that the client has vocalized, and positive feedback and encouragement for motivation.  The final product has a simple interface to facilitate independent use.  Additionally, the device was designed to be portable and durable so that it may be moved to a variety of locations.  This device has great potential to improve the quality of life of our client: after he used it for only a short time, there was a noticeable increase in his vocalization outside of singing with the device itself.


Keywords: voice detection; speech therapy; microphone; music therapy


Our client is a 19-year-old male who has severe mental retardation, which causes limited dexterity, visual impairment, and limited verbal capabilities.  The client has become motivated to practice vocalizing through music therapy, whereas other forms of speech therapy were ineffective.  Since his therapist can only work with him a few times a week, our device is meant to simulate the music therapy sessions so that he may practice speaking independently and on his own time.  It should employ a simple interface, preferably one comparable to the iPod shuffle he is already familiar with, so that he may use it without much supervision.

A music therapy session involves the client’s therapist singing or playing a familiar song and stopping before a word or syllable that he is expected to verbalize.  Once he verbally finishes the line, the therapist continues singing with him.  If he does not respond, the therapist encourages him to finish the line before moving on.  The goal is to develop a device that simulates the actions of the therapist.  Speech therapy devices currently exist, but they rely on picture recognition to motivate speech; few, if any, use music for encouragement.  Our device is tailored specifically to our client’s needs and wants, making it more effective for him than any device on the market.


Figure 1.  A photo of the final device from the top (face plate).  Visible are the On/Off button at the top left and the sensitivity adjuster at the top right.  Along the left side of the device is the volume adjustment knob; along the right side are the microphone (upper) and audio (lower) jacks.  On the face plate itself are the LED array switch (upper left), the LED array (left), and the Next button (right).

At the beginning of the project, it was unclear whether the client responded to music therapy because of the social interaction or for the enjoyment of continuing to hear the music itself.  We mimicked a typical therapy session by pressing the pause and play buttons of a digital music player; play would only continue after the client vocalized.  After just a few minutes of coaxing from his therapist and mother, he learned to respond to the recorded song.  After observing the therapist, we determined that the best positive feedback was simply to continue the song, and that the best way to encourage vocalization from our client was to repeat the last few words before his expected response.

Once there was evidence that this device had the potential to be successful with our client, we moved on to developing a functional prototype which would automatically pause the music, and only continue after it detected that the client vocalized.  The major components were the music player, control for music player, and voice detection circuit.  The final device is shown in Figure 1.

Playback Module

The Rogue Robotics uMP3 Playback Module was used as the music player.  The module stores songs on an SD card and can be commanded by a microcontroller to play specific tracks.  Each song was stored in its own folder on the SD card, broken into separate tracks according to the client’s expected responses.  The audio output is sent to a jack that accepts an external speaker.  The volume control knob is on the device and is easily accessible.

Next Button

Figure 2.  Flowchart of the microcontroller algorithm.  An outer loop sequences through the songs.  When a song is chosen, an inner loop iterates over the parts of that song: each part is played, and when it finishes, the PIC waits for a response.  There are two conditions: no response, and a detected response.

A Next button was added to the circuit as an input to the microcontroller, a programmable integrated circuit (PIC).  The PIC only honors the Next button before the second track of a song begins; this allows freedom of song choice while ensuring the client practices a song in its entirety once he commits to singing it.  The button is placed in an easily accessible location for simple song selection.
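The Next-button rule can be stated as a one-line check.  This is a minimal Python sketch of the gating logic, not the actual PIC firmware; the zero-based track-index convention is our assumption:

```python
def next_button_enabled(track_index):
    """Honor the Next button only before the second track (index 1)
    begins, so a song is completed once the client commits to it.
    Track indices are assumed to start at 0."""
    return track_index < 1
```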

Programmable Integrated Circuit

The program algorithm for the PIC is given in Figure 2.
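As a rough illustration, the inner loop of Figure 2 for a single song can be sketched in Python.  The callbacks `play_track`, `wait_for_response`, and `prompt` are hypothetical stand-ins for the uMP3 playback commands, the voice detection polling, and the repeated-phrase encouragement; the real firmware runs on the PIC:

```python
def run_song(tracks, play_track, wait_for_response, prompt):
    """Play a song split into tracks, pausing after each track
    until the client vocalizes; re-prompt if no response."""
    for track in tracks:
        play_track(track)
        # Hold here until a vocalization is detected.
        while not wait_for_response():
            prompt(track)  # repeat the last few words as encouragement
```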


Testing of the prototype was successful, but it only worked well with the electret microphone around which we initially designed the circuit.  With a headset microphone, which was our goal for the final device, the voice detection system could not register the verbal response, because the headset microphone is less sensitive.  Thus, for the final circuit we needed to eliminate the dependence on any specific microphone by making the sensitivity of the voice detection system adjustable.

Voice Detection Circuit

Figure 3.  Block diagram of the final voice detection circuit.  It is the same as the previous design, except that the instrumentation amplifier now sends its signal to a comparator with a variable threshold, which in turn triggers a retriggerable one-shot.

The voice detection circuit went through several designs including different filters and amplifiers until it had been optimized.  The voice detection circuit is shown in Figure 3.

The instrumentation amplifier amplified the microphone signal 100x, yielding a signal of about 0.5 V, and removed background noise.  Vocalization into the microphone produced a signal of greater amplitude than background noise and of greater duration than hand claps or finger snaps.  Thus, to decide whether the amplifier output represented vocalization rather than one of these other sounds, we checked both the amplitude and the duration of the signal.
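The amplitude-plus-duration decision can be illustrated with a small Python sketch.  The sample values and thresholds below are hypothetical; in the device itself the amplitude check is performed by the comparator and the duration check by the PIC:

```python
def is_vocalization(samples, amp_threshold, min_duration):
    """Classify a sampled signal envelope as vocalization: it must
    exceed the amplitude threshold for at least min_duration
    consecutive samples (claps are loud but brief)."""
    run = 0
    for v in samples:
        run = run + 1 if v > amp_threshold else 0
        if run >= min_duration:
            return True
    return False
```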

The comparator allows the user to adjust the sensitivity to match the microphone in use: the amplifier output does not have to reach the amplitude of a digital high (~1.5 V) to trigger the one-shot, since any signal that crosses the comparator threshold will do so.  Using a potentiometer, the user sets this reference voltage, between 0.2 and 1.2 V, based on the response of his or her microphone.  The LED array provides a visual comparison between the actual amplitude of the instrumentation amplifier output and the reference voltage at the comparator input.  Whenever the amplifier output rises above the reference voltage, the comparator triggers the one-shot.  This manual sensitivity adjustment ensures the device registers the client’s response, and it is an improvement over simply increasing the gain: because both overly sensitive and insensitive circuits are undesirable, it lets the sensitivity be set at an optimal level.  The sensitivity control was made slightly inaccessible to avoid accidental adjustment.
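A sketch of the adjustable threshold, assuming a linear mapping of the potentiometer position onto the 0.2–1.2 V range (the linear mapping is our assumption; only the voltage range comes from the design):

```python
def set_reference(pot_fraction):
    """Map a potentiometer position (0.0-1.0) linearly onto the
    0.2-1.2 V reference range at the comparator input."""
    pot_fraction = min(max(pot_fraction, 0.0), 1.0)  # clamp to valid travel
    return 0.2 + 1.0 * pot_fraction

def comparator(amplifier_out, v_ref):
    """Output high whenever the amplified microphone signal
    exceeds the user-set reference voltage."""
    return amplifier_out > v_ref
```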

The retriggerable one-shot outputs a digital high only when the signal from the instrumentation amplifier is above the comparator threshold.  It also smooths out the high-frequency components of the signal by holding a constant 5 V output for as long as sufficiently high peaks keep arriving from the microphone.  The PIC then determines whether that 5 V output lasted long enough to represent vocalization.
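The retriggerable behavior can be modeled in discrete time as follows.  The `hold` length is a hypothetical parameter of the sketch; in hardware it is set by the one-shot's timing components:

```python
def one_shot(triggers, hold):
    """Retriggerable one-shot: output stays high for `hold` samples
    after each trigger, so closely spaced peaks merge into one
    continuous high pulse."""
    out, remaining = [], 0
    for t in triggers:
        if t:
            remaining = hold  # a new trigger restarts the timer
        out.append(remaining > 0)
        if remaining:
            remaining -= 1
    return out
```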


We evaluated the practical application of our device through testing sessions with the client, and changes were made according to the results of these sessions.  Our final device met all of the critical design specifications set at the beginning of the design process.  The device is portable, durable, and lightweight so that it may be taken to and used at a variety of locations.  It has a simple interface to encourage independent use.  It has an accessible On/Off switch, Next button, volume control, and audio jacks.  The sensitivity adjustment ensures a robust and efficient system.  The device contains the client’s favorite songs and motivates vocalization.  We were not able to implement a design that allowed songs to be added at a later date, but after careful evaluation, all parties agreed that this capability was a very low priority.

Figure 4.  A photo of our client using the device on the delivery date.  He is wearing a headset with a microphone to respond to the device, and the music is played through an external speaker adjacent to the device.

The final session to test the device gave clear evidence that the Karaoke Trainer can accomplish its goal.  The client enjoyed the device very much and showed signs of sadness when it was taken away.  He immediately responded to the pauses in the song and was encouraged by its continuation.  If the device did not register his response the first time, he would make another attempt that was louder and clearer.  These are extremely positive results, because the main goal of the device was to facilitate improvement in the client’s ability to verbalize outside of his music therapy sessions.  A photograph of the client using the device is given in Figure 4.

A video of our client using the device is given below.


First, we would like to thank our client, his family, and his speech therapist; they have been a great resource and very helpful throughout this project.  We would also like to thank the NSF for funding this project under grant #0453339, which has been tremendous in supporting us as we built this device.  Finally, we thank Dr. Richard Goldberg, Kevin Caves, and Steve Emanuel for their contributions, their problem-solving and debugging assistance, and for procuring the tools we needed to complete this project.

Correspondence should be addressed to:

Omar Awan
102 N Drawbridge Ln.
Cary, NC 27513