A VOICE-ACTIVATED TEXT READER

Max Kupchik

ABSTRACT

A computer program was developed to read text from hard copy sources and web pages for a person with a visual disability. A commercial development kit was integrated into the program to provide scanner input and optical character recognition. Voice activation was selected for the control system.

BACKGROUND

A text reader is a computer program for people with visual disabilities that takes as its input either a hard-copy document or a web page and uses speech synthesis to speak the text of the document. Reading from hard-copy sources requires a scanner and some type of optical character recognition (OCR) software. Several text readers are on the market today, and all of them are controlled through keys on the computer keyboard. None integrates both web page and scanned-document reading with a voice-activated control system.

STATEMENT OF THE PROBLEM

The objective was to design and test a text reader for a person with severely limited vision. The client is classified as legally blind; he can perceive outlines of objects but is unable to read text without a special magnifying device attached to a pair of glasses. The text reader was to provide ways to adjust the speaking rate and volume, pause and resume the reading, and navigate hyperlinks inside web pages.

RATIONALE

The client has difficulty reading magazines, newspapers, and web sites, and wanted a way to reduce the eyestrain caused by the magnifying device, which is made for only one eye. The text reader allowed the client to read documents faster and with greater ease and comfort than he could using the magnifier.

DESIGN

The text reader was constructed from several components. A voice-activated control system was selected because the client was unable to see the individual keys on the keyboard. Keyboard control was especially impractical for Internet addresses, whose entry requires the use of all keys. Instead, an address completion feature was designed: the client would speak a partial or full address, possibly omitting prefixes such as "www" and suffixes such as "dot com", and the program would complete the address. Both speech recognition and speech synthesis were implemented using SAPI, a speech application programming interface standard used to connect to a speech recognition and synthesis engine. Several compatible engines exist on the market, but a freely redistributable one from Microsoft was chosen for financial reasons.
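
The address completion idea can be illustrated with a minimal Python sketch. It is not the program's actual implementation, which the paper does not describe in detail. The function name complete_address, the spoken-word table, the list of known suffixes, and the choice of ".com" as the default suffix are all assumptions made for illustration.

```python
import re

# Hypothetical map from spoken words to URL punctuation (an assumption;
# the vocabulary used by the original program is not documented).
SPOKEN_TOKENS = {"dot": ".", "slash": "/", "dash": "-"}

# Suffixes treated as "already complete" (illustrative, not exhaustive).
KNOWN_SUFFIXES = (".com", ".org", ".net", ".edu", ".gov")


def complete_address(spoken: str, default_suffix: str = ".com") -> str:
    """Turn a partial or full spoken address into a complete URL."""
    words = spoken.lower().split()
    # Replace spoken punctuation words and join everything without spaces.
    text = "".join(SPOKEN_TOKENS.get(word, word) for word in words)
    # Drop a scheme if the recognizer happened to produce one.
    text = re.sub(r"^https?://", "", text)
    # Split the host from an optional path.
    host, sep, path = text.partition("/")
    # Add the omitted suffix and prefix where they are missing.
    if not host.endswith(KNOWN_SUFFIXES):
        host += default_suffix
    if not host.startswith("www."):
        host = "www." + host
    return "http://" + host + sep + path
```

With these assumptions, the spoken input "example" and the spoken input "www dot example dot com" both resolve to the same address, so the user can omit the prefix and suffix without changing the result.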

For scanner input and OCR, a commercial development kit was purchased and integrated into the program. This kit contains facilities for reading and recognizing documents placed on the scanner at an angle or upside down. Additional pre-processing features are provided for image enhancement and color inversion, so that white text on a dark background can be recognized. No comparable kit existed for reading web pages, so an HTML parser was designed using the tools Lex and Yacc, which are commonly used in the construction of compilers.
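
The job of the web page component, separating speakable text and hyperlinks from markup, can be sketched as follows. This sketch uses Python's standard html.parser module in place of the Lex and Yacc parser described above, so it shows the task rather than the original technique. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ReadableTextExtractor(HTMLParser):
    """Collect visible text and hyperlink targets from an HTML page."""

    SKIPPED = {"script", "style"}  # element content that should not be spoken

    def __init__(self):
        super().__init__()
        self.chunks = []       # visible text fragments, in document order
        self.links = []        # (link text, href) pairs
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._href = None      # href of the link currently open, if any
        self._link_text = []   # text fragments inside the open link

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._link_text = []

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "a" and self._href is not None:
            self.links.append((" ".join(self._link_text), self._href))
            self._href = None

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or self._skip_depth:
            return
        self.chunks.append(text)
        if self._href is not None:
            self._link_text.append(text)


def extract(html: str):
    """Return (speakable text, list of (link text, href)) for a page."""
    parser = ReadableTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks), parser.links
```

The text returned by extract would be passed to the speech synthesizer, while the list of links gives a navigation command something to act on.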

DEVELOPMENT

As development progressed, several features were added to make the program easier to use. The first was a "last" command that allowed the client to return to the previous web page visited or the previous scanned document. Another was Internet address confirmation. Since the speech recognition quality for URLs was poor, it became necessary to allow the user to confirm or reject a candidate address after it was read to him. A third feature was a "back" command that would start reading at a section of text before the one that was currently being read. This allowed the user to repeat parts of the document that he did not hear the first time.
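
One way to model the "back" command is to treat the document as a list of sections and keep an index of the section being read. The sketch below assumes that representation. The paper does not say how sections were delimited or stored, and the class and method names are hypothetical.

```python
class ReadingSession:
    """Track the reading position in a document split into sections."""

    def __init__(self, sections):
        self.sections = list(sections)
        if not self.sections:
            raise ValueError("a document needs at least one section")
        self.position = 0  # index of the section currently being read

    def current(self):
        """Return the section at the current reading position."""
        return self.sections[self.position]

    def advance(self):
        """Move to the next section; return False at the end of the document."""
        if self.position + 1 >= len(self.sections):
            return False
        self.position += 1
        return True

    def back(self, steps=1):
        """Step back and return the section where reading should restart."""
        self.position = max(0, self.position - steps)
        return self.current()
```

Clamping the position at zero means that repeating "back" at the start of the document simply rereads the first section.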

EVALUATION

Testing revealed that speech recognition was far from perfect. Entering certain Internet addresses required up to ten attempts. During reading, spurious recognitions would sometimes occur; that is, a command such as "volume down" would be recognized even though the user had not said anything. Although a headset isolated the microphone from the audio feedback of the speech output, some noise from the environment and the user's breath still reached the microphone.

OCR performance depended on the type of document. Most black-on-white documents were recognized with near-perfect accuracy, but recognition of white text on a dark background was considerably worse. Recognition also did not work well for text superimposed on graphics. Although the development kit provided ways to improve accuracy for specialized documents, they were inapplicable to the text reader, as there was no way to anticipate what kinds of input it would encounter in the field. For most sources, such as newspapers, books, and magazines, OCR quality was good.

DISCUSSION

The text reader met its design goals. The client is currently using the program at his home to read many different media. For web pages, however, the text reader is not particularly useful, since it does not handle active HTML elements, which most websites use today. The speech recognition system works well for commands but poorly for Internet address input. Even so, spoken address input is much faster than the only alternative: a single-key control system in which one key iterates through the entire alphabet to put the address together one letter at a time.

Max Kupchik
3 Maywood Dr.
Nashua, NH 03064