In Star Trek IV: The Voyage Home (1986), the ship’s crew travels back in time to 1986 in an attempt to save whales from their later extinction in the 24th century. In an early scene, Scotty, the chief engineer, attempts to use a computer to show a 20th-century researcher how to design a new material. He tries to access the computer (an old Macintosh) by saying, “Hello, computer.” When nothing happens, the ship’s physician hands him a mouse. Thinking it’s a microphone, Scotty speaks into the mouse, repeating, “Hello, computer.” When the computer still doesn’t respond, the 20th-century scientist tells him to just use the keyboard. Reaching for the keyboard, Scotty replies, “The keyboard: how quaint!”
In the nearly three decades since that movie was filmed, voice interaction with computers has become pervasive with the advent of voice-recognition and natural-language interface tools such as Siri, Skyvi, Sherpa, ViaVoice, and Dragon NaturallySpeaking. However, this technology is not without its flaws.
There are many challenges in so-called natural-language processing, as opposed to the simpler, limited-vocabulary voice interaction you would find on automated answering services, e.g., “Please say or press the number one.” Some tests have suggested that Siri may misinterpret requests nearly 40 percent of the time. For example, a phrase that is easy for humans to interpret, “Where is Elvis buried?” is interpreted by Siri as a request for the location of a person named “Elvis Buried.”
When creating an effective voice/computer interface, it is necessary to translate sound into text. In principle, this is a straightforward task in signal processing and pattern recognition. A microphone converts the sound of your voice from an analog signal into a digital signal stream. Signal-processing algorithms (e.g., time-domain and frequency-domain transformations) are applied to enhance the signal and extract representative features or parameters, which in turn are mapped to the specific words the acoustic signal represents. Mapping the signal representation to a specific word is performed via pattern-recognition and machine-learning techniques.
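The pipeline just described can be sketched in a few lines of Python. This is a deliberately minimal illustration, not a real recognizer: the “audio” here is a pair of pure tones standing in for two spoken words, the features are raw DFT magnitudes, and the “pattern recognition” is nearest-neighbor matching against stored templates. (Real systems use FFTs, perceptual feature scales such as mel filter banks, and statistical models.)

```python
import cmath
import math

def dft_magnitudes(frame):
    """Naive discrete Fourier transform; returns the magnitude spectrum.
    (Production systems use an FFT plus perceptual feature scales.)"""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) for k in range(n // 2)]

def features(signal, frame_size=64):
    """Slice the digitized signal into frames; one spectral feature
    vector per frame."""
    return [dft_magnitudes(signal[i:i + frame_size])
            for i in range(0, len(signal) - frame_size + 1, frame_size)]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(frame_feats, templates):
    """Map a feature vector to the nearest stored word template --
    a toy stand-in for the statistical models real recognizers use."""
    return min(templates, key=lambda w: distance(frame_feats, templates[w]))

# Toy 'digitized audio': two pure tones standing in for two spoken words.
def tone(freq, n=64):
    return [math.sin(2 * math.pi * freq * t / n) for t in range(n)]

templates = {"hello": dft_magnitudes(tone(4)),
             "computer": dft_magnitudes(tone(11))}

utterance = tone(4) + tone(11)   # 'hello' followed by 'computer'
words = [classify(f, templates) for f in features(utterance)]
print(words)                     # -> ['hello', 'computer']
```

Each stage of the sketch mirrors one sentence of the description above: digitization, frequency-domain transformation, feature extraction, and mapping features to words.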
While signal processing has a long and successful history, this transformation can be difficult in noisy environments such as crowds or airports, or when processing homophones, e.g., accept/except, affect/effect, advise/advice, to/too/two, etc.
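One common remedy for homophones is statistical context: a recognizer scores each candidate spelling against a language model and keeps the likeliest. The sketch below illustrates the idea with invented bigram counts; a real system would draw these from a model trained on a very large corpus.

```python
# Toy bigram counts standing in for a trained language model.
# All counts here are invented for illustration.
bigram_counts = {
    ("i", "accept"): 40, ("i", "except"): 1,
    ("ordered", "two"): 30, ("ordered", "too"): 2, ("ordered", "to"): 5,
    ("your", "advice"): 25, ("your", "advise"): 1,
}

# Candidate spellings that sound alike.
homophones = {
    "two": {"two", "too", "to"},
    "advice": {"advice", "advise"},
}

def disambiguate(prev_word, candidates):
    """Pick the homophone seen most often after prev_word."""
    return max(candidates, key=lambda w: bigram_counts.get((prev_word, w), 0))

print(disambiguate("ordered", homophones["two"]))    # -> two
print(disambiguate("your", homophones["advice"]))    # -> advice
```

The same context trick fails, of course, when the surrounding words are themselves ambiguous, which is one reason noisy input degrades recognition so quickly.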
Once the computer has successfully (or not so successfully) processed a sequence of words, it must then parse those words into phrases and sentences using rules of syntax and finally into meaning (the semantics). Further complicating matters, English exhibits a high degree of variability, making it a very challenging language for computers to process, as linguist Steven Pinker points out in his TED talk on human language, “What our language habits reveal.” In the talk, Pinker compares challenges in the interpretation of language with perceiving optical illusions.
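The syntax step can be illustrated with a toy context-free grammar and an exhaustive parser. The grammar, sentence, and function names below are invented for illustration; real parsers use far richer grammars plus statistical disambiguation. Note that even this tiny grammar yields two parses of one sentence, depending on who has the telescope:

```python
# A toy context-free grammar: a symbol not listed as a key is a terminal word.
grammar = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["NP", "PP"], ["N"]],
    "VP":  [["V", "NP"], ["VP", "PP"]],
    "PP":  [["P", "NP"]],
    "Det": [["the"]],
    "N":   [["man"], ["telescope"], ["I"]],
    "V":   [["saw"]],
    "P":   [["with"]],
}

def parses(symbol, words):
    """All parse trees deriving the word tuple `words` from `symbol`."""
    if symbol not in grammar:                      # terminal word
        return [symbol] if words == (symbol,) else []
    return [(symbol, *children)
            for rule in grammar[symbol]
            for children in expand(rule, words)]

def expand(rule, words):
    """All ways the symbols of `rule` can jointly cover `words`."""
    if not rule:
        return [[]] if not words else []
    head, rest = rule[0], rule[1:]
    return [[tree] + tail
            # every remaining symbol must cover at least one word
            for i in range(1, len(words) - len(rest) + 1)
            for tree in parses(head, words[:i])
            for tail in expand(rest, words[i:])]

sentence = ("I", "saw", "the", "man", "with", "the", "telescope")
trees = parses("S", sentence)
print(len(trees))   # -> 2: 'saw [the man] [with the telescope]'
                    #    vs 'saw [the man [with the telescope]]'
```

The two trees correspond to two different meanings of the same word sequence, which is exactly the point: syntax alone cannot decide between them; that takes semantics and context.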
To be successful at natural-language processing, systems must also be able to interpret the context surrounding sentences and paragraphs. Without skipping ahead, read the following paragraph, suggested by M. Klein (“Context and Memory,” in L. T. Benjamin, Jr. and K. D. Lowman (eds.), Activities Handbook for the Teaching of Psychology, 1981), and try to make sense of it. Note that while each sentence is in perfect English, it is difficult to understand what the paragraph is about.
A newspaper is better than a magazine. A sea shore is a better place than the street. At first it is better to run than to walk. You may have to try several times. It takes some skill but is easy to learn. Even young children can enjoy it. Once successful, complications are minimal. Birds seldom get too close. Rain, however, soaks in very fast. Too many people doing the same thing can also cause problems. One needs lots of room. If there are no complications, it can be very peaceful. A rock will serve as an anchor. If things break loose from it, however, you will not get a second chance.
With the addition of a single word, “kite,” the passage falls into place as completely coherent and natural. It is very challenging to represent context for computer-based processing of language, and it is difficult for computers to understand everyday “real-world” knowledge. For a computer to “understand” the phrase, “George Washington threw a silver dollar across the Potomac,” it must know that George Washington was our first president, that the Potomac is a river in Maryland, and that a silver dollar is a coin.
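One classic (and much-simplified) way to give a system such world knowledge is a store of subject-relation-object facts it can consult. The fact base and checks below are invented for illustration; real knowledge bases hold millions of such facts and support far richer inference.

```python
# Toy fact base of (subject, relation, object) triples, in the spirit of
# the knowledge needed for the Washington example above.
facts = {
    ("George Washington", "was", "the first U.S. president"),
    ("Potomac", "is_a", "river"),
    ("silver dollar", "is_a", "coin"),
}

def is_a(entity, category):
    """Check a simple category fact in the store."""
    return (entity, "is_a", category) in facts

# Before accepting 'threw a silver dollar across the Potomac', a system
# might confirm the object is a small physical thing and the landmark a place.
print(is_a("silver dollar", "coin"), is_a("Potomac", "river"))  # -> True True
```

The hard part, of course, is not the lookup but acquiring and organizing the millions of everyday facts a human takes for granted.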
Finally, human use of language involves not only the transmission of words and phrases from one brain to another, but also modulation by facial expression, tone, and intonation. In text messages, we improvise by using emoticons to give the semblance of a smile, a raised eyebrow, or another expression. But these are poor substitutes for being near another human: hearing their tone, seeing their facial expressions, and observing their body language. Speaking to Siri or any other language-interface software is not the same as talking with another human.
While progress will continue in natural-language processing and its use in human-to-computer interaction, there’s much more work to be done. In the meantime, I must be content with speaking into my mouse.