Machines Need Linguistics To Understand Natural Language

This post may be a bit more technical than my first, but I wanted to describe the foundations of some of the computational problems I needed to solve in order for a machine to be able to read natural language (i.e., human language). I hope to return to more general topics in the next post.

After finishing my Master’s degree in mathematics in 1999, I switched my focus to computer science to learn more about the technologies that I would need to solve my reading problems. The two technologies that I started with were optical character recognition and speech synthesis.

Optical character recognition (OCR) is the process of converting an image of printed text into words. The printed characters in the image are isolated by identifying the white space separating them on the page. The OCR software then compares each printed character to its database of known characters and their shapes in order to find the closest match. The software acquires its database from humans who have labeled thousands of printed characters with the correct answers. This approach to building a database is called machine learning and is one of the branches of artificial intelligence.
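To give a flavor of the matching step, here is a rough sketch in Prolog (the language I later used for Cynthia). The tiny 3x3 glyphs, the facts, and the predicate names are all made up for illustration; they are not the output of a real OCR system.

    % Toy sketch of nearest-match character recognition.  The 3x3 binary
    % glyphs and predicate names are made up for illustration only.
    % known_glyph(Char, Pixels): a labeled example in the "database".
    known_glyph(e, [1,1,1, 1,1,0, 1,1,1]).
    known_glyph(o, [1,1,1, 1,0,1, 1,1,1]).
    known_glyph(l, [1,0,0, 1,0,0, 1,1,1]).

    % hamming(+Xs, +Ys, -D): count of pixel positions where two glyphs differ.
    hamming([], [], 0).
    hamming([X|Xs], [Y|Ys], D) :-
        hamming(Xs, Ys, D0),
        ( X =:= Y -> D = D0 ; D is D0 + 1 ).

    % recognize(+Pixels, -Char): the labeled glyph with the smallest distance wins.
    recognize(Pixels, Char) :-
        findall(D-C, (known_glyph(C, Ref), hamming(Pixels, Ref, D)), Scored),
        msort(Scored, [_-Char|_]).

For example, the query recognize([1,1,1, 1,1,0, 1,1,0], Char) answers Char = e, because that pattern differs from the stored "e" in only one pixel.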

Speech synthesis is the task of taking text as input and generating audio that we can recognize as speech. Two widely used methods of generating speech are called formant synthesis and concatenative synthesis. Formant synthesis constructs each speech sound (phoneme) by playing different frequencies at the same time. This is similar to the way a musical synthesizer creates the sound of an instrument. Concatenative synthesis produces a word of speech by joining together recordings of a person saying each phoneme used in the word. These speech samples are obtained by recording a person speaking thousands of sentences and then splitting the recordings to isolate small segments of speech. These segments can then be reordered to produce a wide range of sentences that the speaker did not actually say.
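As a rough illustration of the concatenative idea, the Prolog sketch below (with made-up clip filenames and a one-word dictionary) simply looks up the playlist of recordings to splice together in order; real systems also have to smooth the joins between segments.

    % Toy sketch of concatenative synthesis.  The clip filenames and the
    % tiny pronouncing dictionary are made up for illustration only.
    clip(k,  'clips/k.wav').
    clip(ae, 'clips/ae.wav').
    clip(t,  'clips/t.wav').

    % pronunciation(Word, Phonemes): dictionary entry for one word.
    pronunciation(cat, [k, ae, t]).

    % synthesize(+Word, -Clips): the ordered list of recordings to splice together.
    synthesize(Word, Clips) :-
        pronunciation(Word, Phonemes),
        maplist(clip, Phonemes, Clips).

The query synthesize(cat, Clips) answers Clips = ['clips/k.wav', 'clips/ae.wav', 'clips/t.wav'].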

The reading technology that I was looking for would use OCR to take text from any printed source and read the text aloud using a speech synthesizer. To get an idea of the complexity of the text recognition problem, I wrote a rule-based program to analyze a picture of a page and isolate each character. My program then made simple guesses about which characters were on the page. It appeared that improving the accuracy would rely mostly on having more training data (“the correct answers”). I noticed that an image of the letter “e” should be converted to the character “e” no matter where it occurred in the text. That is, the accurate recognition of a character does not depend on the context in which it is used.

Then I turned my attention to the task of producing speech from text. I took a recording of a person reading a passage of text and split it into short audio segments. With this collection of audio clips, I wrote a program to reassemble the sounds in different orders to produce different words. I observed that there are many factors that determine the way each phoneme should sound in a sentence and that many of these factors depend on how and where the phoneme is used. For example, the “l” at the beginning of the word “lamp” has a different sound from the “l” at the end of the word “ball”, partially because the tongue behaves differently in each case. I saw this as being a hard problem but also a very interesting one.

I had never studied linguistics before, so I began taking linguistics courses at UGA in the Spring semester of 2000. In January 2001, I began working seriously on Cynthia, my own speech synthesizer written completely in the programming language Prolog. The first version of Cynthia used a knowledge base I created of all of the speech sounds in English, with facts about how each sound is used in different contexts, like the example with the phoneme “l” in the words “lamp” and “ball”. Also in Cynthia’s knowledge base were rules that I wrote for sounding out pronunciations of words that were not in her dictionary and for assigning correct stress to syllables.
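To give a flavor of what such a knowledge base looks like, here is a simplified sketch; these are not Cynthia's actual facts or rules, and the predicate names are invented for illustration.

    % Simplified sketch of a pronunciation knowledge base; not Cynthia's
    % actual facts or rules.
    % allophone(Phoneme, Context, Variant): which variant of a sound to use.
    allophone(l, word_initial, light_l).   % the "l" in "lamp"
    allophone(l, word_final,   dark_l).    % the "l" in "ball"

    % A crude letter-to-sound fallback for words missing from the dictionary.
    letter_sound(l, l).
    letter_sound(a, ae).
    letter_sound(m, m).
    letter_sound(p, p).

    sound_out(Word, Phonemes) :-
        atom_chars(Word, Letters),
        maplist(letter_sound, Letters, Phonemes).

The query sound_out(lamp, P) answers P = [l, ae, m, p]; real letter-to-sound rules also have to handle context, silent letters, and syllable stress.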

I saw the power of these rules (called phonological rules) for manipulating the sounds of the phonemes in context. I wrote phonological rules for reading text with a Southern drawl. For example, Cynthia could convert the text “did you eat yet” to the pronunciation “dih-juh-ee-cheht”. In Southern mode, Cynthia would also pronounce “Bill” as “bee-yuhl”. I connected Cynthia to a version of Eliza (the famous 1966 chatterbot) and debuted Cynthia publicly in March of 2001 at an on-campus computer show. Visitors were able to type messages to Cynthia and listen to her speak the responses generated by Eliza with or without the Southern accent.
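A phonological rule of this kind can be written as a rewrite over a list of phonemes. The sketch below is a simplified illustration (not Cynthia's actual rule set) that coalesces “d”+“y” into “j” and “t”+“y” into “ch”, which is roughly how “did you eat yet” becomes “dih-juh-ee-cheht”.

    % Simplified sketch of a phonological rewrite rule; not Cynthia's actual rules.
    % Coalesce d+y into "j" and t+y into "ch" across word boundaries.
    drawl([], []).
    drawl([d, y | Rest], [j | Out])  :- !, drawl(Rest, Out).
    drawl([t, y | Rest], [ch | Out]) :- !, drawl(Rest, Out).
    drawl([P | Rest],    [P | Out])  :- drawl(Rest, Out).

Running drawl([d,ih,d, y,uh, iy,t, y,eh,t], Out) yields [d,ih,j,uh, iy,ch,eh,t], which reads out roughly as “dih-juh-ee-cheht”.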

As useful as phonological rules were, I quickly found their limitations. I wanted to add intonation to Cynthia’s speech but intonation does not depend just on what phonemes are being said, or even what words are being said. The way we inflect a sentence depends on what phrases we use and what those phrases mean in the sentence. Writing a computer program to understand sentences on this level is beyond the scope of rules to convert the words “eat yet” to the pronunciation “ee-cheht”. If you found yourself saying “I like this one but I really like that one”, then you may unconsciously put extra stress on the words “really” and “that”. In order for a machine to produce that stress, it would need to understand that you were making a contrast.

Natural language processing (NLP) is the branch of computational linguistics and artificial intelligence that attempts to give a machine the ability to analyze natural language and its structure in order to make practical and useful decisions. It seemed clear to me that the biggest barrier keeping us from having machines that could solve these language problems was not a need for faster machines and more data; we needed better computational models of natural language. This was what I needed to focus on.


8 Replies to “Machines Need Linguistics To Understand Natural Language”

  1. Linguistics is necessary for NLP just as much as chemistry is necessary for computational chemistry, or astronomy is necessary for computational astronomy. We must beware of the notion that a few machine learning “breakthroughs” are going to make all specialized knowledge unnecessary. Computation doesn’t work that way.

    1. Indeed, linguistics makes the difference between understanding meaning and structure in text and mining for data.

  2. What I appreciate even beyond the technological genius behind Cynthia and your ability to create programs that synthesize speech is your ability to explain it to others. Thank you for that! I’m very interested in reading future installments of your blog.

  3. Well explained; true genius backed by the intense work to prove it. Bravo, Dr. Hollingsworth. You continue to inspire me!
