Making A Synthesized Voice

During the first two years of my work on my Ph.D. at Cambridge, I continued to work on Cynthia (my own speech synthesizer). I had just finished a productive year in theoretical linguistics where I specialized in acoustic phonetics and speech synthesis, and I had a lot of momentum for adding new features to Cynthia to make her sound more natural. I began to do commercial work and consulting with Cynthia and was working on a contract with a publisher for Cynthia to speak pronunciations for an online dictionary.

Cynthia was using the MBROLA database of prerecorded speech sounds to produce the final audio files containing the speech. MBROLA was a free project distributed by the Faculté Polytechnique de Mons (Belgium). They offered an option to buy a license to use MBROLA commercially. Sometime in 2005 or early 2006, a company bought the MBROLA project from the university. I received a letter from their attorneys informing me that the commercial license was no longer an option and that I could not use MBROLA for any commercial work.

Without a voice for Cynthia, I could not continue my commercial projects or my work on building Cynthia into screen readers. I immediately created a detailed plan for how to produce my own voice for Cynthia from scratch. The creation of a synthesized voice for Cynthia would require four general steps.

1. Create a list of sentences with all English sound combinations

The first step in creating a synthesized voice is to prepare to record a human speaking. All speech sounds (phonemes) that occur in English speech must be recorded. Some sounds are more common in English speech than others. For instance, the sound /T/ as in the word “stop” occurs frequently in English, as does the combination of sounds /S T/ as in “start” and “state”. However, the sound called /ZH/ (using the Arpabet alphabet) occurs at the start of the word “genre” but rarely begins any other English word.

To make the recording process as efficient as possible, a special collection of sentences is constructed to produce all possible phonemes in all possible contexts when spoken aloud. For example, the spoken word “computer” contains three phonemes /K AH M/ for the syllable “com”. The word “company” also begins with the syllable “com” but it sounds different from the “com” in “computer”. The first syllable of “company” is stressed and the first syllable of “computer” is not. Therefore, a natural sounding voice would require separate recordings of the stressed and unstressed versions of the vowel /AH/.
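One way to picture how such a sentence list could be assembled is a greedy selection over candidate sentences. The sketch below is only an illustration and not the procedure used for Cynthia: it assumes each candidate sentence has already been transcribed into Arpabet, it treats adjacent phoneme pairs as the "contexts" to cover, and it marks stress with CMUdict-style digits so that stressed and unstressed vowels count as different sounds. The function names and the tiny word list are hypothetical.

```python
def phoneme_pairs(phonemes):
    """Return the set of adjacent phoneme pairs in one transcribed sentence."""
    return set(zip(phonemes, phonemes[1:]))


def select_sentences(candidates):
    """Greedily pick sentences until no candidate adds a new phoneme pair.

    `candidates` maps sentence text to its phoneme list (assumed input).
    Returns the chosen sentences in pick order and the pairs they cover.
    """
    remaining = {text: phoneme_pairs(ph) for text, ph in candidates.items()}
    covered, chosen = set(), []
    while remaining:
        # Pick the sentence that contributes the most pairs not yet covered.
        best = max(remaining, key=lambda text: len(remaining[text] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # every remaining sentence is redundant
        covered |= gain
        chosen.append(best)
    return chosen, covered


# Hypothetical one-word "sentences"; the digit 1 marks a stressed vowel.
CANDIDATES = {
    "start": ["S", "T", "AA1", "R", "T"],
    "state": ["S", "T", "EY1", "T"],
    "stop": ["S", "T", "AA1", "P"],
    "art": ["AA1", "R", "T"],
}

chosen, covered = select_sentences(CANDIDATES)
print(chosen)  # → ['start', 'state', 'stop']
```

Here “art” is never chosen because “start” already supplies both of its phoneme pairs, which is the sense in which a balanced list keeps the recording session short.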

2. Record a human reading the sentences

The second step in creating a synthesized voice based on a human’s voice is to record a person reading the collection of carefully selected sentences. The sentences are constructed so that every combination of vowels and consonants that is possible in English is represented. A set of 2,000 sentences is enough to capture all of the combinations of English phonemes, but it is not enough to capture every version of every syllable (stressed, not stressed, rising intonation, falling intonation, etc.). A synthesized voice created using a good set of 2,000 sentences will be understandable but may sound a bit rough and of low quality. A set of 6,000 sentences is enough to capture many of the stress and intonation variations of vowels.

3. Split the recordings into individual speech sounds

Once all of the voice recordings have been made, each recording of a sentence must be split into short snippets containing the individual phonemes. For example, the word “table” consists of the phonemes /T EY B AH L/. A recording of the word “table” must be divided into 5 snippets. These snippets will be mixed and matched to produce speech that sounds like the original speaker saying things that were not actually recorded.

Splitting the audio into phonemes is the most time-consuming and costly part of the process of creating a voice. Finding the boundaries accurately requires a human to edit each sound file. It generally takes a trained human one hour to divide one minute of speech manually into individual sounds. A computer can be used to guess where the boundaries between sounds are by looking at changes in volume and audio patterns, but the computer often makes mistakes. The problem resembles automatic scene detection: if you are watching an online video on a website that inserts commercials between the scene changes in the video, then you may notice that the commercial breaks do not always occur in the correct places.
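To give a feel for the kind of guess a computer makes from changes in volume, here is a minimal sketch. It is not the tool I used; it assumes the audio is a plain list of samples between -1.0 and 1.0, measures loudness in fixed 80-sample frames (10 ms at an assumed 8,000 samples per second), and marks a boundary wherever loudness jumps or drops by a chosen ratio. The names `frame_energies`, `guess_boundaries`, `tone`, and the thresholds are all hypothetical.

```python
import math


def frame_energies(samples, frame_len):
    """Root-mean-square loudness of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def guess_boundaries(samples, frame_len=80, ratio=2.0, floor=1e-4):
    """Return sample indices where loudness changes by at least `ratio`.

    `floor` keeps silent frames from causing a division by zero.
    """
    energies = frame_energies(samples, frame_len)
    boundaries = []
    for i in range(1, len(energies)):
        prev = max(energies[i - 1], floor)
        cur = max(energies[i], floor)
        if cur / prev >= ratio or prev / cur >= ratio:
            boundaries.append(i * frame_len)
    return boundaries


def tone(amplitude, n_samples, period=40):
    """Synthetic sine wave, used only to exercise the detector."""
    return [amplitude * math.sin(2 * math.pi * n / period)
            for n in range(n_samples)]


# Silence, then a loud sound, then a quiet one.
signal = [0.0] * 160 + tone(0.5, 160) + tone(0.05, 160)
print(guess_boundaries(signal))  # → [160, 320]
```

Real speech is much less tidy than this test signal. Two neighboring vowels can have nearly the same loudness, and a single consonant like /T/ contains its own silence followed by a burst, which is why a detector of this kind makes mistakes that a human has to correct.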

4. Teach Cynthia the rhythm and intonation patterns of the human reader

The final step in the process of creating a voice is to create a table of all of the individual sounds that were isolated in the previous step. This table contains information about each sound such as its length (in milliseconds), the pitch of the speaker’s voice (for vowels and some consonants), and the sounds occurring immediately before and after in the full recording.
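As an illustration of what one row of such a table might hold, here is a small sketch. The field names and the timings for the word “stop” are invented for the example and are not taken from Cynthia’s actual data format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Unit:
    """One isolated speech sound and the facts recorded about it."""
    phoneme: str
    duration_ms: float
    pitch_hz: Optional[float]    # None for sounds without a pitch
    prev_phoneme: Optional[str]  # None at the start of the recording
    next_phoneme: Optional[str]  # None at the end of the recording


# (phoneme, start_ms, end_ms, pitch_hz or None)
Segment = Tuple[str, float, float, Optional[float]]


def build_units(segments: List[Segment]) -> List[Unit]:
    """Turn the boundaries found in step 3 into table rows."""
    units = []
    for i, (phoneme, start, end, pitch) in enumerate(segments):
        prev_ph = segments[i - 1][0] if i > 0 else None
        next_ph = segments[i + 1][0] if i + 1 < len(segments) else None
        units.append(Unit(phoneme, end - start, pitch, prev_ph, next_ph))
    return units


# Invented timings for one recording of the word "stop".
stop = [
    ("S", 0.0, 110.0, None),
    ("T", 110.0, 170.0, None),
    ("AA", 170.0, 330.0, 182.0),
    ("P", 330.0, 420.0, None),
]

for unit in build_units(stop):
    print(unit)
```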

A training program learns the patterns of length, pitch, and context well enough that if the speech synthesizer encounters the word “stable”, then it can figure out the correct sounds to use with the appropriate lengths and pitches to produce speech that sounds like the original speaker.
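The training program itself is beyond the scope of this post, but a toy stand-in shows the idea of learning from context. The sketch below is an assumption-laden simplification and not Cynthia’s method: it only averages the lengths seen for each sound in each context (previous sound, sound, next sound) and falls back to the overall average for that sound when the context was never recorded. All names and numbers are hypothetical.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def train_durations(observations):
    """Average durations per full context and per phoneme.

    `observations` is an iterable of (prev, phoneme, next, duration_ms).
    """
    by_context = defaultdict(list)
    by_phoneme = defaultdict(list)
    for prev, phoneme, nxt, duration in observations:
        by_context[(prev, phoneme, nxt)].append(duration)
        by_phoneme[phoneme].append(duration)
    return (
        {key: mean(vals) for key, vals in by_context.items()},
        {key: mean(vals) for key, vals in by_phoneme.items()},
    )


def predict_duration(model, prev, phoneme, nxt):
    """Use the exact context if it was seen, else the phoneme average."""
    by_context, by_phoneme = model
    if (prev, phoneme, nxt) in by_context:
        return by_context[(prev, phoneme, nxt)]
    return by_phoneme[phoneme]  # raises KeyError for an unseen phoneme


# Invented lengths for /T/ in "stop", "start", and "top".
model = train_durations([
    ("S", "T", "AA", 60.0),
    ("S", "T", "AA", 70.0),
    (None, "T", "AA", 110.0),
])

print(predict_duration(model, "S", "T", "AA"))  # seen context → 65.0
print(predict_duration(model, "S", "T", "EY"))  # as in "stable" → 80.0
```

In this toy data the /T/ after /S/ is shorter than the /T/ at the start of a word, so a seen context gives a more specific answer than the fallback. A real training program also has to model pitch and generalize far better than a simple average can.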

I implemented step 4 in Cynthia during my first year at Cambridge as part of my M.Phil. (like a Master’s degree in the U.S.) and the results were of high quality. After Cynthia lost her voice, I composed a plan to complete steps 1 through 3. I finished step 1 (generating a balanced list of 6,000 sentences), but steps 2 and 3 proved to be too costly and time-consuming for me while I was working on a Ph.D. on a different topic.

The FestVox project performed steps 1 through 3 using 2,000 sentences and released their database of sounds around this time. As a proof of concept, I built a voice for Cynthia using their data and Cynthia’s version of step 4, but I did not produce my own voice from scratch.

I still have plans to get back to Cynthia, and I believe that there is still a lot of room for improvement in the naturalness of synthesized speech with unrestricted text and, especially, with conversational speech. My focus now, however, is on Skimcast and how to better enable reading and learning. The loss of Cynthia’s voice was devastating at the time, but I had a much bigger project just beginning, namely the linguistics behind Skimcast.

IMAGE: Spectrogram of Cynthia saying the word “optimistic”.
