INART 55

History of Electroacoustic Music

Speech Synthesis: The Voder and the Vocoder


Beginning in 1936, Homer Dudley (1896-1980), a researcher in the Acoustical Research division of Bell Labs, investigated the possibility of synthesizing human speech electronically. The aim was to conserve bandwidth over telephone circuits by sending control signals instead of the vocal signals themselves.

Dudley discovered that vowels could be simulated with an oscillator that produced a wave containing many harmonics and a set of bandpass filters, each of which eliminated all but a limited band of frequencies. The wave is analogous to the sound produced by the larynx. The vocal tract then filters this sound in different ways, depending on the vowel being produced. Vowels are characterized by specific formants, frequency regions that are emphasized in the spectrum of a person's voice. Regardless of the pitch at which a person speaks or sings, these formant ranges are predominant. Recognizable vowel sounds may be produced with as few as three formants, each centered on a given frequency and each at its own amplitude. The formant frequencies and amplitudes differ, depending on the vowel sound being produced. Three approximations are shown below:

Another characteristic of speech is noise, which the vocal tract filters to produce unvoiced sounds (such as "sss," "shhh," "fff") and plosives ("k," "ch," "p"). Voiced plosives -- such as "buh" and "duh" -- are produced with a combination of filtered noise and vowel sounds.
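The source-and-filter account of vowels given above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a model of Dudley's circuitry: the sample rate, the two-pole resonator design, and the three formant values for an "ah"-like vowel are all assumed for the example, and the names sawtooth, resonator, vowel, and level_at are hypothetical.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)

def sawtooth(freq_hz, n_samples):
    """Harmonic-rich wave standing in for the sound of the larynx."""
    return [2.0 * ((freq_hz * n / SAMPLE_RATE) % 1.0) - 1.0
            for n in range(n_samples)]

def resonator(signal, center_hz, bandwidth_hz):
    """Two-pole bandpass filter that emphasizes one formant region."""
    r = math.exp(-math.pi * bandwidth_hz / SAMPLE_RATE)
    theta = 2.0 * math.pi * center_hz / SAMPLE_RATE
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    gain = 1.0 - r  # rough level normalization
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def vowel(pitch_hz, formants, n_samples):
    """Send one source through several formant filters and sum the results."""
    source = sawtooth(pitch_hz, n_samples)
    mix = [0.0] * n_samples
    for center_hz, bandwidth_hz, amplitude in formants:
        band = resonator(source, center_hz, bandwidth_hz)
        for i, value in enumerate(band):
            mix[i] += amplitude * value
    return mix

def level_at(signal, freq_hz):
    """Strength of a single frequency in the signal (one DFT-style probe)."""
    step = 2.0 * math.pi * freq_hz / SAMPLE_RATE
    re = sum(x * math.cos(step * n) for n, x in enumerate(signal))
    im = sum(x * math.sin(step * n) for n, x in enumerate(signal))
    return 2.0 * math.hypot(re, im) / len(signal)

# (center Hz, bandwidth Hz, amplitude): assumed, approximate "ah" formants
AH_FORMANTS = [(700.0, 80.0, 1.0), (1200.0, 90.0, 0.5), (2600.0, 120.0, 0.25)]
```

Calling vowel(100.0, AH_FORMANTS, 8000) and then vowel(200.0, AH_FORMANTS, 8000) changes the pitch of the source while leaving the filters alone; level_at can be used to compare how much energy falls near a formant frequency versus away from it.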

At the 1939 World's Fairs in New York and San Francisco, Bell introduced the voder (Voice Operating DEmonstratoR), a machine with which a technician could create a facsimile of human speech. The machine produced a sawtooth-like wave that was sent through a series of bandpass filters. The operator manipulated a set of ten switches, each of which controlled the output level of a bandpass filter. Depressing a bar with the wrist controlled the balance of pitched sound and noise. A foot pedal controlled the pitch, allowing vocal inflections to be produced.
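The three controls described above (keys for filter levels, a wrist bar for the balance of pitched sound and noise, and a pedal for pitch) can be pictured as parameters to two small functions. This is a rough sketch, not a reconstruction of the hardware: the sample rate, the linear crossfade, and the names voder_excitation and mix_bands are assumptions, and the bandpass filtering that would produce band_signals is left out.

```python
import random

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)

def voder_excitation(pitch_hz, voicing, n_samples, seed=0):
    """Blend a sawtooth "buzz" with noise "hiss".

    pitch_hz plays the role of the foot pedal; voicing plays the role of
    the wrist bar (1.0 = all pitched sound, 0.0 = all noise).
    """
    rng = random.Random(seed)
    out = []
    for n in range(n_samples):
        buzz = 2.0 * ((pitch_hz * n / SAMPLE_RATE) % 1.0) - 1.0
        hiss = rng.uniform(-1.0, 1.0)
        out.append(voicing * buzz + (1.0 - voicing) * hiss)
    return out

def mix_bands(band_signals, key_levels):
    """Sum the filter outputs, each scaled by the level of its key."""
    if len(band_signals) != len(key_levels):
        raise ValueError("one key level is needed per band")
    n_samples = len(band_signals[0])
    return [sum(level * band[i] for band, level in zip(band_signals, key_levels))
            for i in range(n_samples)]
```

In this picture the operator's job amounts to changing pitch_hz, voicing, and the ten entries of key_levels from moment to moment.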

Having shown that intelligible speech could be produced comparatively simply, Dudley's next step was to eliminate the human operator and instead create an analysis unit. In 1940, Dudley introduced the vocoder (VOice CODER). A vocal signal was sent through a bank of bandpass filters. The output level of each filter was then used to control a corresponding output filter, through which noise and "buzzy" sounds were sent. The result was that the signal sent through the output filters would "talk."
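The analysis and synthesis stages just described can be sketched as a small channel vocoder. The following is a minimal illustration under assumptions, not Dudley's design: the sample rate, filter type, bandwidth, and envelope smoothing are chosen for the example, and the names bandpass, envelope, buzz, and vocode are hypothetical.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)

def bandpass(signal, center_hz, bandwidth_hz):
    """Two-pole bandpass filter, used for both analysis and synthesis."""
    r = math.exp(-math.pi * bandwidth_hz / SAMPLE_RATE)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / SAMPLE_RATE)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = (1.0 - r) * x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def envelope(signal, smoothing_hz=30.0):
    """Rectify and smooth a band to get its slowly varying output level."""
    coeff = math.exp(-2.0 * math.pi * smoothing_hz / SAMPLE_RATE)
    level = 0.0
    out = []
    for x in signal:
        level = coeff * level + (1.0 - coeff) * abs(x)
        out.append(level)
    return out

def buzz(freq_hz, n_samples):
    """Sawtooth carrier: the "buzzy" sound fed to the output filters."""
    return [2.0 * ((freq_hz * n / SAMPLE_RATE) % 1.0) - 1.0
            for n in range(n_samples)]

def vocode(voice, carrier, band_centers, bandwidth_hz=150.0):
    """Impose the band-by-band levels of the voice onto the carrier."""
    n = min(len(voice), len(carrier))
    output = [0.0] * n
    for center_hz in band_centers:
        # Analysis: how strong is the voice in this band at each moment?
        levels = envelope(bandpass(voice[:n], center_hz, bandwidth_hz))
        # Synthesis: the same band of the carrier, scaled by those levels.
        carrier_band = bandpass(carrier[:n], center_hz, bandwidth_hz)
        for i in range(n):
            output[i] += levels[i] * carrier_band[i]
    return output
```

Only the per-band levels pass from the analysis side to the synthesis side, which is the sense in which control signals stand in for the vocal signal itself.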

The vocoder, then, employs subtractive synthesis, as did the Trautonium. Its two steps, analysis and synthesis, allow speech not only to be reproduced but also to be manipulated. One way is to vary the pitch of the signal being sent through the output filter bank, thus making the "voice" higher or lower. "Harmonies" may be produced by sending more than one signal to the synthesis filters. Another way is to redirect the output of the analysis filters to synthesis filters that do not pass the same frequency band. For example, the analysis of low frequencies may be sent to high output frequencies, thus remapping portions of the input spectrum. Directing low analysis frequencies to higher synthesis frequencies may produce a nasal sound. Directing high analysis frequencies to low synthesis frequencies may produce a voice that sounds as though the speaker has a bad cold.
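The remapping of analysis bands to different synthesis bands can be pictured as a routing table applied to one frame of band levels. This is a sketch only; the list-based routing format and the name remap_levels are assumptions made for illustration.

```python
def remap_levels(levels, routing):
    """Send each analysis band's level to a chosen synthesis band.

    levels[i] is the measured level of analysis band i (low to high).
    routing[i] is the index of the synthesis band that receives it.
    """
    remapped = [0.0] * len(levels)
    for source_band, target_band in enumerate(routing):
        remapped[target_band] += levels[source_band]
    return remapped

# One frame of levels from four analysis bands, lowest band first.
frame = [0.9, 0.5, 0.1, 0.0]

unchanged = remap_levels(frame, [0, 1, 2, 3])   # ordinary vocoding
inverted = remap_levels(frame, [3, 2, 1, 0])    # lows drive highs and vice versa
shifted_up = remap_levels(frame, [1, 2, 3, 3])  # every band moved one step higher
```

Changing the pitch of the "voice", by contrast, needs no routing change at all: only the signal fed to the synthesis filters is altered.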

The irony of the vocoder is that analyzing and resynthesizing vocal signals effectively is expensive, in both bandwidth and circuitry. It therefore defeated Dudley's original purpose, and was never used in telephone technology.

Many recordings in the 1940s and 1950s used vocoder-like effects. Examples include a talking foghorn on Lifebuoy soap commercials, a talking train on Bromo Seltzer commercials, children's recordings such as Sparky's Magic Piano, and the talking train in the Disney film Dumbo. These effects were not produced with a vocoder, but with a simpler device called a Sonovox. The Sonovox took an audio input, but instead of loudspeakers it had two small disks. The device was held to the throat, with the two disks pressing on either side of the larynx, and a performer would silently mouth the words of a speech passage, being careful to add the unvoiced fricatives (f, sh, t, etc.). The audio sent to the disks substituted for vocal cord energy, and the result was a "talking signal." Such devices are now used medically for patients who have had their larynxes removed -- a buzzing sound produced by the device allows those who cannot speak to create an audible, speech-like sound. Later devices, such as the "talk box," used by artists such as Peter Frampton, were based on the Sonovox.

SOURCES:
http://ptolemy.eecs.berkeley.edu/%7eeal/audio/voder.html
http://ptolemy.eecs.berkeley.edu/%7eeal/audio/vocoder.html
http://www.newmusicbox.org/third-person/oct99/links.html
Wendy Carlos on vocoders