What is speech perception?
The perception of human speech signals involves a variety of phenomena that at first appear trivial but are in fact exceedingly complex. The basic phenomena are: the ability to perceive the same speech message correctly when it is presented by different speakers, or by the same speaker performing under different conditions (the phenomenon of perceptual constancy); differences in the perceptual processing of speech and nonspeech sounds; the ability to discriminate well among sounds from different speech-sound categories but only poorly among sounds from within the same category (categorical perception of speech); and the problems that the signal’s immediate speech-sound (phonetic) context poses for correct identification of the signal (the phenomenon of context-sensitive cues).
Each phenomenon is so complex primarily because of the nature of the speech signal. A spoken language is perceived by a native listener as a sequence of discrete units, commonly called words. The physical nature of the typical speech signal, however, is more accurately described as a continuous, complex acoustic wave. In this signal, not only the sounds associated with consecutive syllables but also the sounds of consecutive words often overlap considerably.
The ultimate goal of speech perception research is the development of a theory that explains the various phenomena associated with the perception of the human speech signal. To achieve this goal, researchers need two basic types of information: a detailed description of the speech signal, to test whether any acoustic cues exist that could be used by listeners; and accurate measurements of the acts of speech perception, to test hypotheses related to the different theories of speech perception.
When describing the speech signal for a given language, researchers have noted that the signal is composed of a set of basic units called phonemes, which are considered to be the smallest units of speech. The phonemes can be thought of (though this analogy is imprecise) as corresponding somewhat to the letters in a written word. For example, American English is said to contain twenty-five consonant phonemes and seventeen vowel phonemes. The distinction between consonant and vowel speech sounds is based on the degree to which the vocal tract is closed. Consonants are generated with partial or complete closure of the vocal tract during some point of their production. Vowels are created with the vocal tract in a more open state.
Consonants are produced by closing or nearly closing the vocal tract, so they contain relatively little acoustic energy. Because the shape of the resonant cavities of the vocal tract changes dynamically during consonant production, consonants are difficult to specify exactly in terms of acoustic patterns. They commonly contain bursts of noise, rapid changes of frequencies, or even brief periods of silence, all of which may take place within about twenty milliseconds (twenty-thousandths of a second).
Vowels have less complex acoustic characteristics, primarily because they are produced with the vocal tract open and without dramatic changes in its shape. They are of relatively long duration and tend to have more constant acoustic features than consonants. The most important features of vowel sounds are their formants, which are narrow ranges of sound frequencies that become enhanced during vowel production. The formants result from basic physical characteristics of the vocal tract, chiefly its shape for a particular vowel, which cause most of the vocal frequencies to be suppressed while only a few narrow bands of frequencies (the formants) are reinforced. Formants are numbered in increasing order from the lowest- to the highest-frequency band. The relative frequency relationships among the formants of a vowel sound characterize that vowel.
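The physical origin of formants can be illustrated with the simplest acoustic model of the vocal tract: a uniform tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of the quarter-wavelength frequency. The sketch below is not part of the original discussion; the 17.5-centimeter tract length and 350 m/s speed of sound are assumed round values commonly used for an adult male producing a neutral vowel.

```python
# Quarter-wavelength resonator model of the vocal tract (illustrative sketch).
# A uniform tube closed at one end (glottis) and open at the other (lips)
# resonates at f_n = (2n - 1) * c / (4 * L).
C = 350.0   # speed of sound in warm, humid air (m/s); assumed round value
L = 0.175   # assumed vocal tract length for an adult male (m)

def formant(n, length=L, speed=C):
    """Frequency (Hz) of the n-th resonance of a uniform tube."""
    return (2 * n - 1) * speed / (4 * length)

for n in (1, 2, 3):
    print(f"F{n} ~ {formant(n):.0f} Hz")   # F1 ~ 500, F2 ~ 1500, F3 ~ 2500
```

Changing the tube's shape (as the tongue and lips do for different vowels) shifts these resonances away from the evenly spaced neutral values, which is why each vowel has its own characteristic formant pattern.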
Experiments show that the vowel sounds in English speech can be distinguished from one another by reference to the frequency values of formants one and two. For any given vowel sound, however, there is a range of frequencies that typically occurs for the formants, depending on the person speaking and the conditions under which the individual speaks. There is even some overlap between the ranges for some vowels.
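The idea that vowels can be separated by the frequencies of formants one and two can be sketched as a simple nearest-neighbor rule. The average (F1, F2) values below are approximate round figures of the kind reported for adult male American English speakers, and the classification rule itself is only an illustration of the principle, not a model proposed in the text.

```python
import math

# Approximate average (F1, F2) values in Hz for four vowels; these are
# illustrative round figures, not measurements from any one study.
VOWELS = {
    "i (heed)":  (270, 2290),
    "ae (had)":  (660, 1720),
    "a (hod)":   (730, 1090),
    "u (who'd)": (300,  870),
}

def classify(f1, f2):
    """Return the vowel whose average (F1, F2) pair lies closest to the input."""
    return min(VOWELS, key=lambda v: math.dist((f1, f2), VOWELS[v]))

print(classify(300, 2200))   # a token near the /i/ region
print(classify(700, 1100))   # a token near the /a/ region
```

Because each vowel's formants actually occupy a range that varies with the speaker, and the ranges of some vowels overlap, a real listener's task is considerably harder than this fixed-prototype sketch suggests.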
Vowels and consonants can be further subdivided according to the articulatory features that characterize production of the sound. Articulatory features include the location of the greatest constriction in the vocal tract, the degree of rounding of the lips, the place of articulation (that is, where in the vocal tract the sound tends to be primarily produced, such as the lips or in the nasal cavity), and the manner of articulation (for example, voiced means the vocal folds vibrate, and voiceless means the vocal folds do not vibrate). These factors are important because of their possible use by a listener during the process of speech perception.
The nervous system can be viewed as consisting of two main subdivisions: transmission systems and integrative systems. For speech perception, the transmission systems both transmit and process the nervous signals that are produced by acoustic stimulation of the sensory structures for hearing. The integrative systems further process the incoming signals from the transmission systems by combining and comparing them with previously stored information. Both systems are actively involved in the processes of speech perception. Much research has been done concerning the exact mechanisms of signal processing in the nervous system and how they enable listeners to analyze complex acoustic speech signals to extract their meaning.
Theories of speech perception can be described in several ways. One way of categorizing the theories labels them as being either top down or bottom up. Top-down theories state that a listener perceives a speech signal based on a series of ongoing hypotheses. The hypotheses evolve at a rather high level of complexity (the top) and are formed as a result of such things as the listener’s knowledge of the situation or the predictability of the further occurrence of certain words in a partially completed sequence. Bottom-up theories take the position that perception is guided simply by reference to the incoming acoustic signal and its acoustic cues. The listener then combines phonemes to derive the words, and the words to produce sentences, thereby proceeding from the simplest elements (the bottom) up toward progressively more complex levels.
A contrasting description is that of active versus passive theories. Active theories state that the listener actively generates hypotheses about the meaning of the incoming speech signal based on various types of information available both in the signal and in its overall context (for example, what has already been said). The listener is said to be using more than simply acoustic cues to give meaning to what has been heard. Passive theories state that the listener automatically (passively) interprets the speech signal based on the acoustic cues that are discerned.
Often, major differences in acoustic waves are produced by different speakers (or the same speaker performing under different conditions) even when speaking the same speech message. Nevertheless, native listeners typically have little trouble understanding the message. This phenomenon, known as perceptual constancy, is probably the most complex problem in the field of speech perception.
Variations in the rate of speaking, the pitch of the voice, the accent of the speaker, the loudness of the signal, the absence of particular frequency components (for example, when the signal is heard over a telephone), and other factors are handled with remarkable speed and facility by the typical listener. Many of these variations drastically change, or even eliminate entirely, acoustic cues normally present in the signal.
There is experimental evidence to support the hypothesis that when speech occurs at a higher-than-normal rate, the listener uses both syllable and vowel durations as triggers to adjust the usual stored acoustic cues toward shorter and faster values. This automatic adjustment permits accurate speech perception even when the speaking rate approaches four hundred words per minute.
Another difficult task is to explain the ease with which a listener can understand speech produced by different persons. The largest variations in vocal tract size (especially length) and shape occur between children and adults. Even among adults, significant differences are found, the average woman’s vocal tract being nearly 15 percent shorter than that of the average man. These differences introduce quite drastic shifts in formant frequencies and other frequency-dependent acoustic cues. Nevertheless, experiments show that even very young children generally have no difficulty understanding the speech of complete strangers, which indicates that the nervous system is able to compensate automatically even before much speech perception experience has been garnered.
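The size of the formant shift implied by a shorter vocal tract can be worked out from the uniform-tube approximation, in which resonance frequencies scale inversely with tract length. This is an illustrative calculation under that simplifying assumption, using the neutral-vowel formant values of the tube model rather than measured data.

```python
# In a uniform-tube model, formant frequencies scale as 1 / length, so a
# vocal tract 15 percent shorter raises every formant by a factor of
# 1 / 0.85, i.e. roughly 18 percent.  Illustrative calculation only.
male_formants = (500.0, 1500.0, 2500.0)   # neutral-vowel values, uniform tube
scale = 1.0 / 0.85                        # 15 percent shorter tract

female_formants = tuple(round(f * scale) for f in male_formants)
print(female_formants)   # roughly (588, 1765, 2941)
```

A shift of this magnitude moves every frequency-dependent cue, which is why speaker normalization by the listener's nervous system is such a substantial computational feat.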
Studies of human perceptual processing using speech and nonspeech sounds as stimuli provide evidence for differences in the way people deal with these two categories of sounds. The implication is that specialized speech-processing mechanisms exist in the human nervous system. A major difference is a person’s ability to process speech signals at a higher rate than nonspeech signals. Experiments show that phonetic segment information can be perceived as speech at rates as high as thirty segments per second (normal conversation rates transmit about ten segments per second). The rate of perception for comparable nonspeech signals, however, is only about four sounds per second.
The phenomenon of categorical perception of speech refers to the fact that people discriminate quite well among sounds from different speech sound categories (for example, a /b/ as opposed to a /p/ sound, as might occur in the two words “big” and “pig”); however, people’s discrimination of different acoustic examples of sounds from within the same speech sound category (for example, variations of the /b/ sound) is not as good. One theory to explain categorical perception proposes that the auditory system is composed of nerve cells or groups of nerve cells that function as feature detectors that respond whenever a particular acoustic feature is present in a signal. In the example of the sounds /b/ and /p/ from the spoken words “big” and “pig,” according to this theory, there are feature detectors that respond specifically to one or the other of these two consonants, but not to both of them, because of the different acoustic features that they each possess. One problem for proponents of the theory is to describe the particular features to which the detectors respond. Another problem is the number of different feature detectors a person might have or need. For example, is one detector for the consonant /b/ sufficient, or are there multiple /b/ detectors that permit a person to perceive /b/ correctly regardless of the speaker or the context in which the /b/ is spoken (and the consequent variations in the acoustic patterns for the /b/ that are produced)?
Although variations in the immediate speech-sound (phonetic) context often produce major changes in the acoustic signature of a phoneme (the phenomenon of context-sensitive cues), a person’s ability to identify the phoneme is remarkable. People can recognize phonemes even though the variations in the acoustic signatures of a given phoneme, when spoken by even a single speaker but in different contexts (for example, /d/ in the syllables “di,” “de,” “da,” “do,” and “du”), make it difficult to specify any characteristic acoustic features of the phoneme.
Research shows that many acoustic cues (such as short periods of silence, formant changes, or noise bursts) interact with one another in determining a person’s perception of phonemes. Thus, there is no unique cue indicating the occurrence of a particular phoneme in a signal because the cues depend on the context of the phoneme. Even the same acoustic cue can indicate different phonemes, according to the context. A complete theory of speech perception would have to account for all these phenomena, as well as others not mentioned.
Speech sounds represent meanings in a language, and a listener extracts the meanings from a speech signal. What has remained unclear is how the nervous system performs this decoding. One hypothesis is that there are sensory mechanisms that are specialized to decode speech signals. This idea is suggested by the experimental results that indicate differences in the processing of speech and nonspeech signals. An alternative hypothesis is that special speech-processing mechanisms exist at a higher level, operating on the outputs of generalized auditory sensory mechanisms.
In the 1960s, the study of speech perception developed rapidly and three major theories were developed. These motivated a wealth of research projects, assisted by advances in electronic instrumentation, and have formed a basis for the development of later theories. All three theories specify an interaction between the sensory representation of the incoming speech signal and the neuromotor commands (that is, the pattern of signals that the nervous system would have to generate to activate the muscles for speaking) that would be involved in the production of that same signal. Two of the main theories are the motor theory and the auditory model of speech perception.
The first and probably most influential of the theories is Alvin M. Liberman’s motor theory of speech perception. Briefly stated, the motor theory maintains that a listener decodes the incoming speech signal by reference to the neuromotor commands that would be required to produce it. The process of speech perception therefore involves a sort of reverse process to that of speech production, in which a speaker has a message to send and generates appropriate neuromotor commands to enable the articulatory muscles to produce the speech signal. According to the motor theory of speech perception, the listener has an internal neural pattern, generated by the incoming speech signal’s effects on the sensory apparatus. This pattern can be “followed back” to the neuromotor commands that would be necessary to produce an acoustic signal like the one that has just produced the internal (sensory) neural pattern. At this point, the listener recognizes the speech signal, and perception occurs by the listener’s associating the neuromotor commands with the meanings they would encode if the listener were to produce such commands when speaking.
Among the problems facing the motor theory, a major one has been to explain how infants are able to perceive surprisingly small differences in speech signals before they are able to produce those same signals, since it would seem that they do not yet possess the necessary neuromotor commands. Another problem has been the inability of the theory’s supporters to explain how the “following back” from the neural activity patterns generated by the incoming signal to the appropriate neuromotor commands is accomplished.
At the other end of the theoretical spectrum from the motor theory, Gunnar Fant’s auditory model of speech perception places greater emphasis on an auditory analysis of the speech signal. This theory proposes that the speech signal is first analyzed by the nervous system so that distinctive acoustic features get extracted or represented in the activity patterns of the nervous system. Then these features are combined into the phonemes and syllables that the listener can recognize. Much as in the motor theory, this recognition depends on the listener possessing basic knowledge about the articulatory processes involved in speech production—in particular, the distinctive phonetic features possible in the language being heard.
In contrast to the motor theory, Fant’s model supposes an ability of the listener’s auditory system to pick out distinctive acoustic features from the phonetic segments being heard. The auditory model, therefore, separates the auditory and articulatory functions more distinctly than the motor theory does.
One of the problems of the auditory model is that distinctive acoustic features of phonetic segments are difficult to specify unambiguously. Supporters of the model argue that the important features are more complex than the relatively simple ones normally proposed and represent characteristic relationships between various parts of the signal.