The Auditory Figures of Vowel Sounds
From CNBH Acoustic Scale Wiki
AUDITORY IMAGES OF VOWEL SOUNDS
The auditory image emphasises regular time-interval structures that repeat in the neural activity pattern produced by the cochlea. In other words, it emphasises the shapes of the auditory figures that sounds produce. The shapes in turn reveal properties of the source of the sound such as the type of excitation and the form of any resonances. One of the more intriguing sources from the point of view of excitation and resonances is the human voice, and it is also one of the most important sources in everyday life. This Chapter presents a series of extended examples involving vowel sounds to illustrate a) what makes a sound a vowel, b) the discrimination problem that arises when two people speak simultaneously, c) the different voice qualities that trained singers have at their command, and d) the similarities and differences between the voices of men and women.
The general characteristics of vowel sounds
When quantised temporal integration is applied to the NAP of the vowel in 'mat' (Figure 3.1), the result is the auditory image shown in Figure 4.1a. It shows a tall auditory figure that repeats, indicating the presence of a broadband periodic sound. These are both general characteristics of vowel sounds. The pitch is about 140 cps as indicated by the set of vertically aligned peaks at 7.15 ms and the matching set at twice this time interval. This pitch is about an octave below middle C on the keyboard and it is a typical value for the voice of a male speaker; the average pitch for female speakers is closer to middle C. In the low-frequency channels where the auditory filter is relatively narrow, the image consists of streams of regularly spaced image pulses that decrease slowly in height from right to left across the image. The time intervals between pulses are the same in adjacent channels and so the channels vary primarily in overall level. Thus, the auditory filters isolate individual harmonics of the voice pitch in this frequency region, and the figure components in these channels are typical sinlets. The lowest harmonic is the fundamental with one image pulse per glottal cycle. Harmonics two through five are also isolated with two to five image pulses per glottal cycle. Resolved, low harmonics are also characteristic of vowel sounds. In the upper half of the auditory figure where the auditory filters are broader, the individual harmonics of the voice pitch are not resolved, and each channel contains an asymmetric figure component in which the time intervals between pulses reflect the centre frequency of the channel. The upper half of this vowel, then, is composed of implets, and this too is characteristic of vowel sounds.
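The arithmetic behind the pitch estimate can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the value used for middle C (261.63 cps, equal temperament with A4 = 440) is an assumption that does not come from the text.

```python
import math

MIDDLE_C_CPS = 261.63  # assumed: equal-tempered middle C with A4 = 440 cps


def pitch_from_interval(interval_ms: float) -> float:
    """Repetition rate in cycles per second for a time interval in ms."""
    return 1000.0 / interval_ms


def semitones_from(f_cps: float, ref_cps: float) -> float:
    """Signed distance in equal-tempered semitones from ref_cps to f_cps."""
    return 12.0 * math.log2(f_cps / ref_cps)


def pulse_spacing_ms(period_ms: float, harmonic: int) -> float:
    """Spacing of image pulses in a channel that isolates one harmonic.

    Harmonic n produces n pulses per glottal cycle, so the pulses are
    period / n apart.
    """
    return period_ms / harmonic


pitch = pitch_from_interval(7.15)
print(round(pitch, 1))  # about 140 cps
print(round(semitones_from(pitch, MIDDLE_C_CPS), 1))  # negative: below middle C
for n in range(1, 6):
    print(n, round(pulse_spacing_ms(7.15, n), 3))
```

The semitone distance comes out a little short of a full octave (12 semitones), which matches the description of the pitch as "about an octave below middle C".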
The /ae/ in 'mat' is a full vowel, as opposed to a reduced vowel. Auditory images of three more full vowels are presented in the remaining sections of Figure 4.1; the vowels are /b/ as in 'moth' (Figure 4.1b), /i/ as in 'meet' (Figure 4.1c) and /u/ as in 'moot' (Figure 4.1d). In each case, the auditory image reveals a tall, well-defined auditory figure whose lower portion is composed of sinlets and whose upper portion is composed of implets, indicating that these are general properties of vowel sounds. Reduced vowels tend to be softer and shorter in duration, with the result that their auditory figures are less well defined. Nevertheless, they are tall figures with sinlets below and implets above, and the width is around 8 ms for men and 4 ms for women. These are the characteristics that define vowel sounds in general terms.
These basic properties of vowel sounds serve to distinguish them from the vast majority of everyday sounds such as the calls of birds and domestic animals, the whirs and bumps of motors and engines, or the noise of computers, wind and rain. It remains the case, however, that there are some musical instruments which produce sounds that fall within the category outlined above -- instruments like the bassoon and cello whose pitch range is commensurate with that of speech and which produce broadband sounds with regular high harmonics. There are more complex properties of vowel sounds which restrict the category still further and which separate vowels from these musical sounds. These properties involve the shapes of the vowel figures and the way the shape changes over time. These shape issues are the main topic of this Chapter.
Auditory figures of four strong vowels
The auditory figures of maat and moot look a little like tall, thin flag poles with guy wires and triangular flags. The poles are the integrated remains of NAP pulses that initiated temporal integration; the guy wires are the remains of the NAP pulses that occurred just before and after the pulses that make up the pole. The flags are sets of implets whose properties vary regularly across a narrow range of frequency channels. For example, at the top of the auditory figure of maat, around ERB 28, the implets have steep slopes and they are brief because their level is relatively low. As centre frequency decreases to about ERB 25, the level of the implets increases and the slopes decrease a little, so the duration of the implet increases. Thereafter, the level drops, the slope increases and the duration shrinks as centre frequency proceeds down to ERB 24. Together, the set of implets give the auditory figure a triangular profile in this region, much like that of a flag in a wind blowing from left to right across the image. In three dimensions, the set of implets forms a hemi-cone lying on its side. A set of implets in this configuration indicates that the sound source contains a resonance in this frequency region and that the resonance has been excited by some form of acoustic pulse rather than a tone or a square wave. In speech, these resonances are referred to as formants and the excitation is a glottal pulse, that is, an abrupt change in air pressure caused when the vocal folds open xxx once per cycle of the vowel. It is these resonances that give the auditory figures of these sounds their basic vowel shape.
High-frequency formants: The uppermost flag in the maat figure is the fourth formant (F4) of the vowel and it is centred at ERB 25. The centre frequency of F4 varies little across vowels for a given speaker; it rises as much as one ERB in the syllable meet and falls as much as one ERB in the syllable maht, but it is essentially fixed for this speaker in the region of ERB 25. The next flag down in maat is the third formant (F3), centred near ERB 22. The centre frequency of F3 also varies little across vowels for a given speaker, although it is more difficult to assess in cases where the third and second formants interact, as in meet. The reason for the relative immobility of F3 and F4 is that the resonances associated with these formants are produced by the teeth and hard palate (the roof of the mouth just behind the teeth), neither of which changes shape when we speak.
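Formant positions in this Chapter are quoted as channel positions in ERB units. The sketch below shows one common way to convert between frequency in Hz and ERB number. It assumes the Glasberg and Moore (1990) ERB-rate formula, which may not be the scale used to label the channels in the figures, so the mapping should be treated as an assumption and not as a way of recovering the exact frequencies of the formants described here. The function names are hypothetical.

```python
import math


def hz_to_erb_rate(f_hz: float) -> float:
    """ERB number for a frequency in Hz (assumed Glasberg and Moore, 1990)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def erb_rate_to_hz(erb: float) -> float:
    """Inverse of hz_to_erb_rate: frequency in Hz for an ERB number."""
    return (10.0 ** (erb / 21.4) - 1.0) * 1000.0 / 4.37


def erb_bandwidth_hz(f_hz: float) -> float:
    """Equivalent rectangular bandwidth of the auditory filter centred at f_hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)


print(round(hz_to_erb_rate(1000.0), 1))  # 15.6
# Filter bandwidth grows with centre frequency under this formula.
for f in (250.0, 1000.0, 4000.0):
    print(f, round(erb_bandwidth_hz(f), 1))
```

Under this formula the filters broaden steadily with centre frequency, which is consistent with the description of harmonics being resolved in the low channels and unresolved in the high channels.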
Although the positions of the centres of the upper formants are fixed, the shapes of these formants show considerable variation. Some of the variation is due to variation in the bandwidth of the resonance in the source. For example, a comparison of F3 and F4 in the vowels maat and moot shows that the flags have extended width and reduced height in moot, indicating that the resonances have narrower bandwidths in the case of moot. In meet, the reverse is true for F4 at least; the flag is broad for several milliseconds after the glottal peak and then it ends in a rounded tip, indicating a broader resonance in this case. Formant bandwidths are part of the information used to identify vowel type in some speech recognition systems and it seems likely that the information would also be used by the auditory system for the same purpose. The taper of the flag in the time-interval structure provides a sensitive measure of formant bandwidth. When formants have this simple flag shape, it indicates that the resonance was excited with a clean waveform consisting of nothing more than a regular sequence of glottal pulses. In this case, the ringing of the auditory filters falls away monotonically from the peak and the remainder of the period is devoid of activity.
In the vowel maht, there is secondary activity in the channels of F3 and F4 in the latter part of the glottal cycle, as if there were a sudden ripple in the vocal folds at the instant when they reach their maximum opening about half way through the glottal period. The stability of the fine structure of this feature shows that it is synchronised to the glottal pulse and at least quasi-periodic. An examination of the NAP shows that it comes and goes gradually over 5 to 10 cycles of the vowel. Secondary components in the glottal waveform are not uncommon. They are not thought to affect the vowel type, in the sense of making an /ae/ more like an /i/, but they do alter the quality of the voice and so they probably contribute to voice quality and speaker identification. The syllables presented in Figure 4.1 were spoken in isolation when they were recorded and so the secondary excitation is not due to emphasis of the syllable for semantic or syntactic reasons. Since the positions of the upper formants are dependent on the shape of the speaker's mouth and teeth, they vary a little with speaker and this probably contributes to speaker identification as well.
The second formant, F2, is very different from F3 and F4 in the sense that it can be centred anywhere from ERB 21, as in meet, down to ERB 14, as in maht, for the speaker in Figure 4.1. When F2 is in the upper part of this range, as in meet and maat, it has the characteristics of a high-frequency formant, appearing as a set of implets in a flag shape. When it is in the lower part of its range, F2 can appear as a double-tailed flag when the resonance is broad, as in maht, or as a set of modulated sinlets when the resonance is narrow, as in moot. These differences are pursued in the next sub-section on low-frequency formants.
Finally, note that in the syllable meet, F2 is especially high and it interacts with F3. The profile of the pair of formants (F2 and F3) is like one large flag and speech recognition systems often have difficulty determining the formant positions in this vowel. In the auditory image, this is not so much of a problem because the stabilised time-interval structure shows that the double formant has two tails and a distinct valley between the tails. A single large formant would have a ridge down the middle, just where the formant pair exhibits a valley.
The lower formants; F1 and F2: The majority of the information concerning vowel type ( /ae/, /i/, /u/, etc) is carried by the position and shape of the lower formants, F1 and F2, which occupy the region below the third formant and above the fundamental which, in the case of the author, is below ERB xxx and above ERB xxx. Within this broad range the positions and shapes of the first and second formants vary widely.
For example, in Figure 4.1, the centre frequency of F2 ranges from a low of xxx Hz (ERB 16) in maht to a high of xxx Hz (ERB 21) in meet. The formant appears isolated in the middle of its range (ERB 19) in the syllable maat. In this case, the shape of the formant is a flag like F3 and F4; the fine-structure timing decreases noticeably as centre frequency increases across the formant, indicating that it is composed of implets. The width of F2 is greater than that of F3 and F4, partly because F2 is stronger and partly because the filters in this region ring longer than those at higher centre frequencies. In the syllable meet, F2 is so high that its upper edge interacts with the lower edge of F3. Nevertheless, the basic shape of the formant and its centre frequency are still apparent from the implets in the lower half of the formant. In the syllable moot, the centre frequency has moved down to the point where the figure components can be interpreted either as implets that are so long that they overlap in time, or as modulated sinlets. There is little change in the fine-structure spacing as centre frequency increases across the formant and the period is essentially an integer multiple of the fine-structure spacing, indicating that the formant is quite near the eighth harmonic of the period of the vowel. In the syllable maht, F2 is slightly more complicated because it is centred a little below the eighth harmonic and so involves both the eighth and the seventh harmonics. The figure components associated with the seventh harmonic, on the low side of the formant, drop away after the glottal pulse faster than those associated with the eighth harmonic on the high side of the formant. This indicates that the formant is centred closer to harmonic eight than harmonic seven. Thus, when F2 is relatively low in its range, the auditory filters become narrow enough to begin to resolve the individual harmonics of the sound and the formant acquires a forked tail.
The relative lengths of the tail components provide information concerning the precise position of the centre of the formant.
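The relation between the glottal period, the fine-structure spacing and the harmonic number used in this paragraph can be sketched as follows. The function name and the example spacings are hypothetical; they are chosen for illustration and are not measurements from Figure 4.1.

```python
def nearest_harmonic(period_ms: float, spacing_ms: float):
    """Harmonic suggested by the fine-structure spacing in a channel.

    Returns (n, offset), where n is the nearest integer to
    period / spacing and offset is the signed remainder. An offset near
    zero means the activity sits on harmonic n; a negative offset means
    it sits a little below harmonic n.
    """
    ratio = period_ms / spacing_ms
    n = round(ratio)
    return n, ratio - n


PERIOD_MS = 7.15  # glottal period of the 140-cps voice described earlier

# Hypothetical spacing that divides the period exactly eight times.
print(nearest_harmonic(PERIOD_MS, PERIOD_MS / 8))

# Hypothetical, slightly longer spacing: still nearest to harmonic 8,
# but with a negative offset, i.e. centred a little below it.
print(nearest_harmonic(PERIOD_MS, 0.90))
```

A spacing that divides the period exactly gives an offset of zero, the situation described for moot. A slightly longer spacing gives a small negative offset, the situation described for maht, where the formant lies between the seventh and eighth harmonics but closer to the eighth.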
The centre frequency of the first formant, F1, ranges from a low of about xxx Hz (ERB 8) in meet to a high of about xxx Hz (ERB 12) in maat and maht. In this frequency region (ERB 7-13), the auditory filters resolve the individual harmonics of the vowels and so the formant is composed of one or more sets of sinlets. Perhaps the simplest example is the F1 in moot, where the formant rests on harmonics two and three as indicated by the sets of sinlets centred around ERB 7.5 and 10, respectively. The set of sinlets at the foot of the vowel figure is the first harmonic, or 'fundamental', of the sound. The sinlets of the fundamental are separated from those of the second harmonic by a set of four empty channels and the fundamental is not considered part of F1. There is also some activity in the region of the fourth harmonic where there is a low-level, modulated sinlet. The first formant in meet is similar to that in moot but the third harmonic is strongly modulated, indicating that the centre of the formant is well below this frequency component, and thus, that F1 is lower in meet than moot. In maat and maht, the centre of F1 is in the region of the fourth and fifth harmonics. In maht, neither the third nor the sixth harmonic is represented, indicating that the centre of F1 is between the two harmonics. In maat, the third harmonic is present at a reduced level, indicating that the centre is nearer the fourth harmonic than the fifth. In both maat and maht, the second harmonic is prominent as well as the fundamental but neither is part of F1; the fact that the third harmonic is weak or absent shows that the lower bound of F1 is above the second harmonic.
In summary, vowels produce stable auditory figures with a pair of fixed resonances in the region ...
The identity of a vowel -- whether it is /ae/, /i/, /u/, or some other vowel -- is largely determined by the positions and sizes of the formants. Since the lower formants move across broad frequency ranges and the upper formants are essentially fixed, the upper formants have only a small effect on vowel type. They probably serve as anchor points for the measurement of the positions of the lower formants without contributing to vowel type directly. F3 and F4 do play a role, however, in defining the sound as a vowel because vowels normally have two, adjacent, fixed resonances at the upper end of a broad frequency range. Musical notes with equivalent pitch rarely have two high resonances. Furthermore, in speech, the upper formants are fixed throughout a sequence of different vowels, whereas in music all of the resonances tend to move together. In short, when a sequence of auditory figures is capped by a pair of fixed flags, the sound source is probably the human voice.
Speaker Separation and Concurrent Vowels
With the advent of powerful computers, there has been a resurgence of interest in automatic speech recognition and there has been considerable progress in situations where the vocabulary is limited, the speaker is known to the system, and the speech is spoken slowly and clearly. It remains the case, however, that the performance of speech recognition systems deteriorates rapidly when two or more people speak simultaneously. This performance decrement has, in turn, prompted interest in the relative ease with which humans handle multi-speaker environments. Scheffers (1983a) pointed out that even when words occur simultaneously there are usually differences in the pitches of the speakers' voices, and he ran a series of studies to show that listeners presented with pairs of simultaneous vowels can identify them more accurately when they have different pitches. He also developed a simple vowel recognition system that fitted pairs of harmonic sieves to the spectra of concurrent vowels (Scheffers, 1983b). With this spectral dual-pitch preprocessor, the recognition system was able to recognise vowel pairs to some extent and it performed better when there was a pitch difference, but it did not achieve the same level of performance as his observers. Furthermore, the recognition system was not able to benefit from small pitch differences (1/4 semitone) whereas the observers were.
Scheffers' work with concurrent vowels has now been extended by several research groups in an effort to demonstrate that the performance of a recognition system can be improved towards the level achieved by human listeners if the traditional spectral preprocessor is replaced with a full time-domain model of hearing. The motivation for this approach is illustrated in Figure 4.2 which shows two auditory images of the vowels /ae/ and /i/ played concurrently. In the upper panel both vowels have the pitch 100 cps (a 10 ms period); in the lower panel the pitch of the /i/ has been raised to 125 cps (an 8 ms period). Both panels show a set of five resonances in the region 7 to 27 ERBs, but whereas all five resonances are aligned on one vertical in the upper panel, they are segregated onto two verticals in the lower panel. Specifically, the high F1 and low F2 of the /ae/ are aligned on the 10 ms time interval in the region between ERBs 10 xxx and 16 xxx. The low F1 and the high F2 of the /i/ are aligned on the 8 ms time interval. These same four formants exist in the upper panel but there is no basis for segregating them in this way when the pitch is the same. The large, exceptionally long, formant at the top of the figure in the upper panel is shown in the lower panel to be more strongly associated with the 8-ms source than the 10-ms source, although it is not the pair of clean resonances that would be expected at the top of an /i/ vowel. This indicates that the large formant is a mixture of formants from the two vowels and that those of the 8-ms vowel are stronger than those of the 10-ms vowel. This analysis illustrates that the time-interval dimension of the auditory image provides a means of segregating the resonances of two sources when they have differing time-interval structures. This time-interval cue is not available in spectral representations of sound.
Spectral preprocessors only specify the level of activity in each channel without regard to regularity in the time intervals. As a result, they preclude the possibility of segregating sources on the basis of alignment on selected time intervals, and this is the reason for believing that a time-interval preprocessor would lead to better recognition performance than a spectral preprocessor.
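The pitch and period values in this example, and the size of the quarter-semitone difference mentioned in connection with Scheffers' experiments, can be checked with a short calculation. This is a sketch with hypothetical function names; it assumes equal-tempered semitones (a frequency ratio of 2^(1/12) per semitone).

```python
import math


def period_ms(pitch_cps: float) -> float:
    """Glottal period in ms for a pitch in cycles per second."""
    return 1000.0 / pitch_cps


def semitone_difference(f_low_cps: float, f_high_cps: float) -> float:
    """Pitch difference in equal-tempered semitones."""
    return 12.0 * math.log2(f_high_cps / f_low_cps)


def shift_by_semitones(f_cps: float, semitones: float) -> float:
    """Pitch obtained by raising f_cps by the given number of semitones."""
    return f_cps * 2.0 ** (semitones / 12.0)


print(period_ms(100.0), period_ms(125.0))           # 10.0 8.0
print(round(semitone_difference(100.0, 125.0), 2))  # 3.86

# A quarter-semitone difference is far smaller than the 100 / 125 cps case.
raised = shift_by_semitones(100.0, 0.25)
print(round(raised, 2), round(period_ms(100.0) - period_ms(raised), 3))
```

On these assumptions, a quarter-semitone shift at 100 cps moves the period by well under a fifth of a millisecond, compared with the 2 ms separation between the verticals in the lower panel of Figure 4.2.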
Assmann and Summerfield (1990) replicated a subset of Scheffers' experimental findings and extended the modelling with a correlogram preprocessor originally suggested by Licklider (1951) as a model of pitch perception. Correlograms of vowel sounds are similar to the auditory images of vowel sounds but the correlograms are symmetric about the main verticals. Assmann and Summerfield (1990) calculated correlograms of their double vowels and then formed a 'pooled correlogram' by summing vertically across channels as suggested by Meddis and Hewitt (1991). The larger peaks in this pooled correlogram were used to predict the pitches of the vowels and the system performed reasonably when the hair-cell stage of the model included a compressive non-linearity. Two separate synchrony spectra were then generated for the two pitches and used to identify the two vowels. The recognition performance of this spectro-temporal model was closer to the human data than that of a simple spectral model, although there was still considerable room for improvement.
Meddis and Hewitt (1991 a xxx b ) have since gone a step further and used the largest peak in the pooled correlogram to identify the pitch of one of a pair of concurrent vowels. They separated channels that showed evidence of this pitch from those that did not and applied their recognition system to the two sets of channels separately. This approach has more intuitive appeal and recognition performance rises to a level comparable to that of the listeners in this case. Furthermore, the recognition system was xxx able to benefit from small pitch differences (1/4 semitone) like the observers in Scheffers' experiment. But both approaches indicate that auditory models with sensitivity to regularity in the time intervals will support better recognition performance than spectral preprocessors in the long run.
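The pooled-correlogram idea can be sketched with a toy example. This is not the model of Assmann and Summerfield or of Meddis and Hewitt: there is no cochlear filterbank or hair-cell stage, each 'channel' is simply a click train with the period of one source, and the sample rate, channel layout, threshold and function names are all hypothetical choices made for illustration.

```python
SAMPLE_RATE = 8000  # samples per second (hypothetical)


def click_train(period, length):
    """Toy channel output: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def autocorrelation(x, max_lag):
    """Unnormalised autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def pooled_correlogram(channels, max_lag):
    """Per-channel autocorrelations and their sum across channels."""
    acfs = [autocorrelation(ch, max_lag) for ch in channels]
    pooled = [sum(column) for column in zip(*acfs)]
    return acfs, pooled


def largest_peak_lag(pooled, min_lag):
    """Lag (in samples) of the largest pooled value at or beyond min_lag."""
    return max(range(min_lag, len(pooled)), key=lambda lag: pooled[lag])


def split_channels(acfs, lag, threshold=0.5):
    """Indices of channels that do / do not show the period `lag`."""
    with_pitch, without_pitch = [], []
    for index, acf in enumerate(acfs):
        shows_pitch = acf[0] > 0 and acf[lag] / acf[0] >= threshold
        (with_pitch if shows_pitch else without_pitch).append(index)
    return with_pitch, without_pitch


# Five toy channels: 80 samples = 10 ms (100 cps), 64 samples = 8 ms (125 cps).
periods = [80, 64, 64, 80, 64]
channels = [click_train(p, 1600) for p in periods]

acfs, pooled = pooled_correlogram(channels, max_lag=100)
best_lag = largest_peak_lag(pooled, min_lag=16)
dominant_pitch = SAMPLE_RATE / best_lag
with_pitch, without_pitch = split_channels(acfs, best_lag)
print(best_lag, dominant_pitch, with_pitch, without_pitch)
# 64 125.0 [1, 2, 4] [0, 3]
```

In this toy case the 8-ms source drives more channels, so it produces the largest pooled peak, and the channels are then divided into those that show the 8-ms period and those that do not. In a real model each set of channels would be passed to the recogniser separately.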
Dominance in Concurrent Vowels. Over the past decade Bregman (1990) has built up a strong and attractive case for the argument that the auditory system has extensive knowledge about the environment and that it is inclined to interpret incoming sound as an 'auditory scene' that needs to be analysed to locate auditory objects and to separate them from each other and from background sounds. Within this context, the auditory image is a form of window onto the auditory scene. The importance of speech in our lives, and the constant need to separate speech from background noise and other competing speech, means that the field of concurrent vowels is a popular example of auditory scene analysis and a growing area of stream segregation research. When presented with concurrent vowels that have approximately the same loudness, one vowel is usually perceived to dominate and be recognised before the other. In order to investigate this dominance phenomenon McKeown (1990 xxx ) has performed an extensive study in which five vowels were paired and presented to listeners with relative levels that varied over a range of 32 dB in 2 dB steps. The listeners were asked to identify the dominant vowel first and then the non-dominant vowel.
The psychometric functions show complete dominance of each of the five vowels when their relative level is strongest and, in this case, the identification of the non-dominant vowel is essentially random, indicating that the non-dominant vowel is essentially masked when at its lowest relative level. The crossover region of the psychometric function typically occupies about a third of the range and, for any pair of vowels, for any given listener, the function is smooth. However, the crossover point occurs at relative levels that differ by as much as ten decibels for different listeners. This suggests that in the middle range of relative levels the non-dominant vowel is not masked and that dominance is not simply a matter of relative masking. The listeners all had normal audiograms and so to the extent that this predicts normal peripheral processing, their auditory images would be expected to be similar. Thus, it appears that once components of both vowels are audible, the question of dominance is at least partly determined by more central, and as yet unspecified, processes. Subsequently, McKeown (1990 xxx ) went on to examine dominance in concurrent vowels as a function of stimulus duration and found an interesting contrast. One glottal period of a pair of vowels is sufficient to identify the dominant vowel accurately, while performance on the non-dominant rises slowly from chance to an asymptotic value well below that achieved on the dominant vowel over the course of about eight glottal periods.
Voice quality in vowels of men
The syllables in the first section of this Chapter were spoken in the standard speaking voice of a male English Canadian; the vowels of the second section were produced by a standard synthesiser, imitating a standard male American voice. The human voice is, of course, a much richer source of vowel sounds than exhibited to this point, and this section attempts to broaden the description of vowel sounds with examples of the varying voice qualities available to the trained speaker. In this section the speaker is an American male. The next section presents a comparison with an American woman speaking in the same voices.
Actors and singers are taught to speak in a variety of voices to portray the different voice qualities of people from different geographical regions and to portray different emotional states of a character. One school emphasises a set of six voices that span much of the available voice space and which are referred to by the self-descriptive names speech, falsetto, sob, twang, opera and belt (Estell et al, 1983). Recently a collaboration was set up to devise a metric for discriminating the voices objectively (Fujimura, 1992). As part of the effort, students who had learned the voices and been judged to produce them well were recorded as they sang typical examples of /a/ and /i/ in each of the voices. A set of one-second samples of these vowels was digitised for three male and three female speakers and it is fascinating to listen to a randomly ordered presentation of these sounds. On hearing a sample, the sex of the speaker and the vowel type, /a/ or /i/, are immediately obvious. Furthermore, after a brief introduction, the voice types are largely discriminable, although not as easily as sex and vowel type. Identification of the individual speakers is difficult for the naive listener, but it appeared that those acquainted with the singers were better than chance at identifying them.
Singer 1 on the vowel /i/
Examples of the differences in voice quality are illustrated in Figure 4.3 which shows auditory images of the vowel /i/ sung in four different voices by the same male singer.
Speech: The upper left panel shows the vowel sung in a normal speech voice and it looks much like the /i/ in meet shown in Figure 4.1c xxx. F1 is strong and low in both cases; it is centred between harmonics two and three for the singer in Figure 4.3a and a little lower for the author in Figure 4.1c. F2 is strong and high in both cases; it is centred between ERBs 20 and 21 for the singer and a little higher for the author. There is a little more separation between F2 and F3 in the voice of the singer and a little less separation between F3 and F4, but the two patterns are basically similar.
Falsetto: When the singer switches to a falsetto voice, shown in the lower left panel, the sound level drops across the entire frequency range. The first formant is little changed aside from the reduced level; the upper formants, however, become unstable. The three upper formants can still be distinguished on the 0-ms time-interval, but on the 7-ms and 14-ms intervals, their activity blends and the flag shape is seen only for F2 on 7-ms. The falsetto voice is described as weak and breathy indicating the presence of noise in the sound. On the 14-ms interval, there is temporal irregularity in the regions of F2 and F4 indicative of noisiness in the upper formants.
In Figure 4.3, the difference in level between the upper formants of the speech /i/ and the falsetto /i/ is actually less than that in the sounds themselves. The level of a voice is one of the cues to voice quality; falsetto and sob are quieter than speech, twang and opera are louder than speech, and belt is the loudest of all. When making the images, the sounds were normalised to remove much but not all of the level differences and so make it easier to compare the remaining aspects of voice quality.
Sob: One of the two voices not shown is 'sob'. It is similar to falsetto in the sense of being a weak voice with attenuated upper formants. It differs in not having the breathiness associated with falsetto. Sob also has a longer-term temporal characteristic that is not shown in a single frame of the auditory image; when properly performed, the voice shakes or quakes as the singer attempts to convey the impression of someone on the verge of tears. This temporal difference means that the dynamic images of falsetto and sob are easily distinguished.
Twang: The upper right hand panel of Figure 4.3 shows the auditory image of /i/ sung with 'twang' in the voice. Twang is a nasal voice; that is, nasality is emphasised and used more than the language requires to make nasal distinctions like the difference between the consonants /m/ and /n/. A familiar example is the voice of Bugs Bunny, and in particular, the long sliding vowel sound "Yiaaaah" just before the famous phrase "What's up Doc?" Twang is a tense voice that has an edge or sharpness to it. You can feel the tenseness when imitating the sound, especially along the margin of the tongue. The auditory image shows that the upper formants are stronger than in normal speech. This is both a relative and an absolute difference. In the lower part of these subfigures, the F1 in the twang image is seen to be weaker than the F1 in the normal speech image. This is due to the loudness adjustment applied to the stimuli. In the original recordings the F1 is about the same level in both voices. Thus, the difference in the upper formants is actually larger in the recordings than shown in the images.
There is also more temporal regularity in twang than in normal speech. The regularity takes three forms: firstly, the fine structure within the main formant structure is highly regular; secondly, extra features, like that in the middle of the period in the channels of F4, are more regular than in normal speech; and thirdly, these momentary regularities, seen in individual frames of the auditory image, are more persistent over frames. The effect is particularly apparent in the dynamic form of the auditory image, where a sequence of frames of the auditory image is calculated off line and played back rapidly as a simulation of the real-time auditory image. In these videos, the components of the twang voice are seen to be more stable than those in the speech voice.
The shape and position of the F1 in twang are essentially the same as the F1 in normal speech. The F2 in the twang sample is a little higher than that in the speech sample, and the F3 is a little lower in the twang sample. But the largest difference in the F1-F2 region of these auditory figures is the appearance of a 'pseudo-formant' on the sixth harmonic in the twang voice. It is well above the first formant in frequency and its level indicates that it would contribute to the timbre of the sound. Human listeners probably just hear this pseudo-formant as an aspect of the twang quality. The same might not be true for a mechanical recognition system, however, because F2 and F3 have merged in this vowel. As a result, a recogniser looking for four formants might take the undifferentiated energy concentration in the upper frequency region to be F3 and F4, and take the pseudo-formant on the sixth harmonic to be a low F2, which would lead the recogniser to suggest a vowel nearer to /a/ than /i/.
Opera: The auditory image for the opera voice is presented in the lower right hand panel of Figure 4.3. In this frame the source is seen to be like twang inasmuch as it is temporally regular and there is no noise in the image anywhere. Even the smallest features occur in each of the auditory figures, indicating that the sound is close to periodic in the short term, that is, over 30 ms or so. Over the longer time scale, however, there is a difference between twang and opera that is obvious both in the sound and the images, and that difference is vibrato. The pitch and amplitude of the opera voice oscillate at the rate of 6-8 cycles/second, whereas the pitch and amplitude of twang are stable. The opera frame in Figure 4.3 comes from a moment when the pitch and amplitude have just reached the top of the vibrato range and before starting down again. At this point in the cycle, the sound is close to periodic in the short term and the copies of the auditory figure that appear in the image are very similar. During the short pitch glides of the opera voice, the auditory figures maintain the same general shape but the individual pulses of the fine structure become asymmetric towards the left when the pitch is rising and towards the right when it is falling, and the degree of asymmetry varies, being greatest in the auditory figure on the left of the image. As a result, the dynamic images of opera and twang are easily distinguished.
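The reason a frame taken at the top of the vibrato range looks almost periodic can be illustrated with a simple model of the pitch contour. The sketch below assumes sinusoidal vibrato; the centre pitch (150 cps) and vibrato depth (3 per cent) are hypothetical values chosen for illustration, and only the vibrato rate (6 cycles/second, the lower end of the quoted range) and the 30 ms window come from the text.

```python
import math


def vibrato_pitch(t_s, centre_cps, depth, rate_cps):
    """Pitch at time t_s for sinusoidal vibrato (assumed contour shape)."""
    return centre_cps * (1.0 + depth * math.sin(2.0 * math.pi * rate_cps * t_s))


def pitch_excursion(t_centre_s, window_s, centre_cps, depth, rate_cps, steps=100):
    """Difference between highest and lowest pitch in a window around t_centre_s."""
    start = t_centre_s - window_s / 2.0
    values = [vibrato_pitch(start + k * window_s / steps, centre_cps, depth, rate_cps)
              for k in range(steps + 1)]
    return max(values) - min(values)


CENTRE_CPS = 150.0  # hypothetical centre pitch
DEPTH = 0.03        # hypothetical vibrato depth (plus or minus 3 per cent)
RATE_CPS = 6.0      # vibrato rate, lower end of the 6-8 cycles/second range
WINDOW_S = 0.030    # the "30 ms or so" short-term window

t_top = 1.0 / (4.0 * RATE_CPS)  # top of the vibrato cycle
t_glide = 0.0                   # middle of a rising pitch glide

at_top = pitch_excursion(t_top, WINDOW_S, CENTRE_CPS, DEPTH, RATE_CPS)
mid_glide = pitch_excursion(t_glide, WINDOW_S, CENTRE_CPS, DEPTH, RATE_CPS)
print(at_top < mid_glide)  # True: pitch is most stable at the top of the range
```

Under these assumptions the pitch changes several times less over a 30 ms window centred on the top of the cycle than over the same window centred on the middle of a glide, which is consistent with the near-identical copies of the auditory figure in the opera frame.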
Aside from vibrato, the striking aspect of the opera voice is the formant structure. F1 is just like that of the other voices but the remaining formants present a more complicated picture. There is what appears to be a weak pseudo-formant on the eighth harmonic where /i/ usually has none, and above the pseudo-formant is a cluster of four strong formants. The upper three of the strong formants are centred near ERBs 21, 23 and 26, which corresponds well with the centres of the upper three formants in the speech /i/ (Figure 4.3) and the upper three formants in the /i/ of 'meet' presented earlier (Figure 4.1). The remaining formant is centred near ERB 19 and is probably what is commonly referred to as the singer's formant. This formant would undoubtedly confuse a machine recognition system and it is interesting to consider why it does not confuse the human speech system. Presumably, the human system learns to expect a different pattern for /i/ when the context indicates the use of an operatic voice.
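For readers who prefer to think of formant centres in Hz, the ERB positions quoted above can be converted with a short sketch. It assumes that the ERB numbers on the vertical axis of the auditory image follow the ERB-rate scale of Glasberg and Moore (1990); the text does not state which scale is used, so the resulting frequencies should be treated as approximate. The function names are chosen here for illustration.

```python
import math


def hz_to_erb_number(frequency_hz):
    """ERB-rate value for a frequency in Hz (Glasberg and Moore, 1990; assumed scale)."""
    return 21.4 * math.log10(0.00437 * frequency_hz + 1.0)


def erb_number_to_hz(erb_number):
    """Inverse of hz_to_erb_number: frequency in Hz for a given ERB-rate value."""
    return (10.0 ** (erb_number / 21.4) - 1.0) / 0.00437


# Formant centres quoted in the text for the opera /i/ of the first singer.
opera_formants = [("singer's formant", 19), ("F2", 21), ("F3", 23), ("F4", 26)]

for label, erb_number in opera_formants:
    print(f"{label}: ERB {erb_number} is roughly {erb_number_to_hz(erb_number):.0f} Hz")
```

The same pair of functions can be used in the other direction, for example to check where a given harmonic of the voice pitch should fall on the ERB axis.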
Belt: The belting voice is like that of a barker at a fairground -- a very strong voice presented at a loud level so that it will be heard in the presence of competing voices and so that it will carry to people at a distance. The auditory image of the belt version of /i/ is not shown. It is similar to the speech /i/ but it is very regular, both in the shorter term as revealed in individual frames of the auditory image, and in the longer term as revealed in dynamic auditory images. F1 and F4 are in the expected places; F2 and F3 combine to form one large formant as is commonly the case in /i/. There is no singer's formant and no pseudo formant. Part way through the period of the upper formants there is a feature indicating that the excitation waveform has more than one peak per cycle.
In summary, with regard to voice quality, the versions of /i/ in Figure 4.3 suggest some broad characteristics that the auditory system might use to categorise the voice quality: Falsetto might be distinguished from speech, twang and opera by the lack of temporal regularity in the upper formants. Twang and opera might be distinguished from speech and falsetto by the presence of highly regular, relatively strong, upper formants and the presence of a pseudo formant between F1 and F2. Opera might be distinguished from all the others, and particularly twang, by the presence of the singer's formant and vibrato. With regard to vowel type, the similarities in the images of Figure 4.3 provide information about the class of sounds that can represent the phoneme /i/: All of the images have a strong low F1 centred close to the second harmonic. They all have a relatively quiet region in the middle. They all have a high F2 and it is usually strong. The details of the shapes of the formants might provide clues to the identity of the singer. Before pursuing the discussion of voice quality, vowel type and singer identity, it is useful to compare these images to those from another singer to determine whether the characteristics are consistent across singers.
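The voice-quality distinctions in this summary can be written down as a toy decision rule. The sketch below is only a restatement of the three sentences above for the images of the first singer; it is not a model of the auditory system and not a tested classifier. The cue names and the function are hypothetical, and each cue is assumed to have been judged by eye from the auditory image as simply present or absent.

```python
from dataclasses import dataclass


@dataclass
class VoiceCues:
    """Hypothetical yes/no cues read off an auditory image of /i/."""
    upper_formants_regular: bool   # temporal regularity in the upper formants
    upper_formants_strong: bool    # upper formants strong relative to F1
    pseudo_formant: bool           # pseudo formant between F1 and F2
    singers_formant: bool          # extra formant below F2
    vibrato: bool                  # slow oscillation of pitch and amplitude


def guess_voice_quality(cues):
    """Apply the summary's distinctions in order; a sketch, not a validated rule."""
    if not cues.upper_formants_regular:
        return "falsetto"
    if cues.singers_formant and cues.vibrato:
        return "opera"
    if cues.upper_formants_strong and cues.pseudo_formant:
        return "twang"
    return "speech"


twang_like = VoiceCues(True, True, True, False, False)
opera_like = VoiceCues(True, True, True, True, True)
print(guess_voice_quality(twang_like))
print(guess_voice_quality(opera_like))
```

The order of the tests matters: opera shares the regular, strong upper formants and the pseudo formant with twang, so the singer's formant and vibrato have to be checked first.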
Singer 2 on the vowel /i/
Auditory images of the sounds produced by a second man singing /i/ in the same four voices are shown in the four panels of Figure 4.4. The pitch of this man's voice is a little higher than that of the first singer (around 150 cps rather than 140 cps).
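The pitch difference between the two singers translates directly into the width of the auditory figure and the positions of the resolved harmonics. A minimal sketch of the arithmetic follows; the function names are chosen here for illustration.

```python
def pitch_period_ms(pitch_cps):
    """Period of one glottal cycle in milliseconds for a pitch in cycles/second."""
    return 1000.0 / pitch_cps


def harmonic_frequencies(pitch_cps, count):
    """Frequencies (cps) of the first `count` harmonics of the voice pitch."""
    return [pitch_cps * n for n in range(1, count + 1)]


for singer, pitch in [("singer 1", 140.0), ("singer 2", 150.0)]:
    print(f"{singer}: period {pitch_period_ms(pitch):.2f} ms, "
          f"harmonics {harmonic_frequencies(pitch, 4)}")
```

The period is the time interval at which the auditory figure repeats in the image, so the figures of the second singer are slightly narrower than those of the first.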
Speech: The upper part of the speech image of the second singer (Figure 4.4a) reveals three formants, F2, F3 and F4, centred just above ERBs 21, 23 and 26, respectively. The temporal regularity in this frequency region is reduced with respect to that observed in the speech of the first singer, and the strength of the upper formants is reduced with respect to that of F1. But the second singer is not less typical of the population; rather, the two singers between them illustrate the range of variation of the speech voice. The lower part of the speech image of the second singer presents a bit of a puzzle: at first glance, it appears to show a broad F1 involving harmonics one through four, centred at an unusually high position between harmonics two and three. There is a difficulty with this interpretation, however, because harmonic three is very weak relative to harmonics two and four. The alternative interpretation is that F1 is actually centred between the first and second harmonics, and the prominent fourth harmonic is a rather low pseudo formant like those in the twang and opera voices of the first singer.
Falsetto: The falsetto image for the second singer provides support for the latter interpretation of the speech voice. It shows a strong, very low F1 centred on the first and second harmonics, with the third harmonic totally suppressed. Then, in the broad region between F1 and F2, there is a prominent formant centred just below the fifth harmonic. The strength of this formant is so great as to suggest that it is a singer's formant, but its position around ERB 14 is rather low for the singer's formant, which is normally up around ERB 19. On the other hand, it seems odd to refer to such a prominent formant as a pseudo formant. The upper part of the image shows activity from ERB 18 to 29, some of it temporally regular and some temporally irregular, as is normally the case for falsetto. On the 6-ms time interval the activity is fairly regular in the region of ERB 22 and ERB 26, which are probably F2 and F4, respectively. There is some evidence of a weak F3 on the high side of F2 on the 12- and 16-ms time intervals. The activity above F4 is escaping breath that appears as high-frequency noise in channels that are not dominated by formants. Finally, there is an indication of a weak singer's formant between ERBs 18 and 19, where there is low-level activity that is reasonably regular and which peaks on multiples of the pitch period. In summary, it is possible to interpret this sound as a modified /i/ if the formants at ERBs 14 and 18 are considered exceptional and if it is assumed that the activity from ERBs 21 to 27 represents three formants, one of which is weak.
Twang: The general characteristics of the twang voices of the two singers are quite similar. F1 is centred between the second and third harmonics and there is a weak pseudo formant between F1 and F2. The upper three formants are centred just above ERBs 21, 23 and 26, which is close to where they are centred in the speech voice of the same singer (Figure 4.4a). All of the activity in the auditory image is temporally regular and the regularity extends to the longest time intervals and to the highest channels. The upper formants of the second singer are somewhat more complex than those of the first, especially F4, where there are consistent secondary peaks in the fine structure of the tail of the formant. There is also more activity above F4 in the case of the second singer. But overall the two images are more similar than different.
Opera: The final panel of Figure 4.4 shows that the opera voice is temporally regular and that it has strong upper formants like the twang voice of the same singer, and like the opera and twang voices of the first singer. Above F4, there is less activity in the opera voice of the second singer than in the twang voice, and the same distinction between the voices exists for the first singer. This is probably part of what makes opera a smoother sound than twang. Broadly speaking, the pattern of activity for the opera voice of the second singer is like that for the first singer, but a closer examination shows that it is difficult to interpret the opera /i/ of the second singer in terms of formants in the traditional way. There is only one large concentration of energy in the region normally occupied by F2, F3 and F4 of an /i/ sound, that is, ERBs 21-27. There is a formant with a normal shape between ERBs 19 and 21 which might be a rather low F2 for this singer or, more likely, a high singer's formant since the voice is opera. If the latter is the case, then the large upper formant is a combination of F2, F3 and F4, but it is difficult to say from this frame. Furthermore, the lower half of the image shows a sequence of resolved harmonics that do not group readily into formants either. It is possible that these components represent a low F1 and two pseudo formants between F1 and the singer's formant, but this seems a considerable stretching of the traditional analysis.
Voice quality in the vowels of women
The vowels of women's voices are broadband periodic sounds like those of men but, on average, the pitch of a woman's voice is about an octave above the pitch of a man's. Auditory images of a woman and a man singing /i/ on their natural pitches are presented in the upper and lower panels of Figure 4.5. The pitch difference is immediately apparent; the auditory figures are about half the width in the upper panel and the base of the auditory figure has shifted up from about ERB 5 to ERB 9. Both of these changes follow from the fact that the rate of glottal pulses in the woman's voice is about double that in the man's voice. Aside from the scale change imposed by the pitch shift, the auditory figures of the woman's /i/ are similar to those of the man's, as would be expected. In both cases, there is a low F1 which is separated by a distinct gap from a set of higher formants -- the standard pattern for an /i/. It is assumed that the part of the auditory system that processes the auditory image understands the scale change associated with a pitch change and that it can select the appropriate scale for a given analysis. Accordingly, to facilitate the comparison of voice quality, independent of pitch, the width of the auditory images is halved for women's vowels in the Figures that follow, and the minimum centre frequency is raised an octave.
Just before proceeding, note that the voice quality in Figure 4.5 is belt. The auditory figures show that it is a strong voice and the fine structure is temporally regular like that of twang and opera. There is evidence of a pseudo formant between F1 and F2 in the region of ERB 17 in both the woman's and the man's voices. The resonance at ERB 19 in the woman's voice is either F2 at a relatively low frequency, or a singer's formant. There is no singer's formant in the current sample of the belting voice of the man. It seems likely, however, that the presence or absence of a singer's formant is an individual difference involving an interaction between vowel type and speaker rather than a sex difference.
Singer 3 on the vowel /i/
Auditory images of the vowel /i/ sung by a woman in four different voices are presented in Figure 4.6. They are the same voices as those employed by the male singers in Figures 4.3 and 4.4, that is, speech, falsetto, twang and opera. A comparison of the individual panels of this figure with the corresponding panels in Figures 4.3 and 4.4 reveals many similarities. In each image of the woman's voice there is a low F1, just as there is in each /i/ of the men's voices, and in each case, it is very regular temporally. In the central region of each image, there is a gap in the activity at about the same place as in the men's voices (ERBs 15-18). In the upper portion of the image there is a cluster of high-frequency formants in all of the images.
There are also similarities in the distinctions between the woman's different voices and the distinctions between the men's different voices. The upper formants of the speech voice of the woman are more regular than those in the falsetto voice. The upper formants of the twang and opera voices are very regular and they are stronger than those in the speech and falsetto voices. There is more activity at the top of the image in the twang voice than there is in the opera voice and the formants are more rounded in the opera voice. So there is evidence that the characteristics that define vowel type and voice quality are the same for women and men. Auditory images for another woman singer lead to the same conclusion, although they are omitted here for brevity.
There are also consistent differences between the woman's voice and those of the men. The low harmonics of the woman's voice are more clearly resolved. This is a direct consequence of the higher pitch of the voice combined with the fact that the auditory filters get relatively narrower as centre frequency increases, and it is a general characteristic of women's speech. The increased resolution means that it is more difficult to specify the position of F1 for women. Nevertheless, the main difference is apparent in Figure 4.6. The centre frequency of F1 in the woman's voice is between the first and second harmonics, in the region of ERB 10-12, whereas in the men's voices, F1 is between the second and third harmonics, in the region of ERB 8-10. Thus, in absolute terms, F1 is higher in the woman's voice than in the men's, but in relative terms it is lower. Whereas the pitch shift is about an octave, the formant shift is less than half an octave, and this is generally true for women's voices.
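The comparison between the pitch shift and the formant shift can be checked with a rough calculation. The sketch below takes the midpoints of the two F1 regions quoted above (ERB 9 for the men and ERB 11 for the woman) and converts them to Hz on the assumption that the ERB axis follows the ERB-rate scale of Glasberg and Moore (1990); the text does not state the scale, and the use of midpoints is a further simplification introduced here. The pitch ratio of two is the 'about an octave' stated in the text.

```python
import math


def erb_number_to_hz(erb_number):
    """Frequency in Hz for an ERB-rate value (Glasberg and Moore, 1990; assumed scale)."""
    return (10.0 ** (erb_number / 21.4) - 1.0) / 0.00437


def shift_in_octaves(lower_hz, upper_hz):
    """Size of a frequency shift expressed in octaves."""
    return math.log2(upper_hz / lower_hz)


# Midpoints of the F1 regions quoted in the text (a simplification).
mens_f1_hz = erb_number_to_hz(9.0)      # ERB 8-10
womans_f1_hz = erb_number_to_hz(11.0)   # ERB 10-12

formant_shift = shift_in_octaves(mens_f1_hz, womans_f1_hz)
pitch_shift = shift_in_octaves(1.0, 2.0)  # glottal pulse rate roughly doubled

print(f"F1 shift:    {formant_shift:.2f} octaves")
print(f"pitch shift: {pitch_shift:.2f} octaves")
```

Under these assumptions the F1 shift comes out a little below half an octave while the pitch shift is a full octave, in line with the statement that F1 is higher in absolute terms but lower relative to the harmonics of the voice.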
At the same time, however, the second harmonic in the woman's voice exhibits more modulation in the centre of the cycle than its counterpart in the men's voices. The same is also true for the fourth singer's voice. The origin of this effect is not clear; greater resolution is usually accompanied by a reduction in modulation rather than an increase.
The upper formants in the men's voices fall in the region between ERB 20 and 30, with F2 near ERB 21, F3 near ERB 23 and F4 near ERB 26. There is activity in the region ERB 20-30 in each image of the woman's voice in Figure 4.6, and in each case there is evidence that the activity is grouped into formants. But in none of the images is the activity grouped into three formants near ERBs 21, 23 and 26. The same is true for the images of the second female singer. Formants most like those seen in the men's voices appear in the twang voice of the woman where there are flag shapes centred at ERBs 24 and 27. It seems likely that these are F3 and F4 and that the activity near ERB 21 is F2 even though the typical rightwards extension of activity is muted in this case. A similar analysis could be applied to the speech image where there are credible formant shapes at ERBs 24 and 27 and activity in the region of ERB 21, although here the shape is even less formant-like. If this is the correct interpretation of the number and position of the upper formants in the speech and twang images, then the activity near ERB 19 in these images is a weak pseudo formant on the fifth harmonic. The upper formants are poorly defined in the falsetto image but what regular activity there is, is centred between ERBs 23 and 24 where a strong F2 would be expected. The same analysis cannot, however, apply to the opera voice in Figure 4.6d where there are two strong and broad formant-like shapes centred in the region of ERBs 21 and 25. In this case, the obvious interpretation is that F3 and F4 have merged and that they are accompanied by an F2 that is larger and stronger than anything observed in the other three voices. It is not an unreasonable interpretation but, if it is correct, then the characteristics of an /i/ in a woman's voice appear to be dependent on the voice quality in which she chooses to speak.
The images for the second woman show the same strong F2 in the opera voice, contrasting with a weak F2 in each of the other voices.
Four singers on the vowel /a/
The auditory images for the other vowel, /a/, provide ....
Figure 4.1. Auditory images of the full vowels /ae/, /a/, /i/ and /u/. The positions of the lower formants largely determine the vowel type; the upper formants have relatively fixed positions.
Figure 4.2. Auditory images of the vowels /ae/ and /i/ played concurrently. In the upper panel the vowels have the same pitch (100 cps); in the lower panel the pitch of the /i/ is 125 cps. It is easier to see the presence of the second vowel and to assign the formants to the correct source in the lower panel.
Figure 4.3. Auditory images of the vowel /i/ from a male singer using four different voice qualities: a) speech, b) falsetto, c) twang, and d) opera.
Figure 4.4. Auditory images of the vowel /i/ from a second male singer using the same four voice qualities: a) speech, b) falsetto, c) twang, and d) opera.
Figure 4.5. Auditory images of the vowel /i/ sung by a) a woman and b) a man; they are singers 3 and 1, respectively.
Figure 4.6. Auditory images of the vowel /i/ sung by a woman (singer 3) in four different voice qualities: a) speech, b) falsetto, c) twang, and d) opera.