From CNBH Acoustic Scale Wiki
Roy Patterson, Etienne Gaudrain, Tom Walters
5. The Acoustic Properties of Pulse-Resonance Sounds and the Auditory Variables of Perception
The final section of this Chapter reviews the relationship between the acoustic properties of sound and three variables of auditory perception, loudness, pitch and timbre, to illustrate how they relate to the variables of music perception described in the sections above, namely, melody, instrument family and register within a family. The American National Standards Institute (ANSI) has provided official definitions of loudness, pitch, and timbre, and these definitions are widely quoted. This Section begins with the definitions as they appear in ANSI (1994), since they might have been expected to specify just those relationships between physical and perceptual variables that we require to explain the perception of musical notes. The definitions are:
- 12.03 loudness. That attribute of auditory sensation in terms of which sounds may be ordered on a scale extending from soft to loud.
- 12.01 pitch. That attribute of auditory sensation in terms of which sounds may be ordered on a scale extending from low to high. Pitch depends primarily upon the frequency content of the sound stimulus, but it also depends upon the sound pressure and the waveform of the stimulus. Note — the pitch of a sound may be described by the frequency or frequency level of that simple tone having a specified sound pressure level that is judged by listeners to produce the same pitch.
- 12.09 timbre. That attribute of auditory sensation which enables a listener to judge that two nonidentical sounds, similarly presented and having the same loudness and pitch, are dissimilar. Note - Timbre depends primarily upon the frequency spectrum, although it also depends upon the sound pressure and the temporal characteristics of the sound.
These definitions are useful, inasmuch as they illustrate the desire to relate properties of perception to physical properties of sound, and they illustrate what is regarded by auditory scientists as a principled way of proceeding with this task. Unfortunately, the definitions focus on the perceptual properties without, in the end, specifying the relationship of each to the corresponding acoustic, or physical, variables, other than to say that both pitch and timbre depend primarily upon the frequency content of the sound. While true, this is not very helpful since it does not say which aspect of the frequency information is associated with pitch and which aspect is associated with timbre. The discussion of acoustic scale in Section 2 suggests that, for musical sounds at least, we can be more specific about the relationship between the acoustic properties of sound and the perceptions associated with musical notes and instruments. In particular, Ss, the position of the fine structure of the magnitude spectrum, largely determines the pitch of a musical note, and a melody is an ordered sequence of Ss values. The shape of the spectral envelope is closely associated with the perception of instrument family, or the family aspect of timbre. So it is envelope shape that supports the general distinction between, for example, brass and string instruments. And, Sf, the position of the envelope of the magnitude spectrum, combines with Ss to determine the register of the instrument within a family. The acoustic scale variables Ss and Sf are also prime determinants of our perception of the size of an instrument or the height of a singer. In this final section of the chapter, we review the relationship between these acoustic properties of sound and the traditional auditory variables, pitch and timbre, with a view to developing a more useful description of the mapping between the acoustic and auditory variables as they pertain to music perception.
5.1 The Effect of Source Size on Pitch and Timbre
Consider the definitions of pitch and timbre, and the question of how we perceive the physical changes that take place in a vowel as a child grows up, or how we perceive the physical changes that take place in a musical note as it is played on successively larger members of an instrument family, for example, when a trumpet, trombone, and tuba play C3, one after another. The logic of the ANSI definition of timbre is not entirely clear, but it would appear to involve a process of elimination, in which variables of auditory perception that do not affect timbre are identified and separated from the remaining variables, which by default are part of timbre. The perceptual variables of particular interest are duration, loudness and pitch.
Duration is the variable that is most obviously separable from timbre, and it illustrates the logic underlying the definition of timbre (although there is not actually a standard definition of the perception of duration). If a singer holds a note for a longer rather than a shorter period, it produces a discriminable change in the sound but it is not a change in timbre. Duration has no effect on the magnitude spectrum of a sound, once the duration is well beyond that of the temporal window used to produce the magnitude spectrum. The sustained notes of music are typically longer than 200 ms in duration, and the window used to produce the magnitude spectrum is usually less than 100 ms, so duration is unlikely to play a significant role in family timbre or register timbre. In general, then, the perceptual change associated with a change in the duration of a sustained note is separable from changes in the timbre of the note.
Loudness is also largely separable from timbre. If we turn up the volume control when playing a recording, the change will be perceived predominantly as an increase in loudness. The pitch of any given vowel and the timbre of that vowel will be essentially unaffected by the manipulation. The increase in the intensity of the sound produces a change in the magnitude spectrum of the vowel — both the fine structure and the envelope shift vertically upwards — but there is no change in the frequencies of the components of the fine structure and there is no change in the relative amplitudes of the harmonics. Nor is there any change in the shape of the spectral envelope. So, loudness is also separable from timbre.
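The point about a uniform gain can be checked numerically. The following is a minimal sketch, assuming a hypothetical set of harmonic amplitudes and a hypothetical 6 dB gain (neither is taken from the chapter): every component rises by the same number of decibels, so the levels of the harmonics relative to one another are unchanged.

```python
import math

def to_db(amplitudes):
    """Convert linear amplitudes to decibels (re 1.0)."""
    return [20.0 * math.log10(a) for a in amplitudes]

def relative_levels(levels_db):
    """Express each level relative to the first component."""
    return [level - levels_db[0] for level in levels_db]

# Hypothetical harmonic amplitudes for a vowel-like sound (illustrative only).
harmonics = [1.0, 0.5, 0.8, 0.2, 0.1]

# Hypothetical volume change: turn the sound up by 6 dB.
gain_db = 6.0
gain = 10.0 ** (gain_db / 20.0)
louder = [gain * a for a in harmonics]

# Vertical shift of each component, and the relative levels before and after.
shift = [b - a for a, b in zip(to_db(harmonics), to_db(louder))]
before = relative_levels(to_db(harmonics))
after = relative_levels(to_db(louder))
```

In this sketch `shift` is the same for every harmonic, which corresponds to the vertical shift of the whole magnitude spectrum, while `before` and `after` agree, which corresponds to the unchanged envelope shape.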
Thus, acoustic variables that do not affect either the shape of the envelope of the magnitude spectrum or the frequencies of the spectral components do not affect the timbre of the sound. The question is: ‘What happens when a simple shift is applied to the position of the fine structure, or to the position of the envelope, of a sound (on a log-frequency axis), that is, when we change Ss, Sf, or both?’ The current definition of timbre suggests that a change in Ss, which is heard as a change in pitch, does not affect the timbre of the sound, whereas a change in Sf, which is heard as a change in speaker size or instrument size, does affect the timbre of the sound. This is where the current definition of timbre becomes problematic, that is, when it treats the two aspects of acoustic scale differently with regard to their role in the perception of timbre.
Note, in passing, that shifting the position of the fine structure of the magnitude spectrum, while holding the envelope fixed, produces large changes in the relative amplitudes of the harmonics as they move through the region of a formant peak. So the relative magnitude of the components in the spectrum can change substantially without producing a change in timbre, by the current definition. Note, also, that shifting the envelope of the magnitude spectrum while holding the position of the fine-structure constant produces similar changes in the relative amplitudes of the component frequencies as they move through formant regions. Such shifts do not change the timbre category of a musical sound (the family timbre); they change the apparent size of the source, and if the change is large enough they change the perceived register of the instrument, which, of course, is a timbre change, by the current definition.
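The way harmonics move through a formant region can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the envelope is a single hypothetical formant modelled as a Gaussian bump on a log-frequency axis, centred at 500 Hz with a width of half an octave, and the parameter `sf` is an illustrative multiplier on the formant frequency. None of these values comes from the chapter.

```python
import math

def envelope_db(freq_hz, formant_hz=500.0, width_octaves=0.5):
    """Toy single-formant envelope: a Gaussian bump on a log-frequency axis."""
    distance = math.log2(freq_hz / formant_hz)
    return -20.0 * (distance / width_octaves) ** 2

def harmonic_levels(f0_hz, n_harmonics=6, sf=1.0):
    """Levels (dB) of the first harmonics of f0_hz under the toy envelope.

    sf multiplies the formant frequency, i.e. it shifts the envelope along
    the log-frequency axis without changing its shape.
    """
    return [envelope_db(k * f0_hz, formant_hz=500.0 * sf)
            for k in range(1, n_harmonics + 1)]

def strongest_harmonic(levels):
    """1-based number of the harmonic with the highest level."""
    return levels.index(max(levels)) + 1

# Shift the fine structure (f0) down an octave with the envelope fixed.
high_note = harmonic_levels(200.0)
low_note = harmonic_levels(100.0)

# Shift the envelope down an octave with the fine structure fixed.
larger_source = harmonic_levels(200.0, sf=0.5)
```

In both manipulations the harmonic that sits closest to the formant peak changes, so the relative amplitudes of the components change substantially even though the shape of the envelope is the same throughout.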
5.2 Acoustic Scale ‘Melodies’ and the Perception of Pitch and Timbre
The discussion focuses on a set of four melodies designed to emphasize the role of the acoustic scale variables in the perception of vocal pitch and timbre. The novel aspect of the melodies is that, in some cases, the acoustic scale of the filter, Sf, varies over the course of the melody, either on its own, or in conjunction with changes in Ss. The scale of the filter is normally fixed when an instrument plays a melody. A form of musical notation for the melodies is presented in Figure 4; it shows that the melodies all have four bars containing a total of eight notes. The melodies are in ¾ time, with the fourth and eighth notes extended to give the sequence a musical feel. The melodies have a ‘phonological text,’ that is, the notes are sung as syllables (pi, pe, ko, kuuu; ni, ne, mo, muuu), which emphasizes the human quality of the voice. As the timbre changes from vowel to vowel, it engages the phonological system and allows us to distinguish the role of envelope shape in melody perception from the role of Ss and the role of Sf. The phonological text is the same for all four melodies.
The syllables were originally sung by an adult male (author RP) who has an average GPR of about 120 cps and a vocal tract length of about 16.5 cm. STRAIGHT (Kawahara and Irino 2004) was used to vary the scale of the source, Ss, and the scale of the filter, Sf, for each of the syllables, to simulate changes in the GPR and VTL of the singer. The matrix of tones used to produce the melodies is shown in Figure 5. The abscissa of the matrix (x-axis) is the acoustic scale of the source, Ss, and it was varied to produce an octave of notes using the diatonic major scale of Western music. The ordinate of the matrix (y-axis) is the acoustic scale of the filter, Sf, and it was varied to simulate voices with an octave range of vocal tract lengths ranging from about 10 to 20 cm. As with the Ss dimension, the specific values of Sf were determined by the diatonic major scale of Western music. In other words, the Sf ratio between any two notes has the same numerical value as the corresponding Ss ratio, and the values of the Sf ratios are indicated in musical notation by the note names, C, D, E etc. The manipulation of Sf effectively extends the domain of notes from a diatonic musical scale to a diatonic musical plane as shown in Figure 5.
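The construction of the Ss-Sf plane can be sketched numerically. The sketch below assumes equal-tempered tuning for the diatonic major scale (the text does not state the tuning), takes the note E as the anchor with the original GPR of 120 cps and VTL of 16.5 cm, and assumes that VTL is inversely proportional to Sf. The names used here, such as `scale_ratios` and `plane`, are illustrative and are not part of STRAIGHT.

```python
# Note names and semitone offsets of a diatonic major scale spanning one octave.
NOTE_NAMES = ["C", "D", "E", "F", "G", "A", "B", "C'"]
SEMITONES = [0, 2, 4, 5, 7, 9, 11, 12]

# Anchor values: the original syllables sit at note E on both dimensions.
ANCHOR_NOTE = "E"
ANCHOR_GPR_CPS = 120.0
ANCHOR_VTL_CM = 16.5

def scale_ratios(anchor=ANCHOR_NOTE):
    """Ratio of each note to the anchor note, assuming equal temperament."""
    anchor_semitones = SEMITONES[NOTE_NAMES.index(anchor)]
    return {name: 2.0 ** ((s - anchor_semitones) / 12.0)
            for name, s in zip(NOTE_NAMES, SEMITONES)}

ratios = scale_ratios()

# Ss dimension: GPR scales directly with the ratio.
gpr = {name: ANCHOR_GPR_CPS * r for name, r in ratios.items()}

# Sf dimension: a higher Sf corresponds to a shorter vocal tract (assumption).
vtl = {name: ANCHOR_VTL_CM / r for name, r in ratios.items()}

# The diatonic plane: every (Ss note, Sf note) pair with its GPR and VTL.
plane = {(s_note, f_note): (gpr[s_note], vtl[f_note])
         for s_note in NOTE_NAMES for f_note in NOTE_NAMES}
```

Under these assumptions the GPR runs from roughly 95 to 190 cps and the VTL from roughly 20.8 down to 10.4 cm, which is in line with the approximate ranges quoted in the text.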
The arrows in Figure 5 show the sequences of notes in each melody. This alternative notation for the melodies illustrates the interaction of the acoustic scale variables. Returning to Figure 4, for each melody, the black notes show the progression of intervals for Ss (or GPR) as each melody proceeds, and the grey notes show the progression of intervals for Sf (or VTL) as the melody proceeds. The sound files for the melodies are available at http://www.acousticscale.org/link/SHAR2009Demo. The shaded note [E, E] on the Ss-Sf plane provides the anchor for the notation; it has the same GPR and VTL values as the original syllables.
5.2.1 Melody 1
The first melody simulates the normal situation wherein a singer with a fixed vocal tract length (VTL) varies the tension of the vocal cords to vary Ss in accordance with the black notes in Staff (1) of Figure 4. The grey notes (for Sf) do not vary in this melody, indicating that the VTL of the singer is fixed. The VTL is relatively long, so the singer is heard to be an adult male. The pitch of the voice drops by an octave over the course of the melody from about 200 cps, which is well above the original pitch, down to about 100 cps, which is a few notes below the original pitch. This descending melody is within the normal range for a tenor, and the melody sounds natural. As the melody proceeds, the fine-structure of the spectrum, Ss, shifts, as a unit, with each change in GPR, and over the course of the melody, it shifts an octave towards the origin. The ANSI definition of timbre implies that these relatively large Ss changes, which produce large pitch changes, do not produce timbre changes, and this seems entirely compatible with what we hear in this melody. So, this melody illustrates the commonly held belief, embodied in the ANSI definitions, that pitch is largely separable from timbre, much as duration and loudness are.
5.2.2 Melody 2
Problems arise when we extend the example and synthesize a version of the same melody but with a singer that has a much shorter vocal tract, like that of a small child [Fig. 4, Staff (2)]. There is no problem at the start of the melody; it just sounds like a child singing the melody. The starting pitch is low for the voice of a small child but not impossibly so. As the melody proceeds, however, the pitch decreases by a full octave, which takes it beyond the normal range for a child. As a result, in the latter part of the melody, we hear the voice quality change and, by the end of the melody, the child comes to sound rather more like a dwarf. The ANSI definition of timbre does not provide any basis for understanding the voice quality change from a child to a dwarf; within the traditional framework the changes that we hear as the melody proceeds are just pitch changes. But traditionally, voice quality changes associated with a change in speaker are regarded as timbre changes. This is the first form of problem with the standard definition of timbre: changes that are nominally pitch changes produce what would normally be classified as a timbre change.
5.2.3 Melody 3
In the next example [Fig. 4, Staff (3)], the roles of the acoustic-scale variables, Ss and Sf, are reversed. The position of the fine structure, Ss, is held fixed while the position of the envelope, Sf, shifts by an octave towards the origin. The change in Sf simulates a doubling of the VTL, from about 10 to 20 cm, which would normally be associated with a doubling of height. The Sf ratios between successive notes of the melody have the same numerical values as the Ss ratios of the first two melodies. As Melody 3 proceeds and the envelope shifts down by an octave, the child seems to get ever larger, and the voice comes to sound something like that of a counter tenor, that is, a tall person with an inordinately high pitch. The ANSI definition of timbre does not say anything specific about how changes in the position of the spectral envelope affect timbre or voice quality; the acoustic scale variable, Sf, was not recognized when these standards were written. Nevertheless, the definition gives the impression that any change in the spectrum that produces an audible change in the perception of the sound, without producing a change in duration, loudness or pitch, produces a change in timbre. Experiments with scaled vowels and syllables show that the just noticeable change in Sf is about 7% for vowels (Smith et al. 2005) and 5% for syllables (Ives et al. 2005), so all of the intervals in the melody would be expected to produce perceptible Sf changes. Since traditionally, voice quality changes are thought to be timbre changes, the fact that the singer at the start of the melody (a child) is different from the singer at the end of the melody (a counter tenor) seems compatible with the definition of timbre; the singer changes and the timbre changes.
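The comparison between the scale steps and the just noticeable change in Sf can be sketched as follows. The sketch assumes equal-tempered tuning and considers only the adjacent steps of the diatonic major scale, which are the smallest intervals a melody on this scale could contain; the actual intervals of Melody 3 are given in Figure 4 and are not reproduced here. For simplicity, the proportional change is computed for an upward step.

```python
# Sizes, in semitones, of adjacent steps in a diatonic major scale
# (C-D, D-E, E-F, F-G, G-A, A-B, B-C).
ADJACENT_STEPS = [2, 2, 1, 2, 2, 2, 1]

# Just noticeable changes in Sf quoted in the text.
JND_VOWELS = 0.07     # Smith et al. 2005
JND_SYLLABLES = 0.05  # Ives et al. 2005

def step_change(semitones):
    """Proportional change in Sf for an interval, assuming equal temperament."""
    return 2.0 ** (semitones / 12.0) - 1.0

def perceptible_steps(jnd):
    """For each adjacent scale step, report whether it exceeds the given JND."""
    return [step_change(s) > jnd for s in ADJACENT_STEPS]

semitone = step_change(1)
whole_tone = step_change(2)
syllable_flags = perceptible_steps(JND_SYLLABLES)
vowel_flags = perceptible_steps(JND_VOWELS)
```

Under these assumptions a whole tone is a change of about 12% and a semitone a change of about 6%, so every adjacent step exceeds the 5% figure reported for syllables, which seems the more relevant figure since the melodies are sung as syllables. A semitone on its own would fall slightly below the 7% figure reported for isolated vowels.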
However, we are left with the problem that large changes in Ss and Sf both seem to produce changes in voice quality, but whereas the perceptual changes associated with large shifts of the fine-structure along the log-frequency axis are not timbre changes, the perceptual changes associated with large shifts of the envelope along the same log-frequency axis are timbre changes, according to the ANSI definition. They both produce changes in the relative amplitudes of the spectral components, but neither changes the shape of the envelope and neither form of shift alters the phonological values of the individual syllables.
5.2.4 Melody 4
The problems involved in attempting to unify the perception of voice quality with the definition of timbre become more complex when we consider melodies where both Ss and Sf change as the melody proceeds. Consider the melody produced by co-varying Ss and Sf to produce the notes along the diagonal of the Ss-Sf plane (Fig. 5). The musical notation for the melody is shown in Figure 4, Staff (4). This melody is perceived to descend an octave as the sequence proceeds, and there is a progressive increase in the perceived size of the singer from a child to an adult male (with one momentary reversal at the start of the second phrase). It is as if we had a set of singers varying in age from 4 to 18 in a row on stage, and we had them each sing their assigned syllable in order, and in time, to produce the melody. This melody, in combination with the others, makes it clear that there is an entire plane of singers with different vocal qualities defined by different combinations of the acoustic scale variables, Ss and Sf. The realization that there is a whole plane of voice qualities makes it clear just how difficult it would be to produce a clean definition of timbre that excludes one of the acoustic scale variables, Ss, and not the other, Sf. If changes in voice quality are changes in timbre, then changes in pitch (Ss) can produce changes in timbre. This would seem to undermine the utility of the current definitions of pitch and timbre.
5.3 Fitting the Concept of Acoustic Scale in the Definition of Pitch and Timbre
5.3.1 The ‘Second Dimension of Pitch’ Hypothesis
At first glance, there would appear to be a fairly simple way to solve the problem; we could designate the perceptual dimension associated with the acoustic scale of the filter, Sf, to be a second dimension of pitch. Then, this second dimension of pitch could be excluded from the definition of timbre along with the first dimension of pitch. For the singing voice, manipulation of Sf on its own would sound like the change in perception that occurs over the course of Melody 3, where Ss is fixed on the upper C and Sf decreases by a factor of two over the course of the melody. This does, however, lead to several problems. Firstly, semitone changes in the scale of the filter, Sf, are not large enough to produce clear differences in the associated perception, so this second version of pitch would not support accurate perception of novel melodies, in the way that the first form of pitch does (e.g., Pressnitzer et al. 2001; Ives and Patterson 2008). The salience of changes in Sf is more like the salience of the weak Ss pitch that arises when the energy in a tone is restricted to high, unresolved harmonics, and pitch discrimination requires Ss changes of four semitones, or more. The second form of pitch would, in some sense, satisfy the ANSI definition of pitch which is not concerned with melodies, and which only requires that the attribute of auditory sensation can be used to order notes on a scale extending from low to high. It seems reasonable to say that the tones at the start of Melody 3 sound “higher” than the tones at the end of the melody, which would support the ‘second dimension of pitch’ hypothesis.
The ‘second dimension of pitch’ hypothesis also leads to another problem. To determine the pitch of a sound, it is traditional to match the pitch of that sound to the pitch of either a sinusoid or a click train, that is, to a perception that is based on the scale of the source, Ss. Moreover, it seems likely that if listeners were asked to pitch match each of the notes in Melody 3, among a larger set of sounds that diverted attention from the orderly progression of Sf in the melody, they would probably match all of the tones with the same sinusoid or the same click train, and the pitch of the matching stimulus (bound to an Ss value) would be the upper C. This would leave us with the problem that the second form of pitch, based on Sf, changes the perception of the sound but it does not change the pitch to which the sound is matched (its Ss value). So the “pitch” change associated with a change in Sf would have to be segregated from a normal pitch change and given a separate definition. It would also require changes in the ANSI definitions of pitch and timbre because currently, a change in perception (like that associated with changes in Sf) that does not produce a change in Ss pitch (or loudness, or duration) is a change in timbre. In short, the ‘second dimension of pitch’ hypothesis would appear to lead us back to the position that changes in Sf produce changes in the timbre of the sound.
The ‘second dimension of pitch’ hypothesis also implies that if we play a random sequence of notes on the musical plane of Figure 5, the voice quality changes that we hear are all pitch changes, and they involve no change in timbre. This seems unreasonable when the acoustic scale changes are sufficiently large to produce a clear change in the perception of who is singing.
Finally, there is the problem that many people hear the perceptual change in Melody 3 as a change in speaker size, and they hear a more pronounced change in speaker size when changes in Sf are combined with changes in Ss, as in Melody 4. Ignoring the perception of speaker size is another problem inherent in the ‘second dimension of pitch’ hypothesis; source size is an important aspect of perception, and pretending that changes in the perception of source size are just pitch changes seems like a fundamental mistake for a model of perception.
5.3.2 The Scale of the Filter, Sf, as a Dimension of Timbre
Rather than co-opting the acoustic scale of the filter, Sf, to be a second dimension of pitch, it would seem more reasonable to think of it as an internal dimension of timbre – a dimension of timbre which for voices is associated with vocal register, singer sex and singer size. This, however, leads to a different problem which is, in some sense, the inverse of the ‘second dimension of pitch’ problem. Once it is recognized that shifting the position of the fine structure of the spectrum is inherently similar to shifting the position of the envelope of the spectrum, and that the two position variables are different aspects of the same property of sound (acoustic scale), then it seems unreasonable to have one of these variables, Sf, within the realm of timbre and the other, Ss, outside the realm of timbre. For example, consider the issue of voice quality; both of the acoustic scale dimensions affect voice quality and they interact in the production of a specific voice quality (e.g. man, woman, child, dwarf, counter tenor). Moreover, the scale of the source, Ss, affects the perception of the singer’s size, in a way that is similar to the perceptual effect of the scale of the filter, Sf (Smith and Patterson 2005). Thus, if we define the scale of the filter, Sf, to be a dimension of timbre, then we need to consider that the scale of the source, Ss, may also need to be a dimension of timbre. After all, large changes in Ss affect voice quality which is normally considered to be an aspect of timbre.
5.4 The Independence of Spectral Envelope Shape
There is one further aspect of the perception of these melodies that should be emphasized, which is that neither of the acoustic scale manipulations causes a change in the perception of the phonology of the syllables; we always hear ‘pi, pe, ko, kuuu; ni, ne, mo, muuu,’ independent of the VTL and GPR values of the singers. That is, the changes in timbre that give rise to the perception of a sequence of syllables are unaffected by changes in Ss and Sf, even when these scale changes are large (Smith et al. 2005; Ives et al. 2005). The changes in timbre that define the phonology are associated with changes in the shape of the envelope, as opposed to the position of the spectral envelope or the position of the spectral fine structure. Changes in the shape of the envelope produce changes in vowel type in speech and changes in instrument family in music. Changing the position of the envelope and changing the position of the fine structure both produce substantial changes in the relative amplitudes of the components of the magnitude spectrum, but they do not change the timbre category of these sounds, that is, they do not change the vowel type in speech or the instrument family in music.
The ANSI definitions of pitch and timbre are not much help in understanding the perception of musical tones, in the sense of understanding what gives rise to the perception of melody, instrument family and register within a family. The ANSI definitions simply associate both pitch and timbre with unspecified aspects of the frequency content of a sound. In music and speech research, it is traditional to segregate one aspect of the frequency information, namely, F0 (the repetition rate of the sound), from the remainder of the information which is represented by the spectrogram. F0 is then associated with the pitch of the instrument or the pitch of the voice, in the same way that we have associated the scale of the source, Ss, with pitch. Thus, in music and speech research there is, at least, the segregation of the main determinant of pitch from the distribution of frequency information across the acoustic frequency dimension. The differences between these approaches and the acoustic-scale approach presented in this chapter are illustrated in Figure 6. The lower row shows how the frequency information is (or is not) divided up in each case, and the upper row shows the components of auditory perception; the arrows indicate the associations between the components of the frequency information and the components of perception. In the first column, which corresponds to the ANSI definition, there is only one arrow associating all of the frequency content, indiscriminately, with both pitch and timbre. The second column, corresponding to music and speech research, shows how F0 is segregated from the spectrogram and associated with pitch.
The third column shows how the scale of the source, Ss, and the scale of the filter, Sf, are segregated from the shape of the envelope of the magnitude spectrum in the current approach. The scale of the source is directly related to musical pitch and melody. The shape of the envelope is directly related to the family aspect of timbre, and for the human voice this is further subdivided into different vowel types. These aspects of the mapping from acoustic properties to perceptual variables are straightforward. The mapping between acoustic properties and register within a family is a little more complicated; both of the acoustic scale variables contribute to the perception of register. Both of the acoustic scale variables also contribute to the perception of instrument size and singer size, which are related perceptions in different contexts. It is also the case that the relative magnitude of the acoustic scale variables contributes to our perception of whether a specific instrument is a good, or bad, example of its class. Although the division of frequency information into three components, and the mapping from these components to the perception of musical tones, is somewhat more complicated than in traditional descriptions, it is not excessively complicated, and it does provide for a much better understanding of how the physical properties of instruments, and the acoustic properties of sound, relate to the auditory perceptions that musical tones produce.