The Tonotopic Dimension of the Auditory Image
From CNBH Acoustic Scale Wiki
The first stage of auditory image construction is frequency analysis which occurs in the inner ear, or cochlea. The analysis is performed by the basilar membrane in conjunction with the outer hair cells, and it is this analysis that creates the frequency dimension of the auditory image. The frequency analysis and the basic patterns of motion produced by speech sounds in the cochlea are the main topics of this chapter.
The Outer Ear and Middle Ear
Sound from the air passes through the outer ear and middle ear on its way to the inner ear, or cochlea, and the effects of these external structures on incoming sound are considered here briefly before proceeding to the frequency analysis performed in the cochlea. The outer ear, or pinna, funnels sound to the eardrum, or tympanic membrane, at the inner end of the ear canal. For some animals, like rabbits and horses, the funneling provides useful amplification of incoming sound, but in humans the effect is small. The effect of the middle ear is more important; it focuses the sound energy arriving at the eardrum onto the membrane that forms the entrance to the cochlea, the oval window. The middle ear, from the tympanic membrane to the foot plate of the last middle ear bone, operates like a thumb tack, or drawing pin. The head of the tack is like the tympanic membrane; it collects energy over a relatively wide surface area from a soft source (the thumb or the air). The shaft of the tack is like the middle ear bones; it concentrates the energy onto a smaller area that pushes on a more resistant material (a pin board or the oval window of the cochlea). In the process, the middle ear matches the relatively low "impedance" of sound in air to the relatively high impedance of sound in cochlear fluids. The concentration of sound by the middle ear makes the auditory system of land animals much more sensitive than it would otherwise be and, as a result, the development of the middle ear was a very important step in the evolution of hearing by land animals (Ref to Manley SHAR book). However, aside from increasing sensitivity, the middle ear and the outer ear do not play a major role in determining the pattern of activity observed in the neural representation of sound, and so they do not have a large effect on the form of auditory perceptions, aside from their loudness.
For humans, the most noticeable effect of the middle ear is that limitations on the motion of the middle ear bones restrict the range of frequencies that we can hear to about 16 Hz to 16,000 Hz. The effect of these frequency limits is illustrated in Figure 2.2.1 for a click train with a period of 8 ms; it has a pitch of 125 Hz, which is close to the pitch of the note B2, an octave and a semitone below middle C on the keyboard. The upper panel shows the pressure wave for the click train in air; the lower panel shows the pressure wave at the input to the cochlea. The frequency limits imposed by the middle ear bones appear as a brief decaying oscillation following each of the individual clicks. There is no change to the basic form of the energy; it is still a pressure-time wave as it enters the cochlea.
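The relationship between the 8-ms period, the 125-Hz pitch and the note B2 can be checked with a few lines of arithmetic. The sketch below assumes equal temperament with A4 tuned to 440 Hz and the common MIDI note-numbering convention (A4 = 69, middle C = 60); neither assumption comes from this chapter, and the function names are purely illustrative.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def repetition_rate_hz(period_ms):
    """Repetition rate of a periodic sound, given its period in milliseconds."""
    return 1000.0 / period_ms

def nearest_note(freq_hz, a4_hz=440.0):
    """Nearest equal-tempered note (assumes A4 = 440 Hz, MIDI note 69)."""
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    name = NOTE_NAMES[midi % 12]
    octave = midi // 12 - 1              # MIDI 60 is C4 (middle C)
    note_hz = a4_hz * 2 ** ((midi - 69) / 12)
    return name + str(octave), note_hz, midi

rate = repetition_rate_hz(8.0)           # 8-ms click train
note, note_hz, midi = nearest_note(rate)
semitones_below_middle_c = 60 - midi     # 12 semitones per octave

print(rate, note, round(note_hz, 2), semitones_below_middle_c)
```

Under these assumptions the nearest note is B2 at roughly 123.5 Hz, thirteen semitones (an octave and a semitone) below middle C, which is consistent with the description in the text.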
The sounds of speech and music are, not surprisingly, composed of frequencies that fall well within the range perceived by people with normal hearing, and for such sounds, the motion of the oval window at the entrance to the cochlea is always fairly similar in form to the wave entering the pinna. The waveform of the vowel in the word ‘hat’ is presented in Figure 2.2.2a. It is a classic pulse-resonance sound consisting of a train of rounded clicks, or glottal pulses, with fairly large, damped oscillations between them. The period of the wave is just over 8 ms and so it produces a pitch that is just a little lower than that produced by the click train in Figure 2.2.1. The motion produced at the oval window by the vowel sound is presented in Figure 2.2.2b, and it is quite similar to the waveform of the vowel in air. The vowel is composed of components that are almost entirely within the frequency range of the middle ear and so the small changes observed largely reflect small phase changes introduced in transmission to the cochlea. The phonetic symbol for the vowel in ‘hat’ is /ae/. The 8-ms click train and the 8-ms /ae/ vowel are used in this part of the book to illustrate the succession of internal representations that we think are produced as the auditory system constructs the space of auditory perception in which temporally regular sounds like the click train and the vowel appear in a stabilized form.
The operation of the outer and middle ear is well documented in many texts; for example, Pickles (1988, Chapter 2), Lutmann (xxx) and Manley (xxx SHAR).
Frequency Analysis and Auditory Filterbanks
The first major transformation applied by the auditory system to incoming sound is the frequency analysis performed by the basilar partition in the cochlea. The structure of the cochlea is like that of a snail -- a tapering tunnel enclosed in a rigid spiral case (ref to an illustration, xxx). In humans, the length of the tunnel in the shell is about 35 mm. The basilar partition runs down the middle of the tunnel, dividing it into two parts as the name suggests. The partition is composed of a pair of membranes (the basilar membrane and the tectorial membrane) and two strips of "hair" cells which are sandwiched between the membranes (the outer hair cells and the inner hair cells). When sound enters the cochlea, it causes the basilar partition to vibrate. At the opening to the cochlea (referred to as the base), the partition is thin and the tunnel is relatively wide. As the partition proceeds towards the tip of the snail shell (referred to as the apex), the partition becomes thicker and the tunnel narrows. The progression in these physical properties means that high-frequency components of a sound cause greater motion of the partition near the base of the cochlea where the sound comes in, and low-frequency components cause greater motion of the partition towards the apex. The result is that the frequency components of a sound are distributed along the length of the partition, roughly in accordance with the logarithm of their frequency. Descriptions of cochlear frequency analysis are presented in many hearing texts; a detailed discussion of the physiology is presented in Chapter 3 of Pickles (1988); more compact summaries with a psychophysical perspective are presented in the introductory texts of Moore (2003) and Yost (1994).
Physiologists have made enormous advances in the past decade in understanding the details of how the basilar membrane and outer hair cells perform frequency analysis (for a review see xxx SHAR vol xxx). As yet, however, accurate models of basilar membrane motion tend to be restricted to describing the motion at a single point on the basilar membrane in response to fairly simple sounds like sinusoids. Moreover, the equations of motion are complex and, even with modern computers, the computational load of a physiologically accurate model is prohibitive. For perceptual research, we need a functional model of auditory frequency analysis that can process extended recordings of complex sounds like speech and music and support models of perceptual processing beyond the cochlea, such as those used for speech recognition or music transcription. Typically, the tradeoff between physiological fidelity and manageable computational load leads to the use of an ‘auditory filterbank’ to simulate cochlear frequency analysis. In this case, the motion at a particular point on the basilar partition is simulated by a digital filter whose centre frequency is the same as the frequency associated with that point on the partition, and whose frequency selectivity is a reasonable approximation to the selectivity at that point on the partition. The filter is most sensitive to energy at its centre frequency, and the amplitude of its response drops away as the frequency of the energy in the sound deviates from the centre frequency of the filter. The function describing the relationship between the input to a filter and its output is referred to as its "transfer function." It has a spectral form which shows the magnitude of the output of the filter as a function of frequency in response to an input of fixed magnitude, and it is referred to as the filter's "spectral magnitude function". It also has a time-domain form which shows how the filter responds immediately after being excited by an acoustic impulse, and this aspect of the transfer function is referred to as the filter's "impulse response."
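The link between the two forms of the transfer function can be illustrated with a toy filter. The sketch below is not an auditory filter: it assumes a simple exponentially damped cosine as the impulse response, a 16-kHz sample rate, and a 1000-Hz centre frequency with a 100-Hz bandwidth, all chosen purely for illustration. The spectral magnitude function is then obtained by evaluating the discrete Fourier sum of the impulse response at a few probe frequencies.

```python
import cmath
import math

FS = 16000  # sample rate in Hz (an assumption for this sketch)

def damped_cosine(fc_hz, bw_hz, dur_s=0.05):
    """Toy impulse response: a cosine at fc_hz under an exponentially decaying envelope.
    The decay rate pi * bw_hz gives a 3-dB bandwidth of roughly bw_hz."""
    n = int(dur_s * FS)
    return [math.exp(-math.pi * bw_hz * k / FS) * math.cos(2 * math.pi * fc_hz * k / FS)
            for k in range(n)]

def magnitude_at(h, f_hz):
    """Spectral magnitude of impulse response h at one frequency (direct Fourier sum)."""
    acc = sum(x * cmath.exp(-2j * math.pi * f_hz * k / FS) for k, x in enumerate(h))
    return abs(acc)

h = damped_cosine(1000.0, 100.0)
peak = magnitude_at(h, 1000.0)

# Gain in dB relative to the response at the centre frequency.
gain_db = {f: 20 * math.log10(magnitude_at(h, f) / peak)
           for f in (800, 900, 1000, 1100, 1200)}

for f, g in gain_db.items():
    print(f, "Hz:", round(g, 1), "dB")
```

The printed gains fall away on either side of 1000 Hz, which is the behaviour described above: the filter is most sensitive at its centre frequency, and a longer-ringing impulse response (smaller bandwidth) would give a sharper spectral peak.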
The selectivity of an auditory filter is typically summarized in terms of its bandwidth (BW), and the ratio of the centre frequency of the filter to its bandwidth. The ratio is referred to as the filter’s “quality” (Q). For auditory filters, the bandwidth is on the order of 10% of the centre frequency at low sound levels, so Q is on the order of 10. In order to provide a complete frequency analysis that reveals all of the components of a complex sound at their respective levels, the bank of filters used to analyse the sound must have filters whose centre frequencies are spaced no more than one bandwidth apart. If the filterbank is to cover the frequency region of speech and music, from about 100 to 6000 Hz, the filterbank must have about 40 filters, and if the output is intended to present a surface that describes basilar membrane motion, it is better if the filterbank has two filters per bandwidth rather than one. Accordingly, auditory filterbanks used to represent human frequency analysis of speech and music sounds typically have around 75 filters covering the range 100 to 6000 Hz. Each of these filters represents about 0.45 mm of basilar membrane length, since the length of the basilar membrane in humans is about 35 mm. Hartmann (1997) provides an accessible introduction to digital filters for auditory research.
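The filter counts quoted above can be reproduced with a back-of-the-envelope calculation. The sketch below assumes a strictly constant relative bandwidth of 10% across the whole range (real auditory filters are relatively broader at low frequencies) and, following the simplification in the text, spreads the filters over the full 35-mm membrane. With filters spaced one bandwidth apart, the filter density at frequency f is 1/(0.1 f) filters per Hz, so the count is the integral of that density from 100 to 6000 Hz.

```python
import math

F_LO_HZ = 100.0       # lower edge of the speech and music range (from the text)
F_HI_HZ = 6000.0      # upper edge of the range (from the text)
REL_BW = 0.10         # bandwidth as a fraction of centre frequency, i.e. Q of about 10
MEMBRANE_MM = 35.0    # approximate length of the human basilar membrane

def filter_count(f_lo_hz, f_hi_hz, rel_bw, per_bandwidth=1):
    """Filters needed to tile [f_lo, f_hi] when spacing is proportional to bandwidth.
    Integrating the density per_bandwidth / (rel_bw * f) gives a logarithm."""
    return per_bandwidth * math.log(f_hi_hz / f_lo_hz) / rel_bw

n_one = filter_count(F_LO_HZ, F_HI_HZ, REL_BW)                   # one filter per bandwidth
n_two = filter_count(F_LO_HZ, F_HI_HZ, REL_BW, per_bandwidth=2)  # two per bandwidth
mm_per_filter = MEMBRANE_MM / 75                                 # 75 filters, as in the text

print(round(n_one), round(n_two), round(mm_per_filter, 2))
```

This gives about 41 filters at one per bandwidth and about 82 at two per bandwidth, the same order as the figures of 40 and 75 quoted above, and just under half a millimetre of membrane per filter for a 75-channel filterbank.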
There is a relatively simple expression, referred to as the gammatone function (Aertsen and Johannesma, 1980), that provides a reasonably accurate representation of auditory filtering about a given frequency, and banks of these gammatone filters, tuned to frequencies across the speech range, are often used to simulate cochlear frequency analysis. The "gammatone auditory filterbank" of Patterson et al. (1992) will be used to illustrate the principles of auditory frequency analysis, specifically by comparing the response of the gammatone auditory filterbank to the click train of Figure 2.2.1 with the response to the vowel of Figure 2.2.2. The mathematical definition of the gammatone filter is presented in the next section, along with the motivation for this specific family of auditory filters.
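As a preview of the kind of expression involved, the following sketch uses the form of the gammatone impulse response commonly given in the literature: a carrier at the centre frequency multiplied by a gamma-distribution envelope, g(t) = t^(n-1) exp(-2 pi b ERB(fc) t) cos(2 pi fc t). The parameter values are assumptions for illustration only (order n = 4, b = 1.019, and the equivalent rectangular bandwidth formula of Glasberg and Moore, 1990); the definitions used in this chapter may differ in detail. The code also estimates how long each filter rings, taken here, arbitrarily, as the time for the envelope to fall to 1% of its peak.

```python
import math

def erb_hz(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore, 1990 formula, assumed)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def gammatone(t_s, fc_hz, order=4, b=1.019):
    """Unnormalised gammatone impulse response at time t_s (seconds)."""
    envelope = t_s ** (order - 1) * math.exp(-2.0 * math.pi * b * erb_hz(fc_hz) * t_s)
    return envelope * math.cos(2.0 * math.pi * fc_hz * t_s)

def ringing_time_ms(fc_hz, floor=0.01, order=4, b=1.019, dt_ms=0.01, max_ms=200.0):
    """Last time (ms) at which the envelope is still at least `floor` times its peak."""
    decay = 2.0 * math.pi * b * erb_hz(fc_hz)

    def envelope(t_s):
        return t_s ** (order - 1) * math.exp(-decay * t_s)

    peak = envelope((order - 1) / decay)   # the envelope peaks at t = (n - 1) / decay
    last_ms = 0.0
    for i in range(1, int(max_ms / dt_ms) + 1):
        t_ms = i * dt_ms
        if envelope(t_ms / 1000.0) >= floor * peak:
            last_ms = t_ms
    return last_ms

for fc in (250.0, 1000.0, 2000.0, 4000.0):
    print(int(fc), "Hz: ERB", round(erb_hz(fc), 1), "Hz, rings for about",
          round(ringing_time_ms(fc), 1), "ms")
```

With these assumed parameters the ringing time shrinks as centre frequency rises, dropping below the 8-ms period of the click train somewhere around 2 kHz; the exact crossover depends on the arbitrary 1% criterion.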
The gammatone is a linear filter; that is, it does not change its properties as a function of sound level. Moreover, the frequency response of the gammatone filter is effectively symmetric. The human auditory filter does vary with level (Glasberg and Moore, 1990) and at high levels the tails of the frequency response become markedly asymmetric (Rosen and Smith, 1988; Rosen et al., 1992) with the tail on the low-frequency side of the filter applying much less attenuation than the tail on the high-frequency side. The auditory filter also applies strong compression to frequency components that fall within its passband. As a result, the use of the gammatone function as an auditory filter is limited to broadband sounds that are not too loud. The final section of the chapter describes more recent attempts to develop a level-dependent version of the gammatone filter, to encompass a wider range of physiological and psychophysical data. It is referred to as the "compressive gammachirp" auditory filter (Irino and Patterson, 1997; Irino and Patterson, 2001; Patterson et al., 2003; Unoki et al., 2006).
Basilar membrane motion in response to a click train
The output of the gammatone filterbank in response to a simple click train is shown in Figure 2.2.3. The period of the click train is 8 ms, as before, and it produces a pitch of 125 cps (B2). The parameters of the filterbank have been set to values appropriate for a young adult with normal hearing. The abscissa of the figure is time in milliseconds (ms). Each of the fine lines in the figure shows the waveform at the output of an individual auditory filter when it is driven by a segment of the click train 24 ms in duration. Together the set of filter outputs defines a surface which represents the motion of the basilar partition as a function of time in response to this stimulus. Note that the ordinate is compound; there is an explicit ordinate and an implicit ordinate. The explicit ordinate is filter centre frequency in ERBs; it specifies the centre frequencies of the individual filters on an auditory frequency scale that is explained a little later in the chapter. The implicit ordinate is the amplitude of the output of the filter; the amplitude is shown by the oscillation of the individual lines but the level is not overtly labelled. Although the amplitude ordinate of the filtered waves is implicit, the relative levels of the channels are strictly preserved in this and subsequent figures showing basilar membrane motion.
The surface in Figure 2.2.3 illustrates the basic properties of basilar membrane motion when the stimulus is a periodic sound. In the high-frequency channels where the filters are broad, the individual clicks of the click train produce responses that are effectively filter impulse responses, which die away and leave a region of quiescence before the next click of the train arrives. As the centre frequency of the filter decreases, the filter bandwidth decreases and the impulse response gets longer and longer. Eventually, it reaches the point where the filter is still ringing when the next click in the train arrives, and when this happens, the tail of the response to the earlier click interacts with the start of the response to the later click. In the current example, this occurs about half way down from the top of the figure.
The outputs of five of the channels from Figure 2.2.3, from across the frequency range, are reproduced in Figure 2.2.4 to illustrate how and when the tail of one impulse response interacts with the start of the next. In the upper two channels, where the centre frequency is above 2 kHz, there is no interaction; the response to one click dies away before the next click arrives. In this case, the response in each channel to each click is that filter's impulse response. As the centre frequency and filter bandwidth decrease, and the click responses begin to interact, the output of the filter comes to look like a sinusoid with asymmetric amplitude modulation (the middle channel). As the centre frequency and bandwidth decrease beyond this point, the depth of modulation between the peaks of successive click responses decreases and the modulation becomes more symmetric. Finally, in the region of the lowest three harmonics, where the filter bandwidth is narrow with respect to the spacing of the harmonics, the filter output is effectively a sinusoidal wave with little or no modulation.
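The transition from isolated harmonics to interacting click responses can be gauged by comparing the filter bandwidth at each harmonic with the 125-Hz spacing between harmonics. The sketch below assumes the equivalent rectangular bandwidth formula of Glasberg and Moore (1990) as the measure of bandwidth; it is a rough indicator only, since the degree of modulation in a channel also depends on the shape of the filter skirts.

```python
def erb_hz(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore, 1990 formula, assumed)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

F0_HZ = 125.0  # click rate, and hence the spacing between adjacent harmonics

# Ratio of filter bandwidth to harmonic spacing for the first 16 harmonics.
# Well below 1: the filter passes essentially one harmonic (near-sinusoidal output).
# Above 1: several harmonics fall inside the filter and beat against each other.
ratios = {k: erb_hz(k * F0_HZ) / F0_HZ for k in range(1, 17)}

first_wide = min(k for k, r in ratios.items() if r > 1.0)

for k in (1, 3, 5, 8, 16):
    print("harmonic", k, "at", int(k * F0_HZ), "Hz: bandwidth/spacing =", round(ratios[k], 2))
print("bandwidth first exceeds the harmonic spacing at harmonic", first_wide)
```

Under this assumption the bandwidth is only about half the harmonic spacing at the third harmonic, in keeping with the nearly sinusoidal output described for the lowest three harmonics, and it first exceeds the spacing at the eighth harmonic (1000 Hz).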
Finally, with regard to the lowest harmonics, consider the intriguing question as to how the basilar membrane -- a continuous structure -- manages to go up and down at the rate dictated by one harmonic in one region of the cochlea and, at the same time, go up and down at a different rate in a nearby region, as dictated by an adjacent harmonic. The filterbank can be used to examine the microstructure of the transition from one isolated harmonic to another. Figure 2.2.5 shows a 16-ms segment of the click train response in the frequency region from about 600 to 1000 Hz, which shows the response to harmonics 5, 6, 7 and 8 of the click rate, 125 cps. The figure shows that the filters respond to the first click of the train by ringing at their own centre frequency and the peaks of the output waves form continuous ridges that slant towards the right as centre frequency decreases. When the next click arrives, there is a complicated adjustment which takes a different form in each channel and which produces a saddle structure in the region between harmonics. The figure shows that for harmonics 5 and above, the response throughout much of the cycle reflects the centre frequency of the filter rather than the frequency of the nearest harmonic, and at the end of the cycle there is an abrupt adjustment to get back in step with the click train.
The click train is a good sound for illustrating the operation of the cochlea because it excites all of the channels and reveals how simple impulse responses interact across the membrane -- effects which are not observed in the response to individual sine tones. The click train response reveals the underlying structure that forms the basis of the response of the system to all pulse-resonance sounds. Now consider the basilar membrane motion that occurs in response to a vowel sound and how it compares to the click train response of Figure 2.2.3.
Basilar membrane motion in response to a vowel
The simulation of basilar-membrane motion produced with a gammatone filterbank in response to the vowel in ‘hat’ is shown in Figure 2.2.6. The phonetic symbol for the vowel is /ae/. The regularly spaced glottal pulses that excite the vocal tract are like inverted, somewhat rounded versions of the clicks in the click train. In this example, the inter-pulse time is just over 8 ms and the repetition rate of the pattern that appears in the basilar membrane motion (Figure 2.2.6) is essentially the same as the repetition rate of the click response in Figure 2.2.3. Like the click train, the vowel excites a wide range of auditory filters and the pattern within the period is strongly asymmetric in time; there is an abrupt onset of activity in response to the pulse and then a progressive decay over the course of the period.
The vowel and the pattern of membrane motion differ from those of the click train because the glottal pulses of the vowel pass through the resonances of the vocal tract on their way out to the air and these vocal resonances reinforce activity at some frequencies while attenuating activity at other frequencies, as can be observed in Figure 2.2.6. The reinforced regions are referred to as 'formants' and they appear as triangular features in the upper half of Figure 2.2.6, where the resonance extends the ringing of the auditory filters across a range of channels. Between the formants, the interaction of the resonances actually causes an attenuation of the natural ringing of the auditory filters. As frequency increases, the formants are observed to ring for a shorter duration. On this quasi-logarithmic frequency axis, the width of the formants in frequency is approximately the same. Vowels are largely distinguished by the position of the formants in frequency, the level of the formant peaks, and the shape of the formant in time. The features that distinguish vowel type (e.g. /a/, /e/, /i/, /o/, /u/) are described towards the end of this chapter, following the description of the details of the gammatone auditory filterbank.
Finally, returning briefly to the click response in Figure 2.2.3, note that there is a minor lengthening of the responses in channels with centre frequencies around 24 ERBs. This is because there is a broad resonance in the ear canal which emphasizes energy in the region of 2.5 to 3.0 kHz.
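The correspondence between a position of 24 ERBs and the 2.5 to 3.0 kHz region can be checked numerically. The sketch below assumes the ERB-rate scale of Glasberg and Moore (1990), in which the number of ERBs below a frequency f (in Hz) is 21.4 log10(4.37 f / 1000 + 1); the text does not state which version of the scale was used for the figures, so this is an illustration and not a description of the actual plotting code.

```python
import math

def hz_to_erb_rate(f_hz):
    """Position on the ERB-rate scale (Glasberg and Moore, 1990 formula, assumed)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate: frequency in Hz at a given ERB-rate value."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) / 4.37 * 1000.0

f_24 = erb_rate_to_hz(24.0)

print("24 ERBs corresponds to about", round(f_24), "Hz")
print("100 Hz  ->", round(hz_to_erb_rate(100.0), 1), "ERBs")
print("6000 Hz ->", round(hz_to_erb_rate(6000.0), 1), "ERBs")
```

On this scale 24 ERBs falls at roughly 2.8 kHz, inside the 2.5 to 3.0 kHz region attributed to the ear canal resonance.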
- Appendix 2.2.A describes the Roex Auditory Filterbank.
- Appendix 2.2.B describes the Gammatone Auditory Filterbank.
- Appendix 2.2.C describes the Gammachirp Auditory Filterbank.
- Aertsen, A. and Johannesma, P. (1980). “Spectro-temporal receptive fields of auditory neurons in the grassfrog. I. Characterization of tonal and natural stimuli.” Biol. Cybern, 38, p.223-234. 
- Glasberg, B.R. and Moore, B.C.J. (1990). “Derivation of auditory filter shapes from notched-noise data.” Hear. Res., 47, p.103-138. 
- Hartmann, W.M. (1997). Signals, Sound, and Sensation. (AIP Press). 
- Irino, T. and Patterson, R.D. (1997). “A time-domain, level-dependent auditory filter: The gammachirp.” J. Acoust. Soc. Am., 101, p.412-419. 
- Irino, T. and Patterson, R.D. (2001). “A compressive gammachirp auditory filter for both physiological and psychophysical data.” J. Acoust. Soc. Am., 109, p.2008-2022. 
- Moore, B.C.J. (2003). An Introduction to the Psychology of Hearing. (Academic Press). 
- Patterson, R.D., Robinson, K., Holdsworth, J., McKeown, D., Zhang, C. and Allerhand, M. (1992). “Complex sounds and auditory images”, in Auditory Physiology and Perception, Cazals, Y., Demany, L. and Horner, K. editors (Pergamon Press, Oxford).
- Patterson, R.D., Unoki, M. and Irino, T. (2003). “Extending the domain of center frequencies for the compressive gammachirp auditory filter.” J. Acoust. Soc. Am., 114, p.1529-1542. 
- Pickles, J.O. (1988). An Introduction to the Physiology of Hearing. (Academic Press). 
- Rosen, S., Baker, R.J. and Kramer, S. (1992). “Characterizing changes in auditory filter bandwidth as a function of level”, in Auditory Physiology and Perception, Cazals, Y., Horner, K. and Demany, L. editors, p.171-177 (Pergamon Press). 
- Rosen, S. and Smith, D.A.J. (1988). “Temporally-based auditory sensations in the profoundly hearing-impaired listener”, in Basic Issues in Hearing, Duifhuis, H., Horst, J.W. and Wit, H.P. editors, p.431-439 (Academic).
- Unoki, M., Irino, T., Glasberg, B., Moore, B.C. and Patterson, R.D. (2006). “Comparison of the roex and gammachirp filters as representations of the auditory filter.” J. Acoust. Soc. Am., 120, p.1474-1492.