The Gammatone Auditory Filterbank
From CNBH Acoustic Scale Wiki
The gammatone auditory filter is defined in the time domain. It is essentially a section of a cosine wave, $\cos(2\pi f_c t + \phi)$, whose rate of onset is specified by a power function, $t^{n-1}$, and whose rate of offset is determined by a decaying exponential function, $e^{-2\pi b t}$. Thus the gammatone filter, $gt(t)$, is,
- $gt(t) = a \cdot t^{n-1} \cdot \cos(2\pi f_c t + \phi) \cdot e^{-2\pi b t}, \quad (t > 0)$
The frequency of the cosine wave, $f_c$, is set to the centre frequency of the auditory filter. The first term, $a$, is simply a scalar that specifies the gain of the filter. The terms are normally rearranged as follows to emphasize the envelope and the carrier of the impulse response.
- $gt(t) = a \cdot [t^{n-1} e^{-2\pi b t}] \cdot \cos(2\pi f_c t + \phi), \quad (t > 0)$ [Eq. 2.2.1]
The term in the square brackets is the envelope, and it has the form of the gamma distribution from statistics; the cosine term provides the fine structure of the impulse response. When the impulse response is convolved with a waveform, the output of the convolution emphasizes frequencies in the region of $f_c$ and attenuates components progressively as their frequency deviates from $f_c$. Since the cosine term sounds like a tone when presented as an acoustical wave, the function is referred to as a "gammatone filter" (Aertsen and Johannesma, 1980). When the parameters of the function are chosen to reflect the operation of the cochlea at a point along the basilar membrane, it is referred to as a "gammatone auditory filter".
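The time-domain definition can be evaluated directly, sample by sample. The following is a minimal sketch rather than the AIM implementation; the function name `gammatone_ir`, the sample rate, the duration and the default values of `n`, `b`, `a` and `phase` are illustrative assumptions.

```python
import math

def gammatone_ir(fc, fs, n=4, b=125.0, a=1.0, phase=0.0, duration=0.025):
    """Sample gt(t) = a * t**(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phase).

    fc and b are in Hz, fs is the sample rate in Hz, duration is in seconds.
    The default values are assumptions chosen for illustration only.
    """
    num_samples = int(round(duration * fs))
    ir = []
    for k in range(num_samples):
        t = k / fs
        envelope = t ** (n - 1) * math.exp(-2.0 * math.pi * b * t)  # gamma-shaped envelope
        carrier = math.cos(2.0 * math.pi * fc * t + phase)          # tone at the centre frequency
        ir.append(a * envelope * carrier)
    return ir

# A 25-ms impulse response centred on 1 kHz, sampled at 16 kHz.
ir = gammatone_ir(fc=1000.0, fs=16000.0)
print(len(ir), max(abs(v) for v in ir))
```

Convolving a sound with `ir` gives the output of one filter channel; the gain `a` would normally be chosen to normalise the peak response.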
The original motivation for the gammatone function as a model of auditory frequency selectivity was threefold: physiological, psychological and practical.
1. Physiological: The gammatone function provides an excellent fit to the impulse response of the basilar membrane measured physiologically in cats. The physiological impulse response is obtained with the revcor technique developed by de Boer and described in detail in de Boer and de Jongh (1978). Briefly, the cat is presented with a wideband noise and the response of a primary fiber in the auditory nerve is recorded with a micro-electrode. The noise waveform is then correlated with the stream of neural impulses that constitute the response to the noise, and the result of this "reverse correlation," or "revcor", provides a measure of the impulse response of the basilar partition at the point where the primary fibre is located. A concise description of the technique appears in Pickles (1988, pp 95-99). Carney and Yin (1988) fitted the gammatone function to "revcor" data from more than 150 individual fibers in cats and showed that the gammatone function does indeed provide a very good fit to "revcor" data over a wide range of centre frequencies and levels.
The dynamic range of the "revcor" technique is limited to about 25 dB and so it does not provide reliable information about the tails of the filter outside the passband. However, for everyday sounds, the output of the filterbank is largely determined by the shape of the passband of the filter, and so it is applicable to a large range of sounds. The revcor technique has the distinct advantage of being able to measure the passband of the filter at the stimulus levels where we listen to music and speech. Both physiological data (Evans, 1977; Carney et al., 1999) and psychological data (Unoki et al., 2006) indicate that the passband of the auditory filter is reasonably independent of level. The revcor technique also has the advantage of eliciting the data with a sound that has a uniform distribution of energy. The tuning curve is elicited with a point source, a sinusoid, whose level is confounded with its distance from the centre frequency of the filter.
2. Psychological: The amplitude characteristic of the passband of the gammatone filter is very similar to that of the roex filter commonly used to summarise frequency-selectivity data measured psychoacoustically in humans.
The amplitude characteristic of the human auditory filter is commonly measured with a notched-noise technique. The listener is required to detect a sinusoidal signal presented with a broadband noise masker which has a deep notch in the region of the signal. Signal threshold is measured as the width of the notch is varied to assess the selectivity of the auditory filter centred on the signal (Patterson, 1976). Patterson and Nimmo-Smith (1980) showed that the shape of the auditory filter is well described by a pair of back-to-back exponential functions, if the sharp peak at the centre frequency of the function is rounded off, and the sharp descent of the exponentials is rounded up at frequencies outside the passband. They developed a family of rounded-exponential filters which have, subsequently, been used to predict noise masking over a wide range of frequencies and levels in simultaneous and forward masking conditions; for a review see Patterson and Moore (1986).
In order to implement an auditory filterbank, one must have a phase characteristic as well as an amplitude characteristic for the filters. Although the notched-noise technique provides a good measure of the amplitude characteristic of the auditory filter, it does not provide any information concerning the phase characteristic. In an effort to overcome this problem, Schofield (1985) noted that the amplitude characteristic of the gammatone filter provided a good fit to the data from the original notched-noise experiment (Patterson, 1976), and suggested the gammatone filter as a model of the human auditory filter. The gammatone filter has a 'minimum' phase characteristic which seemed a reasonable assumption for the human auditory system at the time. It now seems clear that the phase characteristic is not strictly minimum phase (Kohlrausch and xxx, 1992) but it remains a reasonable assumption in a wide range of conditions. Following Schofield's lead, Patterson, Nimmo-Smith, Holdsworth and Rice (1988) compared the amplitude characteristic of the gammatone filter with that of the most comprehensive roex filter, roex(p, w, t). They found that a gammatone with a low order (2-3) provides the best fit over a large dynamic range (60 dB) but that a gammatone with a slightly higher order (n=4) provides the best fit to the passband of the roex filter which is more important for explaining masking in notched noise.
3. Computational: There is a very efficient recursive, digital filter which is highly stable and which provides a particularly good approximation to the gammatone filter both in amplitude and phase.
While investigating the form of the gammatone filter in the frequency domain, Holdsworth realised a) that an nth-order gammatone filter can be approximated by a cascade of n identical first-order gammatone filters, and b) that the first-order gammatone filter can be approximated by a particularly efficient, recursive digital filter. The implementation of the gammatone filterbank in AIM is described in Holdsworth, Nimmo-Smith, Patterson and Rice (1988). On average, across the frequency range of speech, the recursive gammatone filter is about an order of magnitude quicker than convolution of the sound with the gammatone impulse response. The computational load of AICAP is dominated by the filterbank; the combined load of the remaining stages is less than that of the filterbank, so an order of magnitude saving in the filterbank stage has a large effect on the overall performance of the model. Extensive reviews of the gammatone function as a filter are presented in Slaney (1993 xxx), Darling (199 xxx ). The relationship between the gammatone filter and cochlear mechanics is described in and Lyon (1996) xxx.
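One common way to realise a cascade of identical first-order recursive stages is to shift the signal down in frequency by the centre frequency, apply a one-pole lowpass filter n times, and shift the result back up. The sketch below follows that scheme as an assumption; it is not taken from Holdsworth et al. (1988), and the function names, the unity-gain normalisation and the test signals are hypothetical.

```python
import cmath
import math

def gammatone_recursive(x, fs, fc, b, n=4):
    """Approximate an nth-order gammatone filter with n one-pole recursive stages."""
    step = 2.0 * math.pi * fc / fs            # carrier phase advance per sample
    pole = math.exp(-2.0 * math.pi * b / fs)  # decay per sample, from exp(-2*pi*b*t)
    # Shift the band around fc down to 0 Hz (complex demodulation).
    z = [xk * cmath.exp(-1j * step * k) for k, xk in enumerate(x)]
    # Cascade of n identical first-order lowpass stages, each with unity gain at 0 Hz.
    for _ in range(n):
        state = 0j
        stage_out = []
        for zk in z:
            state = (1.0 - pole) * zk + pole * state
            stage_out.append(state)
        z = stage_out
    # Shift back up to fc and keep the real part.
    return [2.0 * (zk * cmath.exp(1j * step * k)).real for k, zk in enumerate(z)]

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

fs = 16000.0
on_tone = [math.cos(2.0 * math.pi * 1000.0 * k / fs) for k in range(1600)]
off_tone = [math.cos(2.0 * math.pi * 2000.0 * k / fs) for k in range(1600)]
on_level = rms(gammatone_recursive(on_tone, fs, fc=1000.0, b=135.0))
off_level = rms(gammatone_recursive(off_tone, fs, fc=1000.0, b=135.0))
print(on_level, off_level)
```

Each output sample costs a fixed number of operations per stage, whereas direct convolution costs one multiply-add per sample of the impulse response, which is where the speed advantage comes from.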
The Bandwidth of the Auditory Filter
There is a wealth of information in the literature concerning the bandwidth of the roex auditory filter: it increases monotonically with filter centre frequency and it is greater at high stimulus levels; it increases slowly with age and it is broader in listeners with hearing impairment of cochlear origin. A review of the roex filter in simultaneous masking is presented in Patterson and Moore (1986). The main effects are similar in forward masking conditions but the bandwidth is usually found to be a little narrower than in simultaneous masking. A review of the roex filter in forward masking is presented in O'Loughlin and Moore (1986). Glasberg and Moore (1990) have recently reviewed existing data on the roex filter for normal listeners and concluded that there is a broad middle range of stimulus levels and ages where the relationship between the Equivalent Rectangular Bandwidth of the filter and its centre frequency is well represented by
$\mathrm{ERB} = 24.7 + 0.108 \, f_c$ (2)
In words, the bandwidth is roughly 25 Hz plus a little over 10% of the centre frequency. The relationship can also be written as
$\mathrm{ERB} = 24.7 + f_c / 9.26$ (2a)
to stress the fact that the auditory system is nearly a 'constant Q' system, that is, a system in which the bandwidth is a fixed proportion of the centre frequency. Physical systems often have this characteristic. In engineering texts the proportion, Q, is specified as $f_c$/bandwidth so that more selective systems are associated with larger Qs. Thus, in engineering terms, the auditory filterbank is a constant Q system with a restriction on minimum bandwidth at low centre frequencies, and the characteristic of the system, Q, has a value of about 10.
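The bandwidth relationship in Equation (2) and the corresponding Q are easy to tabulate. This is a small sketch; the function names `erb_hz` and `quality_factor` are illustrative, not part of any existing library.

```python
def erb_hz(fc):
    """Equivalent Rectangular Bandwidth in Hz from Equation (2); fc in Hz."""
    return 24.7 + 0.108 * fc

def quality_factor(fc):
    """Q = fc / bandwidth, using the ERB as the bandwidth."""
    return fc / erb_hz(fc)

for fc in (100.0, 500.0, 1000.0, 4000.0, 10000.0):
    print(fc, round(erb_hz(fc), 1), round(quality_factor(fc), 2))
```

The table shows Q rising with centre frequency towards the limiting value 1 / 0.108 implied by Equation (2), while the 24.7-Hz floor keeps Q low at low centre frequencies.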
Returning to the gammatone filter, Holdsworth et al (1988) have shown that, when the centre frequency is large relative to the bandwidth (which it is in the case of the auditory system), the bandwidth of the gammatone filter is proportional to $b$, the decay parameter in the exponential term of Equation 1, and the proportionality constant, $a_n$, depends solely on $n$, the order of the filter. That is,
$\mathrm{ERB} = a_n b$ (3)
When the bandwidth of the gammatone filter is matched to that of the roex filter, the amplitude characteristics of their passbands are essentially indistinguishable. Thus, to tune the gammatone filterbank for use with normal human listeners we need only calculate $b$ from Equations 2 and 3. Specifically,
$b = 24.7 / a_n + 0.108 \, f_c / a_n$ (4)
Holdsworth et al (1988) provide an analytical expression for the ERB of the gammatone filter and a table of proportionality constants for $n$ in the range 1-9. When the order is 4, $a_4$ is 0.982, $b$ is 1.019 ERB, and
$b = 25.2 + 0.110 \, f_c$ (4a)
Holdsworth et al (1988) also provide an analytical expression for the 3-dB bandwidth of the gammatone filter, and proportionality constants for calculating b from 3-dB bandwidths. When the order is 4, the 3-dB bandwidth of the gammatone filter is 0.87 times b, or about 0.89 times the ERB. Equations 2 and 4a provide a complete specification of a gammatone filterbank for order 4, if we include the common assumption that the filters are distributed across frequency in proportion to filter bandwidth. The resulting gammatone filterbank will predict threshold for signals masked by stationary noises in the majority of cases encountered in the everyday world.
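The proportionality constant and the decay parameter can be checked numerically. The sketch below assumes the standard closed form for the ERB of a gammatone filter with envelope t^(n-1) exp(-2πbt), namely ERB / b = π (2n-2)! 2^-(2n-2) / ((n-1)!)^2; the text does not reproduce the expression from Holdsworth et al (1988), so treat the formula and the function names as assumptions.

```python
import math

def erb_constant(n):
    """Ratio ERB / b for an nth-order gammatone filter (assumed closed form)."""
    numerator = math.pi * math.factorial(2 * n - 2) * 2.0 ** -(2 * n - 2)
    return numerator / math.factorial(n - 1) ** 2

def decay_parameter(fc, n=4):
    """Decay parameter b in Hz that matches the ERB of Equation (2) at fc (Hz)."""
    return (24.7 + 0.108 * fc) / erb_constant(n)

for n in range(1, 10):
    print(n, round(erb_constant(n), 3))
print(round(decay_parameter(0.0), 1), round(decay_parameter(1000.0), 1))
```

For order 4 the constant evaluates to about 0.982, in agreement with the value quoted above, and the intercept and slope of the resulting expression for b round to 25.2 and 0.110.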
The frequency dimension of auditory representations
The ordinate on the figures portraying simulated basilar motion specifies the centre frequencies of the gammatone filters in the filterbank. Specifically, the centre frequency is the value where the zero line of the filtered wave intersects the ordinate. The unit of measurement on the frequency axis is ERBs, so the auditory filters are distributed across frequency in proportion to their bandwidth. This is a traditional assumption introduced by Fletcher (1953) and Zwicker et al. (1957). It is based on physiological data relating the frequency of a sinusoidal stimulus to the position of maximum response on the basilar partition, and to the selectivity of the partition at that point. Greenwood (1961) reviewed the early frequency/position and selectivity data for mammals ranging in size from mice, to humans, to elephants and concluded that the integral of the critical-band curve did, indeed, provide a good fit to the frequency/position data. As a result, he used the existing selectivity data to obtain his now classic frequency/position curve.
Subsequently, Zwicker and Terhardt (1980) and Moore and Glasberg (1983) reviewed sets of selectivity data for humans and integrated their respective versions of the critical-band scale to generate what are referred to as 'critical-band-rate' functions; that is, Bark-rate and ERB-rate functions that relate auditory-filter centre frequency to frequency. These critical-band-rate functions provide, arguably, the best scales for the frequency dimension in auditory representations of sounds. Recently, Greenwood (1990) updated his review of mammalian data on frequency/position and selectivity, and the paper includes a section devoted to the new bandwidth estimates obtained with humans in filter-shape experiments. Greenwood concludes that the ERB-rate scale provides a slightly better fit to the physiological data. He also agrees with Moore and Glasberg (1983) that, in humans, each ERB corresponds to about 0.9 mm of length along the basilar partition.
Glasberg and Moore (1990) have also updated their review of auditory filter bandwidths with new data at low and high centre frequencies, and this has led them to remove the quadratic term from their earlier ERB function. The result is the very simple critical-band function presented in Equation (2), which they integrated to obtain the following ERB-rate function (with $f_c$ in Hz)
- $\mathrm{ERB\text{-}rate} = 21.4 \log_{10}(0.00437 \, f_c + 1)$
It is this frequency scale that appears on the ordinate in figures displaying output from the normal gammatone filterbank.
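Channels can be placed at equal steps on the ERB-rate scale so that they are distributed in proportion to bandwidth. The sketch below assumes the ERB-rate formula published by Glasberg and Moore (1990), 21.4 log10(4.37 F + 1) with F in kHz, which is the integral of the reciprocal of Equation (2) up to rounding. The function names, the 100-6000 Hz range and the channel count are illustrative assumptions.

```python
import math

def erb_rate(f_hz):
    """Number of ERBs below f_hz (assumed Glasberg and Moore, 1990, formula)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_rate_to_hz(e):
    """Inverse of erb_rate: frequency in Hz at ERB-rate e."""
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def centre_frequencies(f_lo, f_hi, num_channels):
    """Centre frequencies equally spaced on the ERB-rate scale, lowest first."""
    e_lo, e_hi = erb_rate(f_lo), erb_rate(f_hi)
    step = (e_hi - e_lo) / (num_channels - 1)
    return [erb_rate_to_hz(e_lo + i * step) for i in range(num_channels)]

cfs = centre_frequencies(100.0, 6000.0, 32)
print([round(f) for f in cfs])
```

The spacing between adjacent centre frequencies in Hz grows with frequency, mirroring the growth of the ERB itself.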
The response of a filterbank with Bark-scale bandwidths
For historical reasons, the Bark scale is probably still the most commonly used critical-band function. It is succinctly described in Zwicker (1961) and justified in Zwicker and Terhardt (1980). A reasonable approximation to the Bark scale can be achieved by setting the minimum bandwidth to 80 Hz and the Q to 6.5 xxx in Equation (2a). Thus, the Bark scale and the ERB scale have basically the same form, but the Bark-scale bandwidths are about double xxx the ERB-scale bandwidths for centre frequencies below 1.0 kHz, and about 1.5 xxx times the ERB-scale bandwidths in the region above 1.0 kHz. At first glance these bandwidth differences might seem rather small. However, if a gammatone filterbank is assigned Bark-scale bandwidths, the output of the filterbank can be substantially different from that produced by the ERB-scale gammatone filterbank. When the 8-ms click train is presented to an auditory filterbank with Bark-scale bandwidths, the result is as shown in Figure 1.9. A comparison with the response of the ERB-scale filterbank (Figure 1.3) reveals that, whereas there are two resolved harmonics in the output of the Bark-scale filterbank, there are four to five resolved harmonics in the output of the ERB-scale filterbank. Furthermore, the region of isolated impulse responses extends down to about xxx 500? xxx Hz in the case of the Bark-scale filterbank. These differences are a simple illustration of the tradeoff between filter bandwidth and the duration of the impulse response in a filter system; as the bandwidth of the filter increases, the duration of the impulse response decreases. However, psychoacoustical data indicate that normal listeners resolve the first five to ten harmonics of a complex sound (Plomp xxx Zwicker xxx Moore xxx). Here, then, is a time-domain indication that the Bark scale underestimates frequency selectivity to an unacceptable degree.
Figure 1.9. The response of a Bark-scale gammatone filterbank to an 8-ms click train.
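The size of the difference between the two bandwidth functions can be tabulated directly. This sketch takes the provisional values from the text (a minimum bandwidth of 80 Hz and a Q of 6.5, both still marked as unconfirmed there) and compares them with Equation (2); the function names are illustrative.

```python
def erb_hz(fc):
    """ERB-scale bandwidth in Hz from Equation (2); fc in Hz."""
    return 24.7 + 0.108 * fc

def bark_like_bandwidth_hz(fc, min_bandwidth=80.0, q=6.5):
    """Approximate Bark-scale bandwidth; the defaults are the provisional values in the text."""
    return min_bandwidth + fc / q

def bandwidth_ratio(fc):
    """Bark-like bandwidth divided by ERB at centre frequency fc (Hz)."""
    return bark_like_bandwidth_hz(fc) / erb_hz(fc)

for fc in (250.0, 500.0, 1000.0, 2000.0, 4000.0, 8000.0):
    print(fc, round(bark_like_bandwidth_hz(fc), 1), round(erb_hz(fc), 1), round(bandwidth_ratio(fc), 2))
```

Under these assumed values the ratio is close to 2 around 500 Hz and settles near 1.5 at high centre frequencies.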
The response of the Bark-scale filterbank to the vowel /ae/ is shown in Figure 1.10. A comparison with the response of the ERB-scale filterbank (Figure 1.7) shows that the formants are in slightly different positions and that the Bark-scale formants have somewhat more energy than their ERB-scale counterparts. Neither of these differences is of any importance. However, there are other differences and they are more important. In particular, the formants have changed shape. In the case of the second and third formants, they have simply become a little broader in frequency and shorter in time, as expected. However, the fourth formant has become so short in the Bark-scale response that it is difficult to distinguish it from the response to the glottal pulse that runs vertically between the upper formants. The largest change is actually in the first formant: in the ERB-scale response, the first formant is represented by two resolved harmonics; in the Bark-scale response, the harmonics in the first formant are not resolved and the formant has a triangular shape that is more typical of the second and third formants.
Figure 1.10. The response of a Bark-scale, gammatone filterbank to the vowel /ae/.
The Bark scale also leads to excessive estimates of masked threshold. Consider the case where a 1.0 kHz sinusoid is masked by a notched noise. Patterson et al (1982) calculated threshold as a function of notch width for this experiment using a roex filter with a normal bandwidth and one with double that bandwidth. When the notch about the tone is narrow, the wide filter gives a threshold estimate 3 dB greater than the narrow filter, a relatively small discrepancy. But as the notch widens to 800 xxx Hz, the discrepancy grows to over 10 dB. In retrospect, it seems that the component interactions used to define the Bark scale measure something closer to the 10-dB bandwidth of the filter, rather than the ERB or the 3-dB bandwidth.
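The effect of doubling the filter bandwidth on predicted threshold can be illustrated with a simplified calculation. The sketch below is not the calculation of Patterson et al (1982). It assumes the one-parameter roex(p) weighting function W(g) = (1 + pg) exp(-pg), with g the deviation from the centre frequency divided by the centre frequency and ERB = 4 fc / p, a flat noise with a symmetric notch, and a threshold proportional to the noise power passed by the filter. All names are hypothetical.

```python
import math

def roex_tail_area(p, g):
    """Integral of W(x) = (1 + p*x) * exp(-p*x) from x = g to infinity."""
    return (2.0 + p * g) * math.exp(-p * g) / p

def threshold_difference_db(notch_half_width_hz, fc=1000.0, erb_narrow=132.7):
    """Predicted threshold with a double-width filter minus that with the normal filter, in dB."""
    g = notch_half_width_hz / fc
    p_narrow = 4.0 * fc / erb_narrow   # normal bandwidth
    p_wide = p_narrow / 2.0            # double bandwidth
    passed_narrow = roex_tail_area(p_narrow, g)
    passed_wide = roex_tail_area(p_wide, g)
    return 10.0 * math.log10(passed_wide / passed_narrow)

for half_width in (0.0, 100.0, 200.0, 400.0):
    print(half_width, round(threshold_difference_db(half_width), 1))
```

With no notch the wide filter passes twice the noise power, which is the 3-dB discrepancy mentioned above; the discrepancy then grows as the notch widens because the narrow filter rejects the distant noise far more effectively.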
At very high stimulus levels, there is a well-known upwards spread of masking (Wegel and Lane, 1928; Egan and Hake, 1954). These levels are beyond those encountered in everyday life and are so uncomfortable as to cause most people to find some way of reducing the level at the ear, either by donning ear defenders or removing themselves from the situation. It should also be noted that, in these early studies, masked audiograms were plotted on a logarithmic frequency scale which accentuates the apparent effect of low-frequency maskers by stretching the low-frequency part of the axis relative to higher frequencies (Patterson, 1974). At loud, but not extreme levels, the upwards spread of masking is more properly associated with a breakdown of the frequency selectivity of the lower tail of the auditory filter rather than the lower skirt of the passband (Lutfi and Patterson, 1984 xxx). There is some broadening of the passband on the low-frequency side but not sufficient to account for a doubling of filter bandwidth nor the upward spread of masking as traditionally conceived. In summary, it is usually reasonable to use the normal gammatone filterbank for everyday sounds.
The order of the gammatone filter
The order of the gammatone function, n, determines the number of filtering stages and, thus, the slope of the skirts of the attenuation characteristic. In the examples, the order is 4; the range of useful values is from about 2 to 8 for human hearing. The processing time increases in a linear fashion from order 3. Increasing the order of the filter increases the delay of the onset of the impulse response but it has little effect on the shape of the envelope of the impulse response for orders greater than three. Humans are not sensitive to small, monaural phase changes between filter channels (Patterson, 1987) and so filter order is not well constrained by human experimental data.
We use order 4 in most cases because this value provides the best match between the amplitude characteristics of the gammatone and roex filters. In addition, physiological data from de Boer and xxx (see the UCL Darling xxx paper for the ref) indicate that the phase characteristic of the auditory filter is closer to that of the order 4 gammatone.
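The dependence of the onset delay on filter order can be made concrete by locating the maximum of the envelope. This is a worked step added for illustration, using only the envelope term of the impulse response; the symbol $t_{\mathrm{peak}}$ is introduced here and does not appear in the source.

```latex
\frac{d}{dt}\left[ t^{n-1} e^{-2\pi b t} \right]
  = t^{n-2} e^{-2\pi b t} \left[ (n-1) - 2\pi b t \right] = 0
\quad\Longrightarrow\quad
t_{\mathrm{peak}} = \frac{n-1}{2\pi b}
```

For a fixed decay parameter $b$, the envelope peak is therefore delayed in proportion to $n-1$. As an example, with $n = 4$ and $b \approx 135$ Hz (the value given by Equation 4a at a centre frequency of 1 kHz), the envelope peaks about 3.5 ms after the impulse.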
The Fidelity/Useability Tradeoff in the Spectral Analysis
The origins of AICAP should be introduced very briefly at the end of the Intro, to establish the concept of a psychological model of hearing, and it should be noted that we will briefly discuss the fidelity/useability tradeoff at the end of each chapter. The auditory filterbank is a classic functional model; it is a signal processing module that transforms a sound into an output that is similar to the output that our physiological models tell us would come from the basilar membrane and outer hair cells in response to that sound. It performs a function that is similar to that performed by the basilar membrane, although the internal architecture of the auditory filterbank, and our conception of spectral analysis as the action of a set of digital filters, are radically different from the physiological processes they simulate. Thinking physiologically, sounds produce a travelling wave in the basilar partition. In this case, it is natural to think of the system's response in terms of partition displacement as a function of distance along the partition, and to employ a displacement/position representation of the response. Thinking functionally, the system breaks a sound down into its frequency components. In this case, it is natural to think of the system's response in terms of a spectrally ordered set of waves, and to use an amplitude/time representation of the response.
The gammatone auditory filterbank is particularly appropriate for simulating the cochlear filtering of broadband sounds like speech and music provided the sound level is in the broad middle range of hearing. The gammatone is a linear filter and the magnitude characteristic is approximately symmetric on a linear frequency scale. The auditory filter is roughly symmetric on the same scale when the sound level is moderate; however, at high levels the highpass skirt of the filter becomes shallower and the lowpass skirt becomes sharper. For broadband sounds, the shape of the surface of the filterbank output is largely determined by sound energy that passes through the passbands of the individual auditory filters. In this case, the effect of filter asymmetry is only noticeable at the very highest levels (over 85 dBA) where it causes a gradual smearing of the surface features. For narrowband sounds, the precise details of the filter shape can become important. For example, when a tonal signal is presented with a narrowband masker some distance away in frequency, the accuracy of the simulation will deteriorate as the frequency separation increases.
This is where we put the Transmission-line filterbank (Lyon, 1982). It provides a great alternative which was better motivated at the time it was introduced but which now needs modification.
- Fletcher, H. (1953). Speech and hearing in communication. (Van Nostrand). 
- Glasberg, B.R. and Moore, B.C.J. (1990). “Derivation of auditory filter shapes from notched-noise data.” Hear. Res., 47, p.103-138. 
- Greenwood, D.D. (1961). “Critical bandwidth and the frequency coordinates of the basilar membrane.” J. Acoust. Soc. Am., 33, p.1344-1356. 
- Greenwood, D.D. (1990). “A cochlear frequency-position function for several species - 29 years later.” J. Acoust. Soc. Am., 87, p.2592-2605. 
- Irino, T. and Patterson, R.D. (1997). “A time-domain, level-dependent auditory filter: The gammachirp.” J. Acoust. Soc. Am., 101, p.412-419. 
- Irino, T. and Patterson, R.D. (2001). “A compressive gammachirp auditory filter for both physiological and psychophysical data.” J. Acoust. Soc. Am., 109, p.2008-2022. 
- Moore, B.C.J. and Glasberg, B.R. (1983). “Suggested formulae for calculating auditory-filter bandwidths and excitation patterns.” J. Acoust. Soc. Am., 74, p.750-753.  
- Patterson, R.D., Robinson, K., Holdsworth, J., McKeown, D., Zhang, C. and Allerhand, M. (1992). “Complex sounds and auditory images.” In Auditory Physiology and Perception, Y. Cazals, L. Demany and K. Horner, editors (Pergamon Press, Oxford).
- Patterson, R.D., Unoki, M. and Irino, T. (2003). “Extending the domain of center frequencies for the compressive gammachirp auditory filter.” J. Acoust. Soc. Am., 114, p.1529-1542. 
- Unoki, M., Irino, T., Glasberg, B., Moore, B.C. and Patterson, R.D. (2006). “Comparison of the roex and gammachirp filters as representations of the auditory filter.” J. Acoust. Soc. Am., 120, p.1474-1492. 
- Zwicker, E., Flottorp, G. and Stevens, S.S. (1957). “Critical bandwidth in loudness summation.” J. Acoust. Soc. Am., 29, p.548-557. 
- Zwicker, E. and Terhardt, E. (1980). “Analytical expressions for critical band rate and critical bandwidth as a function of frequency.” J. Acoust. Soc. Am., 68, p.1523-1525.