The Location of the Auditory Image
From CNBH Acoustic Scale Wiki
BINAURAL AUDITORY IMAGES FOR NOISE-RESISTANT SPEECH RECOGNITION
Roy D. Patterson, Timothy R. Anderson and Keith Francis
INTRODUCTION
In the auditory system, the primary fibers that encode the mechanical motion of the basilar partition are phase locked to that motion, and auditory processing in the mid-brain preserves this information, to varying degrees, up to the level of the inferior colliculus. We know that this timing information is used in the localisation of point sources (Blauert, 1983; Stern and Colburn, 1978) and it is probably also used to separate point sources from diffuse background noise. The temporal resolution of this neural processing is on the order of tenths of milliseconds, and so traditional speech preprocessors, with frame durations on the order of 15 milliseconds, remove the fine-grain temporal information on which the processing is based. The performance of these systems deteriorates badly when the speaker is in a noisy environment with competing sources. This suggests that we will eventually need to extend the temporal resolution of speech recognition systems if we are to achieve the kind of noise resistance characteristic of human speech recognition. In this paper, we describe a) an auditory model designed to stabilize repeating patterns of phase-locking information (Patterson et al., 1995), b) a method of extending the model to binaural processing, c) the 'data-rate problem' associated with auditory models as speech preprocessors, and d) the construction of a noise-resistant binaural auditory spectrogram for speech recognition.
1. AUDITORY IMAGES AND THE SPACE OF AUDITORY PERCEPTION
When an event occurs in the world around us, we experience an auditory image of the event, in the same way that we experience a visual image of the event. The auditory image reveals the pitch and loudness of the source and its sound quality, or timbre. These latter properties help us identify the speaker and their voice quality. The Auditory Image Model was designed to simulate the construction of our auditory images and the space of auditory perception in which they exist. The construction of auditory images occurs in three stages, each of which filters the sound in some way and arranges the products of the filtering process along a spatial dimension to produce the space of auditory images shown in Figure 1. It is argued that this is similar to the basic space of auditory perception. The structures that sounds produce in this space are thought to be the first internal representation of the sound of which we can be aware, and the representation that serves as the basis for all subsequent forms of auditory pattern recognition.
1.1 Separation by Frequency and Arrival Time
The first two stages simulate the frequency analysis performed in the cochlea and the laterality analysis performed in the mid brain (Patterson, 1994a; Blauert, 1983), and they produce something like the two-dimensional frequency/laterality plane illustrated in Figure 1. This is the traditional overview of spectral and binaural processing and so the description is limited to an introduction of the schematic representation of these processes. In the first stage, sound flows through the outer ear into the cochlea where it is filtered by frequency and transformed into a neural activity pattern (NAP). The channels of the NAP are set out along a frequency dimension which is vertical in the figure. In the second stage of processing, the sound is filtered by arrival time in each NAP channel, and the products are set out across an array of laterality which is the horizontal dimension in the figure. At high frequencies, where the head creates a significant shadow, the sound in the right ear is also higher in level. The level difference affects perceived laterality much less than the time difference and so, for brevity, the argument is restricted to time differences. The figure illustrates what happens when there is a source 40 degrees to the right containing mid-frequencies, and another source 20 degrees to the left containing high and low frequencies but no mid-frequencies. The components from the source on the right meet on a vertical towards the left; the components from the source on the left meet on a vertical towards the right. This is the standard view of auditory preprocessing for frequency and laterality.
1.2 Separation by Time Interval
Now consider what we do and do not know about a sound as it occurs at a point on this frequency-laterality plane. We know that the sound contains energy at a specific frequency, say 1.0 kHz, and that the source of the sound is at a specific angle relative to the head, say 40 degrees to the right. What we do not know is whether the sound is regular or irregular; whether the source is a musical instrument, a barking dog, or a washing machine -- all of which could be at 40 degrees to the right and have energy at 1 kHz. The information about regularity is contained in the temporal fine-structure of the NAP, and that information is not available in this representation. In AIM, peripheral auditory processing includes one further stage, namely, 'strobed' temporal integration (Patterson, 1994b). It is as if the auditory system maintained a dynamic interval histogram for each point on the frequency-laterality plane -- a histogram in which the signal to measure time intervals is derived from local maxima in the NAP itself rather than being specified externally by a trigger. This form of temporal integration preserves the distinction between sounds with temporally regular and irregular fine structure while simultaneously revealing modulation of the envelope. In essence, the components of the incoming sound are sorted, or filtered, according to the time intervals in the fine structure of the NAP. The results of this strobed temporal integration are set out along a time-interval dimension which is the depth dimension in the figure, and this dimension completes the space of auditory perception. All of the information associated with a point source appears in one of the vertical planes behind the frequency-laterality plane, and that information is the auditory image of the source at that moment.
When the sound is tonal like the vowel /ae/ in 'past', the histograms are regular and related as illustrated by the auditory image in the upper part of Figure 2. When the sound is noisy like the /s/ in 'past', the histograms are irregular and unrelated as illustrated by the auditory image in the lower part of Figure 2. These auditory images reveal the dramatic differences in the time-interval structure of voiced and unvoiced sounds. It is also the case that, when tonal and noisy components occur together in a sound, the regular components form a stable structure that stands out like a figure and the irregular components form a fluctuating, fuzzy background.
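The idea of a strobed interval histogram can be illustrated with a toy sketch for a single NAP channel. This is not the AIM implementation: the function names, the strobe criterion (any local maximum above a threshold) and the pulse trains are assumptions chosen only to show that a regular input produces a few tall histogram peaks while an irregular input produces a low, scattered pattern.

```python
def strobe_points(nap, threshold=0.0):
    """Indices of local maxima in the NAP that exceed the threshold (assumed strobe rule)."""
    return [i for i in range(1, len(nap) - 1)
            if nap[i] > threshold and nap[i] > nap[i - 1] and nap[i] >= nap[i + 1]]

def strobed_integration(nap, max_interval, threshold=0.0):
    """Add the NAP activity that follows each strobe into an interval histogram."""
    image = [0.0] * max_interval
    for s in strobe_points(nap, threshold):
        for k in range(max_interval):
            if s + k < len(nap):
                image[k] += nap[s + k]
    return image

def pulse_train(positions, length):
    """Unit pulses at the given sample positions, zero elsewhere."""
    marks = set(positions)
    return [1.0 if i in marks else 0.0 for i in range(length)]

# Regular train: one pulse every 10 samples (hypothetical 'tonal' channel).
regular = pulse_train(range(0, 100, 10), 100)
# Irregular train: hand-picked, unevenly spaced pulses (hypothetical 'noisy' channel).
irregular = pulse_train([3, 11, 16, 29, 37, 52, 58, 71, 83, 94], 100)

regular_image = strobed_integration(regular, max_interval=30)
irregular_image = strobed_integration(irregular, max_interval=30)

# Intervals of 10 and 20 samples dominate the regular image;
# the irregular image spreads its counts over many intervals.
print([k for k, v in enumerate(regular_image) if v > 0])
print(max(regular_image[1:]), max(irregular_image[1:]))
```

Because the strobes come from the NAP itself, the histogram needs no external trigger, which is the property the text emphasises.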
2 BINAURAL SEPARATION PRIOR TO TEMPORAL INTEGRATION
The description of auditory processing in terms of the space of auditory perception focuses attention on the tripartite intersection where a point on the frequency-laterality plane meets the origin of the time-interval dimension. In AIM, this is where binaural processing ends and temporal integration begins. This section describes a functional version of the binaural mechanism that could, at one and the same time, determine the laterality of the source and combine the NAPs from the left and right ears without smearing regular time-interval patterns in the NAPs. The resulting binaural NAP would have the resolution of the original NAPs, thus enabling the construction of a binaural auditory image with the temporal resolution of the original NAPs. This 'binaural AIM' is a functional model of binaural processing based on the traditional physiological model of laterality (Stern and Colburn, 1978). The primary difference is the suggestion that the binaural coincidence detector could also be used to combine the left and right NAPs into a binaural NAP on a pulse-by-pulse basis at the moment of coincidence.
2.1 Coincidence versus cross-correlation
Consider the temporal microstructure of activity that arises in the 500-Hz channel of the frequency-laterality plane in response to a 500-Hz point source 30 degrees to the right of the listener. The upper panel of Figure 3 illustrates the positions of two pulses just before coincidence on the 30 degree line to the left of centre. The period of the sound is 2 ms and the neural encoding mechanism in AIM reduces the upper half of a cycle of basilar-membrane motion to a NAP pulse about half its original duration, so the pulses are 0.5 ms in width and the next pulse in each channel would follow after an interval of about 1.5 ms -- a substantial delay on the binaural time scale. The middle and lower panels of Figure 3 illustrate the operation of two different binaural mechanisms, the standard cross-correlation mechanism in the bottom panel and a 'coincidence gate' mechanism in the middle panel. The latter is AIM's version of coincidence detection. These panels show the state of the mechanism at a time about 0.125 ms later than that in the upper panel by which time the pulses have been interacting for 0.1 ms. Now consider the form of the interaction in the two mechanisms and, in particular, the laterality lines that produce output during the interaction and the length of time over which they produce output.
Cross-correlation: In cross-correlation, the pulses flow past each other and at each point along the laterality dimension where the pulses overlap, and throughout the time that they overlap, the mechanism produces output as indicated by the bold upwards-pointing arrows. So on the -30 degree line, there is output from the time that the leading edges of the NAP pulses meet until the time that their trailing edges part -- a total duration equal to the sum of the durations of the two pulses.
Thus, when cross-correlation is applied to uni-polar functions like NAPs, it extends the activity in time and smears the temporal fine structure of the function. Furthermore, laterality lines a considerable distance from -30 degrees produce output for a significant, albeit shorter, duration. The range of laterality lines that produce output is greater than the width of either of the NAP pulses, and in the case of low-frequency channels like 500 Hz, this range encompasses the complete range of lateralities in the system. It is also the case that the situation shown in the bottom panel is the minimum temporal integration that the process might be expected to apply. Most implementations of cross-correlation perform the cross-multiply and integrate processes on a frame basis and, whereas the greatest temporal overlap in the current example is 1 ms, the frame base in binaural cross-correlation models is typically about 3 ms (Stern and Trahiotis, 1995). Consequently, these models remove the temporal fine structure from the input representation, and, if the fine structure is to appear in the auditory image, it has to get there by a parallel route that avoids this temporal integration. The problems with cross-correlation are well known (Stern and Trahiotis, 1995) and successful binaural models usually employ lateral inhibition or weighting functions in the laterality domain to reduce the off-laterality response, or to sharpen the laterality information of cross-correlation, and produce an accurate measure of the direction of the source. The motivation for these models is limited to determination of the direction of a source; they are not concerned with constructing a representation that preserves sound-quality information, and thus not concerned with the difference between coincidence detection and cross-correlation.
Coincidence Gating: It is possible to avoid both the temporal smearing and the spread of laterality introduced by cross-correlation with a 'coincidence gate' which 'drains' the activity that causes coincidence from the laterality plane at the moment of coincidence. In this case, the activity cannot proceed across the laterality plane and initiate coincidence on other laterality lines. The principle is illustrated schematically in the middle panel of Figure 3 for the moment 0.125 ms after the moment in the upper panel. When the leading edges of the NAP pulses coincide on the -30 degree line, the event opens a gate at -30 degrees and the activity of both pulses is drained out of both channels until the level of one of the NAP pulses returns to zero, at which point the coincidence gate closes. Thus, at the moment shown in the middle panel, the leading 0.1 ms of the NAP pulses has been removed from the plane. The total duration of the output on the -30 degree line is the duration of the shorter NAP pulse, so there is no temporal smearing. The activity of the pulses never overlaps on any laterality line other than the one where the leading edges coincide and so there is no laterality spreading. The mechanism operates instantaneously on individual pulses and so there is no temporal integration whatsoever. With regard to interaural time differences (ITDs), the coincidence gate is a close relative of the traditional physiological coincidence model (Stern and Colburn, 1978). The primary difference is that AIM operates on NAP pulses rather than nerve impulses. The NAP pulse is assumed to represent the aggregate activity of all the primary fibers associated with the 0.9 mm of basilar membrane represented by one auditory filter. In the physiological model, individual primary fibers from the left and right cochleas converge on a single unit in the superior olivary complex associated with those two fibers and the current delay between the impulses.
In this case, the problem of pulse width and temporal smearing appears as variation in the unit which detects coincidence. In AIM, the draining of activity into the coincidence gate precludes coincidence on more than one laterality line.
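The timing figures quoted in this section can be checked with a few lines of arithmetic. The sketch below simply encodes the statements in the text (a NAP pulse lasts half of a half-cycle; cross-correlation output on the coincidence line lasts for the sum of the two pulse durations; the coincidence gate output lasts for the shorter pulse). The function names are illustrative and are not part of AIM.

```python
def nap_pulse_width_ms(frequency_hz):
    """NAP pulse width: half of the upper half-cycle, as described in the text."""
    period_ms = 1000.0 / frequency_hz
    return period_ms / 4.0

def gap_to_next_pulse_ms(frequency_hz):
    """Silent interval between the end of one NAP pulse and the start of the next."""
    return 1000.0 / frequency_hz - nap_pulse_width_ms(frequency_hz)

def cross_correlation_duration_ms(left_ms, right_ms):
    """Output lasts from leading edges meeting until trailing edges part."""
    return left_ms + right_ms

def coincidence_gate_duration_ms(left_ms, right_ms):
    """The gate closes when the shorter pulse has drained away."""
    return min(left_ms, right_ms)

width = nap_pulse_width_ms(500.0)          # the 500-Hz channel of the example
print(width, gap_to_next_pulse_ms(500.0))  # 0.5 1.5
print(cross_correlation_duration_ms(width, width))  # 1.0
print(coincidence_gate_duration_ms(width, width))   # 0.5
```

For the 500-Hz example, cross-correlation therefore doubles the duration of activity on the coincidence line, whereas the gate leaves it at the width of a single pulse.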
2.2 The Binaural NAP
The NAPs produced in the left and right cochleas are highly correlated on any measure, but they are, nevertheless, two separate representations of the acoustic information in the environment, and the best description of the information from a particular source is arguably a single NAP, or auditory image, produced from the information in the two NAPs. The logical place to combine the two monaural NAPs in AIM is at the output of the coincidence gate because the mechanism has already time shifted the individual NAP pulses by the appropriate amount in the course of determining coincidence. Thus, the coincidence gate is assumed to operate somewhat as illustrated in Figure 4. On the left, the five panels show how two pulses from the left and right NAPs coincide at -30 degrees and flow into the coincidence gate over the course of 0.5 ms. On the right, the corresponding five panels show the gradual emergence of a single, binaural NAP pulse flowing from the coincidence gate into a vertical plane at right angles to the -30 degree line on the laterality plane. In the figure, the binaural NAP pulse is larger than either of the monaural NAP pulses indicating a form of summation rather than averaging because binaural sounds are louder than their monaural components. Thus in AIM, the coincidence unit is the site where the left and right NAPs are combined, and it is this concern for the combination of the left and right NAPs that distinguishes binaural AIM from traditional physiological models (Blauert, 1983; Stern and Colburn, 1978; Stern and Trahiotis, 1995). The coincidence gate is a relatively minor extension of the traditional model but it makes it possible to describe auditory processing from the initial binaural interaction to the level of the auditory image as a single chain of operations without branching or parallel routes for monaural and binaural information.
3. THE DATA-RATE PROBLEM AND AUDITORY SPECTROGRAMS
Although logically feasible, extraction of phonology directly from the auditory image is not yet available. The main problem is the data rate of the auditory model. To ensure that an auditory model is capable of representing all of the discriminations that humans hear, one must digitize the incoming wave with 16-bit accuracy and a sample rate no less than 32 kHz. The filterbank must have no fewer than 100 channels and so the total data rate is around 3.2 million 2-byte words per second! In contrast, existing recognition systems typically use some form of LPC or FFT preprocessor which segments the wave into frames about 15 ms in duration and converts the time waveform in that frame into a vector of values that specify the level of activity in a set of different frequency bands. A sequence of such frames is referred to as a spectrogram. Commercial recognizers use 10-20 channels in the analysis, and research systems use 20-50 channels. Thus, a fairly high fidelity commercial system, or a moderate fidelity research system might have 10-ms frames and 32 channels for a data rate of 3.2 kilobytes/sec -- three orders of magnitude lower than the data rate at the output of an auditory model! Thus, the output data rates of auditory models are likely to remain a problem for speech recognizers for the foreseeable future. There would appear to be at least three good reasons for recording and processing the temporal microstructure of basilar membrane motion: binaural source segregation, monaural figure/ground separation, and robust extraction of source features. The primary research questions are the form of auditory model to be used as the speech preprocessor, and the method of reducing the data rate to the 10 kbps range. Binaural processing is currently better understood (Bodden, 1993) than figure/ground separation or feature extraction, and so binaural processing is the focus of this paper.
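The data-rate comparison above is simple arithmetic and can be reproduced as follows. The helper names are illustrative; the numbers are the ones quoted in the text (a 32-kHz sample rate with 100 channels for the auditory model, and 10-ms frames with 32 channels for the spectrogram).

```python
def filterbank_values_per_second(sample_rate_hz, n_channels):
    """One output value per channel for every input sample."""
    return sample_rate_hz * n_channels

def spectrogram_values_per_second(frame_ms, n_channels):
    """One vector of n_channels values for every frame."""
    return (1000 // frame_ms) * n_channels

auditory = filterbank_values_per_second(32_000, 100)
spectrogram = spectrogram_values_per_second(10, 32)
ratio = auditory // spectrogram

print(auditory)     # 3200000 values per second
print(spectrogram)  # 3200 values per second
print(ratio)        # 1000, i.e. three orders of magnitude
```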
We produce a 'binaural NAP' and then reduce the data rate after the binaural processing is complete by averaging the binaural NAP over time with a leaky integrator in the usual way. The result is a 'binaural auditory spectrogram' and its data rate is the same as that of the traditional auditory spectrogram, so it is suitable as input for current recognition systems. Models of binaural processing typically involve correlating the outputs of auditory filters in the left and right cochleas that have the same center frequency to determine whether the activity on one side is a delayed and scaled version of the activity on the other side. The time and level differences are used to estimate the direction of the source. The correlation is performed on the temporal fine-structure because the time delays are on the order of tenths of milliseconds; correlation of time-averaged summaries like LPC coefficients would not reveal such small delays. The vector of direction values can then be used to group channels with a common direction, and to exclude channels from other directions, or channels which do not show strong directionality, as in the case of many noise sources. The assignment of channels to directions occurs on a moment to moment basis. In this way, a binaural model can combine left and right NAPs of noisy stereo speech into a binaural spectrogram with a better signal-to-noise ratio than would be obtained from the left or right NAP on its own. Our strategy for incorporating binaural processing into speech recognition has three stages: 1) Demonstrate that a monaural auditory spectrogram (Patterson, 1994a) will support good phoneme recognition when interfaced to a well known recognition system. 2) Assemble a binaural AIM using a pair of cochlea simulations and a conventional binaural processor, and demonstrate that a binaural auditory spectrogram can enhance the recognition of speech presented in noise when they come from different directions. 
This is the topic of Sections 4 and 5. 3) Compare correlation methods of binaural processing with the coincidence gate, which is beyond the scope of the current paper.
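The reduction from a NAP to a spectrogram by leaky integration and downsampling can be sketched as below. This is a minimal illustration, not the AIM code: the first-order integrator, the 8-ms time constant and the 10-kHz sample rate are assumptions chosen for the example, and the NAP is represented as a plain list of channels.

```python
import math

def leaky_integrate(channel, sample_rate_hz, time_constant_ms):
    """First-order leaky integrator: y[n] = a*y[n-1] + (1-a)*x[n]."""
    a = math.exp(-1000.0 / (sample_rate_hz * time_constant_ms))
    y, out = 0.0, []
    for x in channel:
        y = a * y + (1.0 - a) * x
        out.append(y)
    return out

def nap_to_spectrogram(nap, sample_rate_hz, frame_ms, time_constant_ms):
    """Smooth every channel, then keep one value per channel per frame.

    nap is a list of channels, each a list of samples.
    Returns a list of frames, each a list with one value per channel.
    """
    step = int(sample_rate_hz * frame_ms / 1000)
    smoothed = [leaky_integrate(ch, sample_rate_hz, time_constant_ms) for ch in nap]
    n_frames = len(nap[0]) // step
    return [[ch[(f + 1) * step - 1] for ch in smoothed] for f in range(n_frames)]

# Two hypothetical channels, 100 ms long at 10 kHz: one active, one silent.
toy_nap = [[1.0] * 1000, [0.0] * 1000]
frames = nap_to_spectrogram(toy_nap, sample_rate_hz=10_000, frame_ms=5, time_constant_ms=8.0)

print(len(frames), len(frames[0]))  # 20 frames of 2 channels
```

The integrator discards the fine structure within each frame, which is why any binaural comparison has to be made before this step.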
4 THE RECOGNITION SYSTEM AND THE DATA RATE
Kohonen self-organising feature maps were used to perform phoneme recognition on monaural and binaural spectrograms produced by AIM (Francis and Anderson, 1997). The results of this 'AIM/Kohonen' recognizer were then compared with those from a 'cocktail-party/Kohonen' recogniser assembled by Bodden and Anderson (1995) and an 'MFCC/Kohonen' recognizer. The input spectrograms have the same number of channels in each case and the same frame width; they cover the same frequency range and they all employ the same phoneme recogniser.
The Preprocessors: In the case of the traditional recogniser, a vector of 20 Mel-Frequency Cepstral Coefficients (MFCC) spanning the frequency range 430 to 6641 Hz was calculated at 5-ms intervals for each sentence of speech in the data base. The sequence of vectors forms a spectrogram for the sentence and these MFCC spectrograms are the input to the traditional MFCC/Kohonen recogniser. Their data rate is 20/0.005, or 4000 Bytes/s. In the case of the cocktail-party processor (CPP), a 20-channel cochlea simulation spanning the frequency range 430 to 6641 Hz was applied to each sentence of speech, and the output waves were time averaged with a lowpass filter whose impulse response had a 16-ms, equivalent rectangular duration. The 20-channel smoothed output was then downsampled at 5-ms intervals to produce a spectrographic summary with the same frequency range and frame rate as the MFCC spectrogram. In the case of AIM, the cochlea simulation had 40 channels to improve the performance of the neural transduction module, and after lowpass filtering and downsampling, the frame size was reduced to 20 by averaging adjacent values in each frame. Thus the output of both auditory models is a summary of the speech sound with the same data rate and spectrographic format as that of the traditional MFCC preprocessor.
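The reduction of a 40-channel AIM frame to 20 values, and the resulting data rate, can be illustrated as follows. The text says only that adjacent values were averaged; treating this as averaging non-overlapping pairs is an assumption of this sketch, and the function names are illustrative.

```python
def average_adjacent(frame):
    """Halve the channel count by averaging non-overlapping neighbouring pairs."""
    return [(frame[i] + frame[i + 1]) / 2.0 for i in range(0, len(frame) - 1, 2)]

def values_per_second(n_channels, frame_interval_ms):
    """Spectrogram data rate in values per second."""
    return n_channels * 1000 // frame_interval_ms

frame_40 = [float(i) for i in range(40)]   # hypothetical 40-channel frame
frame_20 = average_adjacent(frame_40)

print(len(frame_20), frame_20[0], frame_20[-1])  # 20 0.5 38.5
print(values_per_second(20, 5))                   # 4000
```

With one byte per value, 20 values every 5 ms corresponds to the 4000 Bytes/s quoted for the MFCC spectrogram.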
For the binaural models, separate spectrograms were calculated for left and right inputs and then these spectrograms are combined via one of the binaural processors. The speech data used to train and test the Kohonen phoneme recognizer comprised all 10 sentences for 10 speakers of the TIMIT data base. The recogniser was trained on the data of nine of the speakers and tested on the tenth. Then, the train-and-test procedure was repeated using each of the different speakers in turn as the test speaker (Francis and Anderson, 1997). During the development of the SPHINX recognition system (Lee, 1989), the TIMIT phoneme labels were slightly modified. This modified convention was adopted for the present research in order to provide a better means of comparing results with other established systems. This convention yields 39 phones in separate categories. The data for the binaural recognition tests were generated by combining the TIMIT speech from a source positioned directly ahead with a noise source positioned about 30 degrees to the right. The signal-to-noise ratio of the speech was the primary independent variable and it was varied from +21 to -21 dB. Monaural recordings were also made to measure the performance of monaural versions of the two auditory models for comparison with their binaural performance. The Recognizer: Kohonen networks were selected because they have the ability to learn the mapping of an input data space into a pattern space that defines discrimination, or decision, surfaces. This process has been used for phonetic recognition of Finnish and Japanese (Kohonen, 1988). The operation of this network resembles the classical vector-quantization method called k-means clustering. Self-organising feature maps are more general because topologically close nodes are sensitive to inputs that are physically similar. Output nodes will be ordered in a natural manner. 
Another reason for using Kohonen self-organising feature maps is that they are efficient computationally and the results are representative of systems with much greater computational loads. Kohonen's algorithm adjusts weights from common input nodes to output nodes arranged in a two-dimensional grid. Each input node is connected to every output node. Real-valued input vectors are presented sequentially in time to the network without specifying the desired output. After enough input vectors have been presented, each node's weights will specify a cluster center. These cluster centres approximate the probability density function of the input vectors. The weight adjustment is based on a distortion measure. In this work, a mean squared error distortion measure was used, based on the input and stored weights. The code book size was 256, as is typical of recognition systems employing vector quantization. The calibration process was similar to that used by Kohonen; once trained, learning was turned off (the weights were fixed) and the training data were presented to the feature map a second time. The node that responded to each training token was associated with that token label; the token label with the largest number of responses was deemed the label for that node. Learning Vector Quantization (LVQ) was used on the calibrated code book to adjust the code words for improved performance (Kohonen, 1989). In this work LVQ3 was used. Anderson and Patterson (1994) compared a monaural AIM/Kohonen recogniser with an MFCC/Kohonen recognizer and showed that AIM performed significantly better than MFCC in terms of phoneme-recognition accuracy. The average, broad-class recognition rates were 76.1 and 71.3 percent correct for AIM and MFCC, respectively. This, then, was a demonstration that auditory models are at least as good as traditional spectrographic preprocessors despite the extensive efforts required to tune MFCC preprocessors to specific recognizers over the years.
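The training and calibration steps described above can be sketched on a toy problem. This is a heavily simplified illustration under stated assumptions, not the recogniser used in the paper: the map is 2 x 2 rather than a 256-entry code book, the weights are initialised from evenly spaced training vectors, the learning rate and neighbourhood shrink linearly, the data are two hand-made clusters, and the LVQ3 refinement is omitted.

```python
from collections import Counter

def squared_error(a, b):
    """Squared-error distortion between an input vector and a node's weights."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_node(weights, vec):
    """Index of the node whose weights are closest to the input vector."""
    return min(range(len(weights)), key=lambda j: squared_error(weights[j], vec))

def train_som(data, rows, cols, epochs=20, rate=0.5, radius=1.0):
    """Unsupervised training of a rows x cols map (assumed schedule, see lead-in)."""
    n_nodes, dim = rows * cols, len(data[0])
    weights = [list(data[j * len(data) // n_nodes]) for j in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # shrinks rate and neighbourhood
        for vec in data:
            wr, wc = divmod(best_node(weights, vec), cols)
            for j, w in enumerate(weights):
                r, c = divmod(j, cols)
                if abs(r - wr) + abs(c - wc) <= radius * frac:
                    for k in range(dim):
                        w[k] += rate * frac * (vec[k] - w[k])
    return weights

def calibrate(weights, data, labels):
    """Present the training data again and label each node by majority vote."""
    votes = {}
    for vec, label in zip(data, labels):
        votes.setdefault(best_node(weights, vec), Counter())[label] += 1
    return {node: count.most_common(1)[0][0] for node, count in votes.items()}

def classify(weights, node_labels, vec):
    """Label of the nearest node that received a label during calibration."""
    node = min(node_labels, key=lambda j: squared_error(weights[j], vec))
    return node_labels[node]

# Two hypothetical, well-separated 'phoneme' clusters in a 2-D feature space.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
        (1.0, 1.0), (0.9, 1.0), (1.0, 0.9), (0.9, 0.9)]
labels = ["a"] * 4 + ["b"] * 4

weights = train_som(data, rows=2, cols=2)
node_labels = calibrate(weights, data, labels)
print(classify(weights, node_labels, (0.05, 0.05)))
print(classify(weights, node_labels, (0.95, 0.95)))
```

Training never sees the labels; they enter only in the calibration pass, which mirrors the two-step procedure in the text.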
5 SPEECH RECOGNITION WITH BINAURAL SPECTROGRAMS
Binaural auditory spectrograms were produced using a pair of cochlea simulations and a binaural processor in two different ways (Francis and Anderson, 1997). One was based on NAPs from a pair of AIM cochleas with a cross-correlator as the binaural processor; it is referred to as 'binaural AIM'. The cross-correlation is applied to individual pairs of left and right NAP channels to produce a cross-correlogram with an explicit lag dimension. The vector of cross-correlation values associated with the direction of the signal is then selected and subjected to leaky integration and downsampling to produce a spectral frame of correlation values every 5 ms. The result is a binaural auditory spectrogram of smoothed cross-correlation values which is the input to the recognizer. Note that the cross-correlation values are not normalized so more intense components of the speech produce higher cross-correlation values. As a result, the binaural spectrogram has information about the relative levels of the speech components as well as the information about the correlation between the ears. The other binaural system was the well known 'cocktail party processor' (CPP) of Blauert (1983) updated by Bodden (1993). Both binaural systems were combined with the Kohonen recognizer to compare their binaural phoneme recognition performance. The Kohonen recognizer was also presented with monaural auditory spectrograms from both auditory models to quantify the advantage of binaural auditory spectrograms. Finally, the Kohonen recognizer was also presented with traditional MFCC spectrograms so that all four auditory preprocessors could be compared with traditional speech systems. The results are presented in Figure 5, where it can be seen that, throughout most of the range, recognition performance based on the binaural auditory spectrograms was significantly better than that based on either monaural auditory spectrograms or on MFCC spectrograms.
Broadly speaking, the advantage in terms of signal-to-noise ratio was about 10 dB! That is, for signal-to-noise ratios of 10 dB and worse, switching from a monaural to a binaural system enables operation in conditions where the signal-to-noise ratio is 10 dB worse, without a loss of recognition performance. So binaural auditory spectrograms are much more robust in noise than monaural auditory spectrograms or MFCC spectrograms. Looking in more detail, when the signal-to-noise ratio is high (20 dB or more), recognition performance is near ceiling for all five recognition systems. As the signal-to-noise ratio decreases from 20 to 0 dB, recognition performance remains near ceiling when the input is a binaural auditory spectrogram from either AIM or the CPP; it falls off significantly when the input is a monaural auditory spectrogram or an MFCC spectrogram. Note that the worst recognition performance is not with the MFCC spectrogram but with monaural input from the CPP; the performance with MFCC input is comparable to that of monaural AIM. As the signal to noise ratio decreases from 0 to -21 dB, performance decreases with all forms of spectrogram, but the binaural spectrograms maintain their large advantage throughout the range. Binaural AIM supports a small but significant performance advantage over the cocktail party processor for signal-to-noise ratios greater than -3 dB; the cocktail party processor has a small but significant advantage over binaural AIM for signal-to-noise ratios below -3 dB.
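The cross-correlation step used by binaural AIM, as described at the start of this section, can be sketched for toy NAP channels. This is an illustration under assumptions, not the code of Francis and Anderson (1997): the correlation is taken over the whole signal rather than through a leaky integrator, the lag sign convention (positive lag means the right channel is delayed) is chosen for the example, and the pulse trains are synthetic.

```python
def correlation_at_lag(left, right, lag):
    """Unnormalised cross-correlation of one channel pair at a single lag (in samples)."""
    n = len(left)
    return sum(left[i] * right[i + lag] for i in range(n) if 0 <= i + lag < n)

def cross_correlogram(left, right, max_lag):
    """Correlation values along the lag dimension for one channel pair."""
    return {lag: correlation_at_lag(left, right, lag)
            for lag in range(-max_lag, max_lag + 1)}

def direction_vector(left_nap, right_nap, lag):
    """Select the correlation value at the target lag in every channel."""
    return [correlation_at_lag(l, r, lag) for l, r in zip(left_nap, right_nap)]

# Hypothetical channel: one pulse every 20 samples; the right ear lags by 3 samples.
left = [1.0 if i % 20 == 0 else 0.0 for i in range(200)]
right = [0.0] * 3 + left[:-3]

correlogram = cross_correlogram(left, right, max_lag=8)
best_lag = max(correlogram, key=correlogram.get)
print(best_lag, correlogram[best_lag])  # 3 10.0

# Without normalisation, a channel at twice the level gives four times the value.
loud_left = [2.0 * x for x in left]
loud_right = [2.0 * x for x in right]
print(direction_vector([left, loud_left], [right, loud_right], lag=3))  # [10.0, 40.0]
```

The second print shows why the unnormalised values carry level information as well as interaural correlation, as noted in the text.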
Binaural auditory spectrograms were derived by cross-correlation from neural activity patterns that preserve temporal fine-structure, and tests with a Kohonen phoneme recognizer show that these binaural spectrograms improve the performance of speech recognition systems in noise. The advantage exists over the wide range of signal-to-noise ratios typically encountered in everyday life. In terms of the signal-to-noise ratio required for a given level of recognition performance, the average advantage of the binaural auditory spectrogram is greater than 10 dB when the signal-to-noise ratio is less than 10 dB.
The binaural advantage can be encoded at the output of the auditory preprocessor in a representation comparable to that of the traditional spectrogram; that is, a sequence of spectral vectors with the same low data rate as current MFCC spectrograms (about 3 kbytes/s). A comparison of cross-correlation with a new form of coincidence detection suggests that further improvements can be achieved by combining the left and right neural activity patterns with a coincidence gate that preserves temporal resolution better than cross-correlation.
Acknowledgements
This research was supported by the Air Force Office of Scientific Research (AFOSR) through its Window-on-Science Program, the UK Defence Evaluation and Research Agency, Farnborough (ASF/3208) and the U.K. Medical Research Council (G9900369).
Anderson, T.R. and Patterson, R.D. (1994). Speaker Recognition with the auditory image model and self-organising feature maps. In: Proc. ESCA meeting on Speaker Recognition, identification and verification. Martigny, Switzerland.
Blauert, J. (1983). Spatial Hearing. MIT Press, Cambridge, Massachusetts.
Bodden, M. (1993). Modelling human sound-source localization and the cocktail-party-effect. Acta Acustica, 1: 43-55.
Bodden, M. and Anderson, T.R. (1995). A binaural selectivity model for speech recognition. In: Proc. EuroSpeech95, Sept. Madrid, Spain.
Francis, K. and Anderson, T.R. (1997). Binaural phoneme recognition using the auditory image model and cross-correlation. Proc. ICASSP-97, April, Munich, Germany.
Kohonen, T. (1988). The neural phonetic typewriter. IEEE Computer, 21: 11-22.
Kohonen, T. (1989). Self-organisation and associative memory, 3rd ed. Springer-Verlag, Berlin.
Lee, K.F. (1989). Automatic speech recognition: the development of the SPHINX system. Kluwer Academic, Boston.
Lee, K.F. and Hon, H.W. (1989). Speaker independent phoneme recognition using Hidden Markov Models. IEEE Trans on ASSP 37: 1621-1648.
Patterson, R.D. (1994a). The sound of a sinusoid: Spectral models. J. Acoust. Soc. Am. 96: 1409-1418.
Patterson, R.D. (1994b). The sound of a sinusoid: Time-interval models. J. Acoust. Soc. Am. 96: 1419-1428.
Patterson, R.D., Allerhand, M., and Giguere, C. (1995). Time-domain modelling of peripheral auditory processing: A modular architecture and a software platform. J. Acoust. Soc. Am. 98: 1890-1894.
Stern, R. and Colburn, S. (1978). Theory of binaural interaction based on auditory nerve data. IV: A model for subjective lateral position. J. Acoust. Soc. Am. 64: 127-140.
Stern, R.M. and Trahiotis, C. (1995). Models of binaural interaction. In Hearing. B.C.J. Moore Ed. Academic, London. 207-242.
Figure 1. The space of auditory perception with point sources at -40 and +20 degrees.
Figure 2. Auditory images of the phonemes /ae/ and /s/ in the word 'past'.
Figure 3. Comparison of cross-correlation and coincidence gating as models of binaural processing.
Figure 4. Schematic of the combination of left and right NAP channels at the coincidence gate.
Figure 5. Comparison of the phoneme recognition performance (percent correct) of the traditional MFCC preprocessor with that of two monaural and two binaural auditory models. The binaural auditory models show a progressive advantage as signal-to-noise ratio decreases, reaching 10 dB and more when the signal-to-noise ratio is less than +10 dB.