The Size Information in Vowels
From CNBH Acoustic Scale Wiki
This chapter describes the effect of vocal-tract length on speech sounds and illustrates the concepts with the developmental data of Lee et al. (1999).
Roy Patterson, Jessica Monaghan, Tom Walters
A Latent Acoustic Scale Variable for Developmental Formant Data
In the formant pattern model of speech communication, it is assumed that children learn to control their speech sounds around age three and that, thereafter, the auditory part of their brain constantly monitors their speech sounds and adjusts the cavity sizes to produce the appropriate formant pattern for each vowel. On a logarithmic wavelength axis, the formant pattern remains fixed as the child grows, and it is only the position of the pattern that changes with vocal tract length (VTL). Moreover, the position shifts linearly with the logarithm of VTL as the child matures. There have been longitudinal studies of the development of children, but none of them has included recordings of the children's speech in conjunction with measurements of their vocal tract length. Indeed, there are no longitudinal population studies which include either recordings of children's speech or measurements of their vocal tract lengths. There are several reasons for this. Chief among them is the fact that it is very difficult to run a longitudinal study with a substantial number of individuals lasting five years, let alone the 15 years that would be required to track speech development from, say, age 3 to 18. If the population were represented by even 100 males and 100 females, the costs would be prohibitive; it is only medical trials that are perceived to warrant the necessary resources. Secondly, the labelling of speech data is labour intensive because it is difficult to measure formant frequencies accurately. There is an automatic technique for formant extraction involving linear predictive coding (LPC) analysis, but it is error prone (Lee et al., 1999; Turner et al., 2009). Thirdly, it is very difficult to measure vocal tract length accurately; the tube has a complex shape which is defined on one side by the tongue, which takes on many shapes. Moreover, the internal end of the tube is not visible under normal conditions.
Until recently, the techniques that did exist for observing the vocal tract involved significant doses of radiation which would have precluded longitudinal studies which require repeated measurements on a large population of naïve volunteers. So it is not surprising that there are no longitudinal population studies detailing variation in formant frequency as a function of vocal tract length. The question, then, is ‘What data are there?’ and ‘What can they tell us about variation in formant frequency as a function of vocal tract length?’
[Note that the figures in this document start with number 4. This will be sorted when time permits. RP]
Vocal tract length as a function of height
There is one developmental, albeit not longitudinal, study of vocal tract length, by Fitch and Giedd (1999), who used magnetic resonance imaging (MRI) to record the anatomy of the vocal tract. The study included 129 subjects (76 males and 53 females) varying in age from 2.8 to 25 years. Great care was taken to define a reliable means of estimating vocal tract length, and subjects who were overweight or whose families had a history of language or developmental disorders were excluded. Figure 4 shows the resulting VTL values plotted as a function of height, separately, for the males (o) and females (+). Within each group, there is a reasonable distribution of subjects across the height range; there are proportionately more men at the tallest heights because, on average, males eventually grow taller than females. The lines show the best fitting linear growth functions for the two sub-populations; the rate of growth is 0.0708 for the males and 0.0611 for the females, and the correlation coefficients are 0.8950 and 0.7755, respectively. The correlation coefficients leave no doubt that VTL increases with height for both males and females. At the same time, the variability of the data precludes more detailed analysis of questions such as whether the rate of growth of VTL is significantly greater for males, and whether the rate changes at puberty. Turner et al. (2009) used an expectation-maximization technique to determine whether a polynomial with a quadratic term would provide a substantially better fit to the data, taking into consideration the fact that it includes an extra free parameter; the analysis indicated that, given the variability of the data, a linear fit was the better choice.
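The comparison between a linear and a quadratic growth function can be sketched as follows. This is a minimal illustration, not the procedure of Turner et al. (2009): it swaps in ordinary least squares with the Bayesian information criterion (BIC) as the penalty for the extra free parameter. It runs on synthetic height and VTL values (assumed here to be in cm), since the Fitch and Giedd measurements are not reproduced in this chapter. The function name `fit_poly` and all of the numbers in the example are hypothetical.

```python
import numpy as np

def fit_poly(x, y, degree):
    """Least-squares polynomial fit; returns coefficients, Pearson r, and BIC."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    n = len(x)
    rss = float(np.sum(resid ** 2))
    k = degree + 1  # number of free polynomial coefficients
    # BIC for Gaussian residuals (constant terms dropped); lower is better.
    bic = n * np.log(rss / n) + k * np.log(n)
    r = float(np.corrcoef(x, y)[0, 1])
    return coeffs, r, bic

# Synthetic data for illustration only (NOT the Fitch and Giedd measurements).
rng = np.random.default_rng(0)
height_cm = rng.uniform(90, 190, size=76)
vtl_cm = 0.07 * height_cm + 3.0 + rng.normal(0, 0.8, size=76)

lin_coeffs, r, bic_lin = fit_poly(height_cm, vtl_cm, 1)
quad_coeffs, _, bic_quad = fit_poly(height_cm, vtl_cm, 2)

print(f"linear gradient = {lin_coeffs[0]:.4f}, r = {r:.3f}")
print(f"BIC linear = {bic_lin:.2f}, BIC quadratic = {bic_quad:.2f}")
```

The quadratic always reduces the residual sum of squares, so it is preferred only when that reduction outweighs the penalty of one extra parameter; with scattered data, the penalty will often favour the line.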
Fitch and Giedd (1999) included a somewhat controversial questionnaire in their study, designed to ascertain whether their males had passed the age of puberty, and based on the results they concluded that the rate of growth of the vocal tract does increase at puberty in males. However, in the absence of data on the sexual maturity of individuals, the population statistics do not distinguish males and females with regard to vocal tract length, except insofar as males eventually grow taller than females, on average, and hence, on average, adult males have longer vocal tracts than adult females.
More important for present purposes is the observation that, although the study is large by speech research standards and the subjects are distributed reasonably evenly throughout the age range, 129 subjects is not a large sample when it comes to defining population growth functions. The functions are sufficiently well defined to show conclusively that VTL increases with height over a large range of heights. But the data are rather more scattered than one would like for defining vocal tract length as a function of height in the population; specifically, the functions are not sufficiently well defined to answer questions about sex differences in subsections of the functions. There are approximately 60 subjects of each sex in the age range 3 – 23 years, which means that there are only about three subjects of each sex available to define each year on the function. The number would have to be closer to ten subjects per year, for each sex group, to provide reasonable estimates of the mean and variance of the growth function along its length; this would require brain imaging and extensive data analysis for 10 x 20 x 2 = 400 subjects, which is prohibitive.
Consider the population studies that the Centers for Disease Control and Prevention (CDC) uses to specify height as a function of age in the American population. The standard was updated in 2000 and it is based on measurements of 400-500 males and females for each half-year segment on the chart, about 40,000 people in total (2000 CDC Growth Charts for the United States: Methods and Development, Series Report 11, No. 246, 201 pp.). The latest growth functions are presented in Figure 5. These growth functions are sufficiently well specified to reveal the growth spurt that occurs at about the time of puberty, and the fact that the growth spurt begins and ends somewhat earlier for girls than for boys. Interestingly, they also show equally rapid growth in both boys and girls between the ages of 3 and 6 – a fact that is less well known. What the study also shows, however, is that the standard deviation for height as a function of age is quite large, which means that even with a mammoth subject pool, it is not easy to make detailed comparisons of the rate of growth of children within limited age bands.
In summary, data concerning the growth of the vocal tract with height are sufficient to confidently demonstrate that vocal tract length increases with height both for males and for females, and to provide a reasonably accurate estimate of the slope of the growth function. At the same time, the data are not sufficient to specify how the growth functions of the two sexes deviate from the line. There are not enough individuals in this study to define the asymptote of VTL with age for males and females, let alone show that the asymptote occurs at an earlier age for females, as we assume it must. Nor are there sufficient individuals to reveal whether VTL undergoes a growth spurt when height does, as we assume it must, let alone whether the spurt in VTL occurs earlier for females.
The MRI study of Fitch and Giedd (1999) is the only population study of vocal tract length; that is, the only study with a direct measure of vocal tract length and a substantial number of participants representing the full range of vocal tract development. It is important to note, however, that the subjects were lying down in the MRI scanner in “the nasal breathing position.” They did not vocalize in the scanner, and Fitch and Giedd (1999) did not record the speech of their participants outside the scanner. They did not have the resources to analyze the sounds at that time. So, there are no developmental population studies in the literature that combine a direct measurement of vocal tract length with an analysis of the speech sounds from the same individuals.
Formant frequency as a function of age
It is also the case that there are no developmental studies of speech sounds which report the heights of the speakers. It is understandable that they do not report VTL since it is so difficult to measure, but it seems surprising that speech scientists would not gather height data as a readily available measure that is highly correlated with VTL. The vocal tract is actually a combination of the nasal, oral and pharyngeal tubes that connect the openings of the nose and mouth to the trachea and the oesophagus; as a result, vocal tract length might have been anticipated to be a simple function of body height, as it has now been shown to be in the data of Fitch and Giedd (1999). However, the scientists who developed the different models of speech production are not the same scientists as those who gathered the developmental data on speech sounds. Vorperian and Kent (2007) recently surveyed the field and reported 22 developmental studies from the past 55 years, beginning with the seminal early study by Peterson and Barney (1952). They describe the situation as follows: “Although chronological age is not necessarily the preferred independent variable in studies of development or maturation, it is the most frequently reported subject descriptor across studies and, in fact, is typically the only reported index (Kent and Vorperian, 1995).”
One of the studies in the Vorperian and Kent review stands out as by far the best candidate for data on how the formant frequencies (and the pitch) of sustained vowel sounds vary with speaker age, and that is the study of Lee, Potamianos and Narayanan (1999). They recorded the speech of 436 children, ages 5-18, and 56 adults. They have ten or more male and ten or more female participants for every year, with one exception: there are only nine females for age 17. Moreover, they have larger numbers of participants for the younger years where the data are more variable. This is by far the largest sample size reported by Vorperian and Kent (2007). Only two of the remaining 20 studies had more than 100 participants. One, by Hodge (1989), has 115 participants, but it is focused on development prior to age five. The other is the well known study of Hillenbrand et al. (1995), but it is not really a developmental study; the children are all 10-12 years of age and the rest of the participants are adults. Accordingly, the focus of this section is the data gathered by Lee et al. (1999).
Lee, Potamianos and Narayanan (1999) recorded two instances of 10 vowels in a neutral context, along with several sentences. They extracted the first three formant frequencies using LPC techniques and they argue that the main developmental features of the data are revealed by the average formant data, that is, the values of the first, second and third formants averaged over vowels separately for each age group. These average formant values are plotted as a function of age, separately for males (circles) and females (crosses) in Figure 6, along with the average pitch data (again averaged over vowels). The data are observed to be very orderly. Firstly, in each case, formant frequency decreases with age for both males and females throughout the age range; similarly, the pitch of the vowels decreases with age for both males and females throughout the age range. The functions are well approximated by lines over this age range, as shown in the figure. The slopes of the lines are presented in Table I along with the correlation coefficients, which are all highly significant. Secondly, the formant frequencies of males and females, and the pitches of the males and females, are essentially the same for the youngest children (age 5-8), but the rate of decrease of both formant frequency and pitch is slightly greater for the males than the females in all cases, leading to a substantial difference in both formant frequencies and pitch between adult males and adult females, on average. Thirdly, the slopes of the lines become more negative as average frequency increases; that is, the slope for the first formant is steeper than that for the pitch of the voice, the slope for the second formant is steeper than that for the first formant, and the slope for the third formant is steeper than that for the second formant. Moreover, this is true for both the males and the females.
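Measure | frequency vs. age, male: gradient | frequency vs. age, male: r^{2} | frequency vs. age, female: gradient | frequency vs. age, female: r^{2} | frequency vs. height, male: gradient | frequency vs. height, male: r^{2} | frequency vs. height, female: gradient | frequency vs. height, female: r^{2} | log wavelength vs. log height, male: gradient | log wavelength vs. log height, male: r^{2} | log wavelength vs. log height, female: gradient | log wavelength vs. log height, female: r^{2} | log wavelength vs. log height, boys: gradient | log wavelength vs. log height, boys: r^{2} | log wavelength vs. log height, men: gradient | log wavelength vs. log height, men: r^{2} |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|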
F0 | -12.7036 | 0.8805 | -4.0250 | 0.7662 | -2.4338 | 0.8953 | -0.9462 | 0.8021 | 1.8104 | 0.8243 | 0.5191 | 0.7641 | 0.3959 | 0.7153 | 3.5717 | 0.9279 |
F1 | -18.5659 | 0.9506 | -10.9671 | 0.8467 | -3.5617 | 0.9692 | -2.6186 | 0.9146 | 0.8059 | 0.9403 | 0.4989 | 0.9061 | 0.5160 | 0.8954 | 1.1385 | 0.9199 |
F2 | -46.9976 | 0.9571 | -27.4544 | 0.9394 | -8.9025 | 0.9514 | -6.2655 | 0.9270 | 0.7127 | 0.9115 | 0.4320 | 0.9007 | 0.3916 | 0.8722 | 1.0360 | 0.8590 |
F3 | -71.1770 | 0.9541 | -43.7690 | 0.8888 | -13.5537 | 0.9585 | -10.0183 | 0.8822 | 0.6618 | 0.9233 | 0.4353 | 0.8425 | 0.3642 | 0.8853 | 0.9232 | 0.8901 |
Table I: Gradients and correlation coefficients
With regard to the formant frequencies, all three of these main effects can be explained by the three-tube model of vowel production, or any other competent model of vowel production. The first effect, the decrease in formant frequency with age, follows from the growth of the vocal tract, since formant frequency decreases as VTL increases. The second effect – the fact that the slope of the formant-frequency versus age function is more negative for males than for females – could be explained by the fact that males eventually grow taller than females. The effect is, of course, somewhat more complicated, as will be discussed below. The third effect – the slope becomes steeper as formant number increases – can be explained by the fact that formant number is related to formant frequency: if growth scales all of the formants by the same proportion, the absolute change in frequency is larger for the formants with higher frequencies. The conclusion is that all of these effects may derive from one latent size variable; that is, the effects are all directly related to the factor that causes children to grow into adults.
The question is the degree to which this one variable predicts the variability of formant frequencies in the population (for a given vowel), and what other factors might be required to explain consistent deviations from the predictions of a single latent size variable. To begin with, note that the deviations from the regression lines do not appear to be entirely random. The values at ages 9 and 10 tend to be above the regression line, while the values at ages 15 and 16 tend to be a little below the line – effects which are more visible for the higher frequencies where the symbols are a smaller proportion of the frequency value. Although these deviations are small, they may well indicate that there is another factor involved in determining formant frequency as a function of age. Candidates include the change in the oral-pharyngeal length ratio as a function of age, and the secondary descent of the larynx at puberty. Before looking for extra factors, however, it is important to remember that vocal tract length is not a strict function of age; it is more closely related to height, and height is probably a better indicator of the latent size variable than is age. For example, humans reach their adult height at about 15 years of age (Figure 5), and so in the formant-frequency functions of Figure 6, there is no further decrease in formant frequency with age above age 15. Indeed, the data points plotted at age 19 are actually for groups of adults; if their data were plotted at their true ages, these data would produce a flat extension to the functions, corresponding to the asymptotes in the height versus age functions of Figure 5. Similarly, the fact that formant-frequency data tend to be above the regression lines at ages 9 and 10 reflects the fact that the children have not yet begun the growth spurt that takes the functions down more sharply between ages 11 and 15.
It is observations like these which suggest that height would be a better independent variable than age, if the data are to be used to test models of vowel production where many of the variables are proportionately related to VTL. So the question we want to answer is the degree to which height, as a substitute for VTL which would be the best representative of the latent scale variable, predicts the variability of formant frequencies in the population (for a given vowel), and what other factors might be required to explain consistent deviations from predictions based on height.
Log-wavelength as a function of log-height
The fact that the age values reported by Lee et al. (1999) are accurate to within a year, and the fact that they have data for every year in the range 5-18, means that we can convert the data from their age base to a height base with reasonable confidence, using the CDC growth functions presented in Figure 5. (This would be the place to point out that the correlation coefficient tends to increase when you switch to height. But we would need to address the issue of whether it is a significant improvement in the fit.) Since we want to use the data to test models of vowel production, like the three-tube model presented above, we will, at the same time, convert from formant frequency and pitch frequency to formant wavelength and pitch wavelength (i.e., the wavelength corresponding to the pitch period). Finally, in a simple model of vowel production, the increase in slope with formant number observed in the formant frequency data is directly related to VTL. If this is the case in formant production, the slopes of the regression functions should be the same when the data are plotted as log-wavelength versus log-height (or log-VTL). To test this hypothesis, the data were converted to their logarithmic equivalents and regression lines were fitted to each subset of the data in this log-wavelength versus log-height space.
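The reasoning behind the log-log prediction can be illustrated with a deliberately simple stand-in for the three-tube model: a uniform tube closed at one end, whose resonances are f_n = (2n - 1)c / (4L). This is a sketch under stated assumptions, not the model used in this chapter. The speed of sound (350 m/s), the range of tube lengths, and the function name `tube_formants` are all assumed for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 350.0  # m/s; assumed value, not taken from the chapter

def tube_formants(vtl_m, n_formants=3):
    """Resonances of a uniform tube closed at one end: f_n = (2n - 1) c / (4 L)."""
    n = np.arange(1, n_formants + 1)
    return (2 * n - 1) * SPEED_OF_SOUND / (4.0 * vtl_m)

# Hypothetical range of vocal tract lengths, in metres.
vtl = np.linspace(0.10, 0.17, 8)
freqs = np.array([tube_formants(length) for length in vtl])  # shape (8, 3)
wavelengths = SPEED_OF_SOUND / freqs                          # lambda = c / f

# Gradient of frequency vs. VTL: differs from formant to formant.
freq_slopes = [np.polyfit(vtl, freqs[:, i], 1)[0] for i in range(3)]

# Gradient of log-wavelength vs. log-VTL: the same for every formant.
loglog_slopes = [
    np.polyfit(np.log(vtl), np.log(wavelengths[:, i]), 1)[0] for i in range(3)
]

print("frequency gradients:", np.round(freq_slopes, 1))
print("log-log gradients:  ", np.round(loglog_slopes, 3))
```

In this idealised tube, each wavelength is a fixed multiple of the tube length, so the log-log gradients are all equal to one, whereas the frequency gradients grow in proportion to formant frequency. The measured data are plotted against height rather than VTL, so the measured gradients need not equal one; the prediction being tested is only that they should be the same across formants.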
The log-wavelength values for the three formants (averaged over vowel) are plotted as a function of log-height, separately for males (circles) and females (crosses) in Figure 7, along with the pitch wavelengths (averaged over vowel). The conversion from age to height causes an expansion of the spacing of the data points on the height dimension for points in the younger age range, and a bunching together of the data points at the other end of the height dimension for the older ages; it also means that the functions for the men extend further to the right since men eventually grow taller than women, on average. Note that conversion from frequency to wavelength inverts the vertical order of the functions in the figure, so the data for the third formant appear at the bottom of the figure (still represented by red symbols and lines) and the pitch data appear at the top of the figure (still represented by black symbols and lines).
The formant wavelength data are very orderly, as they were before; formant wavelength increases with height in all cases and each subset of the data is well approximated by a straight line, but now, the slope values for the three formants are much more similar, both for the females and the males. This indicates that the variable that causes formant wavelength to increase with height is probably the same for all three formants, as it would be in a simple production system where the growth in formant wavelength is proportional to growth of VTL for all three formants. It is also the case that the gradients for the formant wavelengths of the males are all slightly steeper than the corresponding gradients for the females, just as the gradients of the formant frequency functions were slightly steeper (more negative) for males than for the females. These two relationships are examined in more detail below where the data are normalized to facilitate a more detailed comparison.
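The convergence of the slopes can be checked directly from the gradients in Table I. The short sketch below copies the F1 to F3 gradients from the table and compares the ratio of the largest to the smallest gradient magnitude in the two representations; the helper name `spread` is introduced here for illustration only.

```python
# Gradients for F1, F2, F3 copied from Table I.
freq_vs_age = {
    "male": [-18.5659, -46.9976, -71.1770],
    "female": [-10.9671, -27.4544, -43.7690],
}
log_wl_vs_log_height = {
    "male": [0.8059, 0.7127, 0.6618],
    "female": [0.4989, 0.4320, 0.4353],
}

def spread(gradients):
    """Ratio of the largest to the smallest gradient magnitude."""
    magnitudes = [abs(g) for g in gradients]
    return max(magnitudes) / min(magnitudes)

for sex in ("male", "female"):
    print(
        sex,
        "frequency vs. age spread:", round(spread(freq_vs_age[sex]), 2),
        "log-log spread:", round(spread(log_wl_vs_log_height[sex]), 2),
    )
```

In the frequency versus age representation the gradients differ by a factor of nearly four across the three formants, whereas in the log-wavelength versus log-height representation they differ by less than a quarter.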
The pitch wavelength data (black lines in the top part of Figure 7) present a different picture as might be expected given the adolescent voice change in males. A straight line still provides a good fit to the pitch data of the females, throughout the entire height range, but it provides a poor fit to the pitch data for males. The pitch wavelength for males is essentially the same as for females until age 12 after which the wavelength for males grows at a much greater rate for a couple of years, ending at much greater values for men than for women. The adolescent pitch change in males is well documented by, for example, Hollien, Green and Massey (1994). It is obscured in the pitch frequency data of Figure 6 because it occurs at the low end of the frequency scale where the values are compressed.
Graphical representations that show the full range of formant frequencies, like Figure 6, or the full range of formant wavelengths, like Figure 7, have the advantage of revealing whether there is a relationship between local frequency effects, like the reduction of one particular formant frequency with increasing age, and global frequency effects, like the increase in the gradient of the frequency versus age function as formant number increases. This is important for the current discussion inasmuch as it shows that the two effects are related and might be determined by the same underlying, or latent, variable, which in this case is VTL, or the more general growth variable that causes humans to grow with age. However, these full range graphical representations have the disadvantage that they compress local effects. So the adolescent pitch change in males is obscured by compression in Figure 6, and the divergence of the male third formant values from those of the females is obscured at the bottom of Figure 7.
The wavelength data were normalized to facilitate comparison of the local frequency effects associated with individual formant growth functions, and to examine how the data of the males and females diverge as height increases. The data of the males and females are observed to be very similar for each condition in Figure 7, up to age 11 or 12 where the growth spurts begin. This is true for all three formants and for voice pitch. It is also the case that the data for the females are well fitted by a line on these log-wavelength, log-height coordinates throughout the entire range of their heights; that is, the data from during and after the growth spurt at puberty fall on the same lines as the data from before the growth spurt. Accordingly, we normalized the functions to the average of the log-wavelength values for females, for the points corresponding to ages 5-12. Graphically, this corresponds to separately shifting all of the data for each formant condition, and for the pitch condition, as a group, in the vertical direction, so that the mean of the values for females between 5 and 12 moves to the zero point of the ordinate.
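The normalization amounts to subtracting one constant per condition from both the female and the male data. A minimal sketch is given below; it assumes the male and female values are indexed by the same list of ages, and the function name and the example values are hypothetical rather than taken from the data of Lee et al. (1999).

```python
import numpy as np

def normalise_to_young_females(ages, log_wl_female, log_wl_male, lo=5, hi=12):
    """Shift one condition (a formant, or pitch) so that the mean female
    log-wavelength over ages lo..hi sits at zero. The same offset is applied
    to the male data, so male-female differences and slopes are unchanged."""
    ages = np.asarray(ages)
    female = np.asarray(log_wl_female, dtype=float)
    male = np.asarray(log_wl_male, dtype=float)
    in_reference = (ages >= lo) & (ages <= hi)
    offset = female[in_reference].mean()
    return female - offset, male - offset

# Hypothetical log-wavelength values for one condition, ages 5 to 18.
ages = np.arange(5, 19)
log_wl_f = np.linspace(-0.60, -0.40, ages.size)
log_wl_m = np.linspace(-0.60, -0.30, ages.size)

norm_f, norm_m = normalise_to_young_females(ages, log_wl_f, log_wl_m)
print("mean of normalised female values, ages 5-12:", norm_f[ages <= 12].mean())
```

Because the operation is a pure vertical shift applied per condition, it brings the functions for the different formants onto a common ordinate without altering any gradient.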
The result of the normalization is presented in Figure 8. The normalization immediately reveals that the effect of growth for females is the same for all three formants; the three formant wavelength functions are virtually coincident. Moreover, for females, the growth function for pitch wavelength has the same slope as for the formants. The data for the males have been normalized by the data for the females in the corresponding condition, so that the normalization process preserves female-male differences, and it does not affect the slope values. The figure shows that the growth functions for the formants of the males have steeper slopes than those for the females, and the difference increases a little as formant number decreases; that is, the difference is greatest for the first formant, which has the longest wavelength. The abscissa in this representation of the data is height rather than vocal tract length, and the anatomical data of Fitch and Giedd (1999) (Figure 4) show that the VTL for males is slightly greater than for females of the same height, at least for people taller than about 1.5 m. So, at least part of the difference between males and females in the wavelength versus height data is probably due to the difference in VTL between males and females of the same height. Accordingly, the VTL versus height data of Fitch and Giedd (1999) were used to convert the wavelength versus height representation to a wavelength versus VTL representation.
At the same time, the data for the three formants have been averaged, separately for the males and females, to focus the remaining discussion on the relative magnitude of the sex differences for pitch wavelength as opposed to formant wavelength. The results are presented in Figure 9. The conversion from height to VTL does reduce the magnitude of the difference in the slopes of the wavelength growth functions of females and males, as expected. It does not completely eliminate the difference, but it is reduced to the point where there is little difference left between the wavelengths of males and females with the same VTL over the full range of female heights. The formant wavelengths for males continue to increase as the height (and the VTL) of the adult male increases above that of the adult female, and so the formant wavelengths for adult males are, on average, greater than those for females. But within the range where they have comparable VTLs, there is little difference in the formant frequencies of males and females.
Log-wavelength as a function of log-VTL
There remains a huge difference between the pitch wavelengths of adult males and females; whereas the normalization removes most of the differences between the formant wavelengths of men and women, it accentuates the difference between the pitch growth functions in the region where height is greater than about 1.5 m, and as a result, as noted above, a line does not provide a reasonable summary of the pitch growth function for the sub-population of men. A much better fit is provided by fitting one line to the data of boys up to age 12, and a separate line to the data of teenage boys and men with ages greater than 12 (who will be referred to collectively as adult males). The results are presented in Figure 10 by two dashed red lines. The pitch function for boys is seen to be slightly 'shallower' than for females as a group; the pitch function for adult males is much steeper than for boys or for females as a group. This, then, is a graphic illustration of the adolescent voice change as boys become men as it affects pitch, and a graphic illustration of the fact that there is no comparable effect whatsoever in women. Pitch wavelength does increase a little beyond puberty for women but only in proportion to their continuing increase in height.
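The two-line fit can be sketched as a pair of independent least-squares regressions, one on the points for ages up to 12 and one on the points for ages above 12. The function name `two_line_fit` and the synthetic values below are hypothetical and serve only to show the mechanics; they are not the pitch data of Lee et al. (1999).

```python
import numpy as np

def two_line_fit(ages, x, y, split_age=12):
    """Fit separate lines to the points with age <= split_age ("boys") and
    age > split_age ("adult males"). Each result is (gradient, intercept)."""
    ages = np.asarray(ages)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    young = ages <= split_age
    boys = np.polyfit(x[young], y[young], 1)
    men = np.polyfit(x[~young], y[~young], 1)
    return boys, men

# Synthetic example: a shallow segment followed by a steep one.
ages = np.arange(5, 19)
log_x = np.linspace(4.7, 5.2, ages.size)  # hypothetical log-height or log-VTL
knee = log_x[ages == 12][0]
log_wl = np.where(ages <= 12, 0.4 * (log_x - knee), 3.5 * (log_x - knee))

boys, men = two_line_fit(ages, log_x, log_wl)
print("gradient for boys:", round(boys[0], 3))
print("gradient for adult males:", round(men[0], 3))
```

Fixing the break at age 12 follows the description in the text; the two segments are fitted independently, so they are not constrained to meet at the break point.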
The adolescent pitch change in males is often associated with the development of the Adam's apple in adult males and the second descent of the larynx. The description of this phenomenon gives the impression that the vocal tract becomes longer in males than would be expected simply from their increase in height. The data of Fitch and Giedd (1999) on the growth of VTL as a function of height do not show a prominent increase in the rate of growth of VTL beyond age twelve in males, and the formant wavelength functions do not exhibit a pronounced increase in the rate of growth of formant wavelength beyond age twelve in males. Nevertheless, to ensure that the linear formant growth functions presented in Figures 8 and 9 are not obscuring an important change in growth rate, the single regression line was replaced with a pair of regression lines, and as with the pitch wavelengths, one line was fitted to the data of boys up to age 12, and the other was fitted to the data of teenage boys and men with ages greater than 12. The results are presented in Figure 10 by …. There is a reduction in the gradient of the average formant wavelength function for boys, and as in the case of pitch wavelength, the gradient becomes slightly shallower than for females. It is also the case that the gradient of the second of the fitted lines is greater than that of the first, indicating that the gradient of the formant wavelength function increases as boys become adults. However, the effect is observed to be quite small in comparison to the change in pitch wavelength as boys become adults. This suggests that the second descent of the larynx is a relatively small effect, and that the growth of the Adam's apple has more to do with accommodating an increase in the mass or length of the vocal cords than with increasing the length of the vocal tract.
Conclusions
The results suggest that, once care is taken to compensate for the relatively large errors inherent in formant-frequency estimation, the formant pattern used to communicate between speaker and listener is essentially the same throughout life. It appears that the non-uniform growth of the oral and pharyngeal cavities of the vocal tract, which is clearly documented in the imaging data of Fitch and Giedd (1999), does not interfere with the maintenance of the formant patterns that define vowel type during maturation.
References
- Fitch, W.T. and Giedd, J. (1999). “Morphology and development of the human vocal tract: A study using magnetic resonance imaging.” J. Acoust. Soc. Am., 106, p.1511-1522.
- Hillenbrand, J., Getty, L.A., Clark, M.J. and Wheeler, K. (1995). “Acoustic characteristics of American English vowels.” J. Acoust. Soc. Am., 97, p.3099-3111.
- Lee, S., Potamianos, A. and Narayanan, S. (1999). “Acoustics of children's speech: developmental changes of temporal and spectral parameters.” J. Acoust. Soc. Am., 105, p.1455-1468.
- Peterson, G.E. and Barney, H.L. (1952). “Control Methods Used in a Study of the Vowels.” J. Acoust. Soc. Am., 24, p.175-184.
- Turner, R.E., Walters, T.C., Monaghan, J.J. and Patterson, R.D. (2009). “A statistical, formant-pattern model for segregating vowel type and vocal-tract length in developmental formant data.” J. Acoust. Soc. Am., 125, p.2374-2386.
References not in database
- Arai, T. (2006). “Sliding three-tube model as a simple educational tool for vowel production.” Acoust. Sci. & Tech., 27, 6.
- Hodge, M. (1989). “A comparison of spectral-temporal measures across speaker age: Implications for an acoustic characterization of speech maturation.” Unpublished doctoral dissertation, University of Wisconsin-Madison.
- Hollien, H., Green, R. and Massey, K. (1994). “Longitudinal research on adolescent voice change in males.” J. Acoust. Soc. Am., 96, p.2646-2654.
- Kent, R.D. and Vorperian, H.K. (1995). “Anatomic development of the craniofacial-oral-laryngeal systems: A review.” Journal of Medical Speech-Language Pathology, 3, p.145-190.
- Vorperian, H.K. and Kent, R.D. (2007). “Vowel Acoustic Space Development in Children: A Synthesis of Acoustic and Anatomic Data.” Journal of Speech, Language, and Hearing Research, 50, p.1510-1545.