In a text-to-speech system, the duration of each phone may be predicted by a duration model. This model is usually trained using a database of phones with known durations; each phone (and the context it appears in) is characterised by a feature vector that is composed of a set of linguistic factor values. We describe the use of a graphical model – a Bayesian network (BN) – for predicting the duration of a phone, given the values for these factors. The network has one discrete variable for each of the linguistic factors and a single continuous variable for the phone’s duration. Dependencies between variables (or the lack of them) are represented in the BN structure by arcs (or missing arcs) between pairs of nodes. During training, both the topology of the network and its parameters are learned from labelled data. We compare the results of the BN model with results for sums-of-products (SoP) and classification and regression tree (CART) models on the same data. In terms of the root mean square error, the BN model performs much better than both CART and SoP models. In terms of correlation coefficient, the BN model performs better than the SoP model, and as well as the CART model. A BN model has certain advantages over CART and SoP models. Training SoP models requires a high degree of expertise. CART models do not deal with interactions between factors in any explicit way. As we demonstrate, a BN model can also make accurate predictions of a phone’s duration, even when the values for some of the linguistic factors are unknown.
We present comparative experimental results for three classes of phone duration prediction model: Bayesian networks (BNs), sums-of-products (SoP) models (van Santen, 1992), and classification and regression trees (CARTs). The principal application for these models is in text-to-speech synthesis.
In text-to-speech systems, it is often necessary to predict the prosody of the output speech; segment durations are an important aspect of prosody. Although in some unit-selection systems, such as Festival 2 (Clark et al., 2004), no prediction of duration is required, this can lead to unpredictable prosody in the output speech. Even if the predicted durations are not imposed on the selected units via signal processing, the prediction of phone duration can still be used to compute a duration component of the target cost. In other cases, such as non-concatenative systems (e.g. Hidden Markov Model approaches, Tokuda et al., 2002) or expressive/emotional speech synthesis (e.g. Strom et al., 2006), explicit prediction of phone durations is necessary. Since duration is a factor affecting listeners’ perception of the naturalness of synthetic speech (e.g. Mayo et al., 2005), there is still a need for accurate duration predictions.
In common with many other areas of speech and language processing, the databases used to train phone duration models are unbalanced. In the space of all possible combinations of linguistic factor values, only some are linguistically plausible and, of those, only a small fraction will actually be observed in any corpus. Of the observed feature vectors (these are vectors of linguistic factor values), many will be very rare – i.e. low in frequency. However, as was shown by van Santen (1994), the joint probability mass of all these rare vectors taken together is sufficiently large to mean that they cannot simply be neglected. In other words, in any individual sentence, it is very likely that we will encounter one or more of these rare vectors. Therefore, models of phone duration must be robust: they must predict appropriate durations for rare (and indeed previously unseen) vectors.
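The step from "large joint probability mass" to "very likely in any individual sentence" can be made concrete with a small calculation. The sketch below is illustrative only: the rare-vector mass of 0.05, the sentence length of 40 phones, and the function name are hypothetical values chosen for the example, and the phones in a sentence are assumed to be independent draws, which real speech does not strictly satisfy.

```python
def prob_at_least_one_rare(rare_mass: float, n_phones: int) -> float:
    """Probability that a sentence of n_phones contains at least one rare vector.

    rare_mass is the joint probability mass of all rare feature vectors.
    Assumption: each phone's feature vector is drawn independently.
    """
    # Complement of "every phone in the sentence has a non-rare vector".
    return 1.0 - (1.0 - rare_mass) ** n_phones


# Hypothetical values: rare vectors hold 5% of the mass, sentence of 40 phones.
p = prob_at_least_one_rare(rare_mass=0.05, n_phones=40)
print(f"P(at least one rare vector) = {p:.2f}")
```

Under these assumed values the probability is roughly 0.87, even though any single phone has only a 5% chance of carrying a rare vector.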
In addition, there is the problem of factor confounding: different combinations of factor values occur with unequal frequencies in the training database. As a result, raw durations calculated from the database can be deceptive. van Santen (1994) gives an example of confounding between within-word position and stress. Durations of vowels turn out to be shorter in word-final syllables than in non-word-final syllables, if stressed and unstressed vowels are analysed together. However, unstressed vowels are shorter than stressed vowels, and word-final syllables are five times more likely to be unstressed than stressed. So, if stressed and unstressed vowels are analysed separately, the vowel duration in final syllables (all other factors being equal) is longer than in non-final syllables, as we would expect.
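The confounding effect described above can be reproduced with a few lines of arithmetic. The counts and mean durations below are hypothetical, chosen only so that word-final syllables are five times more likely to be unstressed than stressed; they are not taken from van Santen (1994) or from any corpus.

```python
# Hypothetical cells: (position, stress) -> (token count, mean vowel duration in ms).
cells = {
    ("final", "stressed"): (100, 130.0),
    ("final", "unstressed"): (500, 70.0),   # 5x as many as stressed finals
    ("nonfinal", "stressed"): (300, 110.0),
    ("nonfinal", "unstressed"): (300, 60.0),
}


def pooled_mean(position: str) -> float:
    """Mean duration for a position, ignoring stress (count-weighted average)."""
    total = sum(n * d for (pos, _), (n, d) in cells.items() if pos == position)
    count = sum(n for (pos, _), (n, _) in cells.items() if pos == position)
    return total / count


# Pooled over stress, final vowels look shorter than non-final ones.
print("pooled final   :", pooled_mean("final"))
print("pooled nonfinal:", pooled_mean("nonfinal"))

# Within each stress level, final vowels are longer.
for stress in ("stressed", "unstressed"):
    final_ms = cells[("final", stress)][1]
    nonfinal_ms = cells[("nonfinal", stress)][1]
    print(stress, "final longer than nonfinal:", final_ms > nonfinal_ms)
```

With these assumed numbers the pooled mean is 80 ms for final syllables and 85 ms for non-final syllables, yet final vowels are longer within both the stressed and the unstressed group. The reversal comes entirely from the unequal cell counts.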
The linguistic factors affecting a phone’s duration interact with one another; the value of one or more factors may amplify or attenuate the effect of another factor. van Santen (1994) showed that these effects are easily predicted.
A robust model for predicting phone duration must address all of these issues. It should generalise well in order to successfully predict the duration of phones with rare (or previously unseen) feature vectors. It may be desirable to allow some factors to be unspecified or have ambiguous values; this would be the case if these factors’ values are predicted by some other model which is not 100% accurate – for example, part of speech or features relating to the position of syllable boundaries.
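One standard way for a probabilistic model to handle an uncertain factor is to average its predictions over the possible values of that factor, weighted by how likely each value is. The sketch below illustrates this idea only; it is not the network described in this paper. The two factors, the table of mean durations, the belief over part of speech, and the function name are all hypothetical.

```python
# Hypothetical mean durations (ms) indexed by (stress, part_of_speech).
mean_duration = {
    ("stressed", "content"): 120.0,
    ("stressed", "function"): 90.0,
    ("unstressed", "content"): 70.0,
    ("unstressed", "function"): 50.0,
}


def expected_duration(stress: str, pos_belief: dict) -> float:
    """Expected duration when stress is known and part of speech is uncertain.

    pos_belief maps each part-of-speech value to its probability, e.g. the
    output of an imperfect tagger; the probabilities should sum to 1.
    """
    return sum(p * mean_duration[(stress, pos)] for pos, p in pos_belief.items())


# Hypothetical tagger output: 80% content word, 20% function word.
belief = {"content": 0.8, "function": 0.2}
print(expected_duration("stressed", belief))
```

When the tagger is certain, the belief collapses to a single value and the function returns the corresponding table entry, so the fully specified case needs no separate handling.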
We expect a duration model that properly accounts for factor interactions and confounding to be more accurate than a model that does not.
We have demonstrated that Bayesian network models can be successfully used to predict phone duration and that they outperform CART and SoP models. Building and training these models can be time-consuming but, once the model is trained, it is computationally very cheap to use for duration prediction, since it is essentially a look-up table. The BN structures found here could probably be used directly on other voice databases (i.e. relearn only the parameters, not the network structure), particularly for consonants.