Improving phone duration modelling using support vector regression fusion
|Article code||Publication year||English paper||Persian translation||Word count|
|25304||2011||13-page PDF||Available on order||Not computed|
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Speech Communication, Volume 53, Issue 1, January 2011, Pages 85–97
In the present work, we propose a scheme for the fusion of different phone duration models operating in parallel. Specifically, the predictions from a group of dissimilar and mutually independent individual duration models are fed to a machine learning algorithm, which reconciles and fuses the outputs of the individual models, yielding more precise phone duration predictions. The performance of the individual duration models and of the proposed fusion scheme is evaluated on the American-English KED TIMIT and the Greek WCL-1 databases. On both databases, the SVR-based individual model demonstrates the lowest error rate. When compared to the second-best individual algorithm, a relative reduction of the mean absolute error (MAE) and the root mean square error (RMSE) by 5.5% and 3.7% on KED TIMIT, and by 6.8% and 3.7% on WCL-1, is achieved. At the fusion stage, we evaluate the performance of 12 fusion techniques. The proposed fusion scheme, when implemented with SVR-based fusion, improves the phone duration prediction accuracy over that of the best individual model by 1.9% and 2.0% in terms of relative reduction of the MAE and RMSE on KED TIMIT, and by 2.6% and 1.8% on the WCL-1 database.
In text-to-speech synthesis (TTS) there are two major issues concerning the quality of the synthetic speech, namely intelligibility and naturalness (Dutoit, 1997 and Klatt, 1987). The former refers to the capability of a synthesized word or phrase to be comprehended by the average listener. The latter represents how close to natural human speech the synthetic speech is perceived to be. One of the most important factors for achieving intelligibility and naturalness in synthetic speech is the accurate modelling of prosody. Prosody can be regarded as the implicit channel of information in the speech signal that conveys linguistic, paralinguistic and extralinguistic information related to communicative functions. Such functions include the linguistic functions of prominence (stress and accent), phrasing, discourse segmentation, the expression of emphasis, attitude and assumptions, the emotional state of the speaker, and information about the identity of the speaker (particularly with respect to habitual factors). These functions provide the listener with clues supporting the recovery of the verbal message (Clark and Yallop, 1995, Laver, 1980 and Laver, 1994). The accurate modelling and control of prosody in a text-to-speech system leads to synthetic speech of higher quality. Prosody is shaped by the relative level of the fundamental frequency, the intensity and, last but not least, the duration of the pronounced phones (Dutoit, 1997 and Furui, 2000). The duration of the phones controls the rhythm and the tempo of speech (Yamagishi et al., 2008), and flattening the prosody of a speech waveform would result in monotonous, neutral, toneless and rhythmless synthetic speech, which sounds unnatural and unpleasant to the listener, or is sometimes even scarcely intelligible (Chen et al., 2003). Thus, the accurate modelling of phone durations is essential in speech processing.
Several areas of speech technology, among them TTS, automatic speech recognition (ASR) and speaker recognition, benefit from duration modelling. In TTS, correct segmental duration contributes to the naturalness of synthetic speech (Chen et al., 1998 and Klatt, 1976). In hidden Markov model (HMM)-based ASR, state duration models improve speech recognition performance (Bourlard et al., 1996, Jennequin and Gauvain, 2007, Levinson, 1986, Mitchell et al., 1995 and Pols et al., 1996). Finally, a significant improvement of performance in the speaker recognition task was achieved by Ferrer et al. (2003) when duration-based speech parameters were used for the characterization of the speaker’s voice. Various approaches to segment duration modelling, and many factors influencing segmental duration, have been studied in the literature (Bellegarda et al., 2001, Crystal and House, 1988, Edwards and Beckman, 1988, Riley, 1992, Shih and Ao, 1997 and van Santen, 1994). The features related to these factors can be extracted from several levels of linguistic information, such as the phonetic, morphological and syntactic levels. With respect to the way duration models are built, duration prediction approaches can be divided into two major categories: rule-based methods (Klatt, 1976) and data-driven methods (Campbell, 1992, Chen et al., 1998, Lazaridis et al., 2007, Monkowski et al., 1995, Rao and Yegnanarayana, 2005, Riley, 1992, Takeda et al., 1989 and van Santen, 1992). Rule-based methods use manually produced rules, extracted from experimental studies on large sets of utterances or based on previous knowledge. The extraction of these rules requires the labour of expert phoneticians. In the most prominent attempt in the rule-based duration modelling category, Klatt (1976) used rules derived from the analysis of a phonetically balanced set of sentences to predict segmental duration.
These rules were based on linguistic information such as positional and prosodic factors. Initially, an intrinsic (starting) value was assigned to each phone, which was then modified according to the extracted rules. Models of this and similar types were developed for many languages, such as French (Bartkova and Sorin, 1987), Swedish (Carlson and Granstrom, 1986), German (Kohler, 1988) and Greek (Epitropakis et al., 1993 and Yiourgalis and Kokkinakis, 1996), as well as for several dialects, such as American English (Allen et al., 1987 and Olive and Liberman, 1985) and Brazilian Portuguese (Simoes, 1990). The main disadvantage of rule-based approaches is the difficulty of manually representing and tuning all the linguistic factors, such as the phonetic, morphological and syntactic ones, which influence segmental duration in speech. As a result, it is very difficult to collect all the appropriate (or even enough) rules without long-term devotion to this task (Klatt, 1987). Consequently, rule-based duration models are restricted to controlled experiments in which only a limited number of contextual factors are involved, so that the interaction among these factors can be deduced and the corresponding rules extracted (Rao and Yegnanarayana, 2007). Data-driven methods for the task of phone duration modelling were developed after the construction of large databases (Kominek and Black, 2003). Data-driven approaches overcame the problem of manual rule extraction by employing either statistical methods or artificial neural network (ANN) based techniques, which automatically produce phonetic rules and construct duration models from large speech corpora. Their main advantage is that this process is automated, which significantly reduces the effort required of phoneticians. Several machine learning methods have been used in the phone duration modelling task.
Linear regression (LR) models (Takeda et al., 1989) are based on the assumption of linear independence among the features which affect segmental duration. These models achieve reliable predictions even with a small amount of training data, but do not model the dependencies among the features. On the other hand, decision tree models (Monkowski et al., 1995), and in particular classification and regression tree (CART) models (Riley, 1992), which are based on binary splitting of the feature space, can represent the dependencies among the features but cannot impose constraints of linear independence for reliable predictions (Iwahashi and Sagisaka, 2000). Another technique which has been used for the phone duration modelling task is the sums-of-products (SOP) approach, in which the segment duration prediction is based on a sum of the factors, and their product terms, that affect the duration (van Santen, 1992 and van Santen, 1994). The advantage of these models is that they can be trained with a small amount of data. Bayesian network models have also been applied to the phone duration prediction task. These models incorporate a straightforward representation of the problem domain information and, despite their time-consuming training phase, can make accurate predictions even when unknown values are encountered in some features (Goubanova and King, 2008 and Goubanova and Taylor, 2000). Furthermore, instance-based algorithms (Lazaridis et al., 2007) have been used in phone duration modelling. In instance-based approaches the training data are stored, and a distance function is employed during the prediction phase to determine which member of the training set is closest to the test instance and thereby predict the phone duration. In a recent study (Yamagishi et al., 2008), the gradient tree boosting (GTB) approach (Friedman, 2001 and Friedman, 2002) was proposed for the phone duration modelling task as an alternative to the conventional approach using regression trees.
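To make the instance-based approach concrete, the following minimal sketch stores training instances and predicts a phone's duration from its nearest neighbour under a Euclidean distance. The feature encoding and duration values are invented for illustration, not taken from the paper:

```python
import math

# Each training instance: (feature vector, observed phone duration in ms).
# The features (e.g. phone class index, stress flag, position in syllable)
# are hypothetical placeholders for the linguistic features discussed above.
train = [
    ((1.0, 0.0, 0.2), 85.0),
    ((1.0, 1.0, 0.5), 110.0),
    ((2.0, 0.0, 0.9), 60.0),
]

def euclidean(a, b):
    # Distance function used to compare a test instance with stored instances.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_duration(features, instances=train):
    # 1-nearest-neighbour: return the duration of the closest stored instance.
    _, duration = min(instances, key=lambda inst: euclidean(inst[0], features))
    return duration

print(predict_duration((1.0, 1.0, 0.4)))  # nearest to the stressed instance -> 110.0
```

Real instance-based duration models would use richer feature sets and typically average over k neighbours, but the store-then-search mechanism is the same.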
The GTB algorithm is a meta-algorithm which constructs multiple regression trees and takes advantage of them jointly. For the task of syllable duration modelling, various neural networks have been used, including feedforward neural networks (Campbell, 1992 and Rao and Yegnanarayana, 2007) and recurrent neural networks (RNN) (Chen et al., 1998). Furthermore, in the case of syllable duration prediction, the SVM regression model has been used to perform function estimation from the training instances by non-linearly mapping the data onto a high-dimensional feature space (Rao and Yegnanarayana, 2005). Iwahashi and Sagisaka (2000) proposed a scheme for the statistical modelling of prosody control in speech synthesis, based on a combination of regression trees and linear regression models. It offers a mechanism for evading the disadvantages inherent in one algorithm by benefiting from the advantages provided by another, which can be explained by the observation that different algorithms perform better under different conditions. In summary, data-driven approaches to phone duration modelling make it possible to avoid the time-consuming labour of the manual rule extraction required by rule-based approaches. However, as shown by van Santen and Olive (1990), these methods are not always satisfactory for the task of phone duration prediction. All previous studies on phone and syllable duration modelling are restricted to the use of a single linear or non-linear regression algorithm. The only exception to this trend is the work of Iwahashi and Sagisaka (2000), where a hierarchical structure for syllable duration prediction using the outputs of a phone duration model was employed.
However, this structure is restricted to the post-processing of a single duration prediction model, and no extension to a parallel regression fusion of the duration predictions of multiple models has been studied. In the present work, aiming at improving the accuracy of the prediction of segmental durations (here phone durations), we propose a fusion scheme based on the use of multiple dissimilar phone duration predictors which operate on a common input and whose predictions are combined using a regression fusion method. The proposed scheme is based on the observation that predictors implemented with different machine learning algorithms perform differently under dissimilar conditions. Hence, we suppose that an appropriate combination of their outputs could result in a new set of more precise phone duration predictions. Thus, an appropriate fusion scheme that can learn how to combine the outputs of a number of individual predictors in a beneficial manner will contribute to the reduction of the overall prediction error, when compared to the error of each individual predictor. Based on this assumption, we investigate various implementations of the proposed fusion scheme and study its accuracy for duration prediction on different levels of granularity: vowels/consonants, phonetic category and individual phones. To this end, we initially investigate the performance of eight linear and non-linear regression algorithms as individual predictors, five of them already examined in previous studies (Iwahashi and Sagisaka, 2000, Lee and Oh, 1999, Riley, 1992, Takeda et al., 1989 and Yamagishi et al., 2008). These are based on linear regression and decision trees – model trees, regression trees and pruning decision trees.
Furthermore, another two of them – the meta-learning algorithms additive regression and bagging, using REPTrees as the base learner – are modifications of algorithms already studied in the phone duration prediction task (Yamagishi et al., 2008); finally, the support vector regression (SVR) algorithm has, to the best of our knowledge, not yet been employed for the phone duration prediction task. Next, the durations predicted by the individual duration models are fed as inputs to a machine learning algorithm, referred to as the fusion model, which uses these predictions to produce the final phone duration prediction. For the purpose of fusion, we evaluate 12 different (linear and non-linear) regression fusion techniques, namely linear regression, decision trees, support vector regression, neural networks, meta-learning and lazy-learning algorithms, and finally average linear combination and best-case fusion. The present study was inspired by the work of Kominek and Black (2004), where a family of acoustic models, providing multiple estimates for each boundary point, was used for segmenting a speech database, creating synthetic speech of higher quality with a corpus-based unit selection TTS system. This approach was found to be more robust than a single estimate, since, by taking consensus values, large labelling errors are less prevalent in the synthesis catalogue, which improves the resulting synthetic speech. To the best of our knowledge, a parallel regression fusion of individual models has not yet been studied for either the phone duration prediction or the syllable duration prediction task. Furthermore, although SVR models have been used for syllable duration prediction (Rao and Yegnanarayana, 2005), they have not previously been employed for the phone duration prediction task. The remainder of this article is organized as follows. In Section 2 we outline the proposed fusion scheme.
In Section 3 we briefly outline the individual phone duration modelling algorithms, the algorithms used in the fusion scheme, the speech databases and the experimental setup used in the evaluation. The experimental results are presented and discussed in Section 4 and finally this work is concluded in Section 5.
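The parallel fusion idea described above can be sketched in a few lines. In this toy example (all numbers invented; the paper's actual predictors and SVR fusion model are not reproduced here), the outputs of two individual duration models on development data are combined by a linear fusion model whose weights are fit by least squares, which is the linear-regression variant among the fusion techniques the paper evaluates:

```python
# Fit weights w1, w2 (no intercept) minimising
# sum((w1*p1 + w2*p2 - target)^2) by solving the 2x2 normal equations.
def fuse_weights(p1, p2, target):
    a = sum(x * x for x in p1)
    b = sum(x * y for x, y in zip(p1, p2))
    c = sum(y * y for y in p2)
    d = sum(x * t for x, t in zip(p1, target))
    e = sum(y * t for y, t in zip(p2, target))
    det = a * c - b * b
    return (c * d - b * e) / det, (a * e - b * d) / det

# Durations (ms) predicted by two dissimilar individual models on a common
# set of development phones, and the observed durations (illustrative only).
model1 = [80.0, 100.0, 60.0, 120.0]
model2 = [90.0, 95.0, 70.0, 110.0]
observed = [86.0, 99.0, 64.0, 117.0]

w1, w2 = fuse_weights(model1, model2, observed)

def fused_prediction(d1, d2):
    # Fusion-stage output for a new phone, given the two individual predictions.
    return w1 * d1 + w2 * d2

print(fused_prediction(85.0, 88.0))
```

Because the least-squares solution can always fall back on either individual model (weights (1, 0) or (0, 1)), the fused predictor's squared error on the development data never exceeds that of the better individual model; the paper's non-linear SVR fusion generalizes this combination step.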
Conclusion (in English)
In this work we studied the accuracy of various machine learning algorithms on the task of phone duration modelling. The experimental results showed that, on this task, the support vector machine (SVM), used as a regression model, outperforms various other machine learning techniques. Specifically, in terms of relative decrease of the mean absolute error and the root mean square error, the SMO regression model outperformed the second-best model by approximately 5.5% and 3.7% on KED TIMIT, and by approximately 6.8% and 3.7% on the WCL-1 database, respectively. Furthermore, the proposed fusion scheme, which combines the predictions of multiple individual phone duration models operating on a common input, takes advantage of the observation that different prediction algorithms perform better in different situations. The experimental validation demonstrated that the fusion scheme improves the accuracy of phone duration prediction. The SVM-based fusion algorithm was found to outperform all other fusion techniques. Specifically, the fusion scheme based on the SVM regression algorithm outperformed the best individual predictor (SVM regression) by approximately 1.9% and 2.0% in terms of relative reduction of the mean absolute error and the root mean square error, respectively, on the KED TIMIT database, and by 2.6% and 1.8%, respectively, on the WCL-1 database.
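The error measures quoted throughout can be stated concretely. This short sketch (with made-up durations, not the paper's data) computes the MAE, the RMSE, and the relative reduction figure used to compare two models:

```python
import math

def mae(pred, ref):
    # Mean absolute error over paired predicted/observed durations.
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)

def rmse(pred, ref):
    # Root mean square error over the same pairs.
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

def relative_reduction(err_base, err_new):
    # e.g. a value of 0.019 corresponds to the 1.9% MAE reduction
    # reported on KED TIMIT for the SVR-based fusion scheme.
    return (err_base - err_new) / err_base

observed = [85.0, 110.0, 60.0, 120.0]   # observed phone durations (ms)
baseline = [80.0, 100.0, 70.0, 110.0]   # an individual model's predictions
fused    = [83.0, 105.0, 65.0, 115.0]   # fusion-stage predictions

print(mae(baseline, observed), mae(fused, observed))
print(relative_reduction(mae(baseline, observed), mae(fused, observed)))
```

With these illustrative numbers, the baseline MAE is 8.75 ms, the fused MAE is 4.25 ms, and the relative reduction is about 51%; the paper's reported 1.9-2.6% reductions are of course far smaller, being gains over an already strong SVR predictor.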