Broad phonetic classification using discriminative Bayesian networks
Article code | Publication year | Length of English article |
---|---|---|
28762 | 2009 | 16 pages (PDF) |
Publisher : Elsevier - ScienceDirect
Journal : Speech Communication, Volume 51, Issue 2, February 2009, Pages 151–166
English abstract
We present an approach to broad phonetic classification, defined as mapping acoustic speech frames into broad (or clustered) phonetic categories. Our categories consist of silence, general voiced, general unvoiced, mixed sounds, voiced closure, and plosive release, and are sufficiently rich to allow accurate time-scaling of speech signals to improve their intelligibility in, e.g., voice-mail applications. There are three main aspects to this work. First, in addition to commonly used speech features, we employ acoustic time-scale features based on the intra-scale relationships of the energy from different wavelet subbands. Second, we use discriminatively learned Bayesian networks and compare them against standard classifiers. By this, we mean Bayesian networks whose structure and/or parameters have been optimized using a discriminative objective function. We utilize a simple order-based greedy heuristic for learning discriminative structure based on mutual information. Given an ordering, we can find the discriminative classifier structure with $O(N^q)$ score evaluations (where $q$ is the maximum number of parents per node). Third, we provide a large assortment of empirical results, including gender dependent/independent experiments on the TIMIT corpus. We evaluate both discriminative and generative parameter learning on both discriminatively and generatively structured Bayesian networks and compare against generatively trained Gaussian mixture models (GMMs), and discriminatively trained neural networks (NNs) and support vector machines (SVMs).
Results show that: (i) the combination of time-scale features and mel-frequency cepstral coefficients (MFCCs) provides the best performance; (ii) discriminative learning of Bayesian network classifiers is superior to the generative approaches; (iii) discriminative classifiers (NNs and SVMs) perform better than both discriminatively and generatively trained and structured Bayesian networks; and (iv) the advantages of generative yet discriminatively structured Bayesian network classifiers still hold in the case of missing features while the discriminatively trained NNs and SVMs are unable to deal with such a case. This last result is significant since it suggests that discriminative Bayesian networks are the most appropriate approach when missing features are common.
English introduction
Automatic broad speech unit classification is crucial for a number of different speech processing methods and various speech applications. We define broad phonetic classification as processing that maps a speech signal into a sequence of integers, where each integer represents a coarser-grained category than that of a phone. While mapping to a sequence of phones, or at least a distribution over such sequences, is a favored approach to automatic speech recognition (ASR), broad phonetic classification is useful for a number of distinct applications. For example, some speech coding and compression systems use broad phonetic classification to determine the number of bits that should be allocated for each speech frame (Kubin et al., 1993). Such a source-controlled variable rate coder would for example allocate more bits to voiced and mixed frames than to unvoiced frames, and would assign only a few bits to silence frames (Zhang et al., 1997). In Internet telephony applications (Sanneck, 1998), for example, the adaptive loss concealment algorithm is based on a voiced/unvoiced detector at the sender. This helps the receiver to conceal the loss of information due to the similarity between the lost segments and the adjacent segments. As another example, the utilization of information about broad phonetic classes can improve the perceptual quality of time-scaling algorithms for speech signals (Kubin and Kleijn, 1994) – a desirable capability in voice-mail and voice-storage applications as it allows the user to listen to messages in a fraction of the original recording time. A speech utterance can be efficiently time-scaled by applying different scaling factors to different speech segments, depending on the broad phonetic characteristics, without reducing its quality and naturalness (Donnellan et al., 2003). 
It was concluded in (Kuwabara and Nakamura, 2000) that voiced frames need to be more affected by time-scaling than mixed frames, and much more than unvoiced frames (Campbell and Isard, 1991). To maintain the characteristics of plosives or parts of plosives (a closure or release), time-scale modification should not be applied to them. Silence frames, moreover, should be treated like voiced frames (Donnellan et al., 2003). A broad phonetic classifier can also be used as a pre-classification step to support the phonetic transcription of very large databases, thereby making the transcriber’s job much easier and less costly. Furthermore, it can be used as a step in addition to word labeling for preparing corpora for concatenative synthesis. Broad phonetic classification can also be fused into standard speech recognition systems at levels other than the acoustic feature vector (Subramanya et al., 2005 and Bartels and Bilmes, 2007) and can also be used to facilitate out-of-vocabulary (OOV) detection (Lin et al., 2007). In order to improve the robustness of automatic speech recognition, moreover, Kirchhoff et al. (2002) investigated the benefits of articulatory phonetics by using 28 articulatory features, both as an alternative to, and in combination with, standard acoustic features for acoustic modeling. For a similar purpose, framewise phonetic classification of the TIMIT database has been performed using Gaussian mixture models (GMMs) for four manner classes (Halberstadt and Glass, 1997), and support vector machines (SVMs) (Salomon et al., 2002) and large margin GMMs (Fei and Saul, 2006) have been used for 39 phonetic classes. Recently, ratio semi-definite classifiers have been developed and applied to phoneme classification (Malkin and Bilmes, 2008). In this article, several general-purpose broad phonetic classifiers have been developed for classifying speech frames into either four or six broad phonetic classes.
Besides the silence class (S), we also consider a voiced class (V) which includes vowels, semivowels, diphthongs and nasals, an unvoiced class (U) which includes only unvoiced fricatives, and a mixed-excitation class (M) including voiced and glottal fricatives. Furthermore, we are interested in plosives, which are formed by two parts, a closing and a release (R) of a vocal-tract articulator. Normally, plosives have a transient characteristic, whereas voiced, unvoiced, and mixed sounds are continuant sounds. While the closed interval of unvoiced plosives is similar to silence, voiced plosives have a subtle voiced closure interval (VC) which has a periodic structure at very low power (Olive et al., 1993). There are three main contributions of this work: (1) in tandem with more traditional acoustic features, we employ wavelet-derived acoustic features that are useful to represent speech in, e.g., the aforementioned VC interval; (2) we use discriminatively learned Bayesian network classifiers and compare them to standard discriminative models of various forms; and (3) we provide results that compare the various classifiers, in particular in the case of missing acoustic features. These contributions are summarized in this section and then fully described within the article. First, in order to improve the detection of subtle cues in our broad phonetic categories, we use wavelet-derived features in addition to commonly used time-domain features (Kedem, 1986 and Childers et al., 1989) and mel-frequency cepstral coefficient (MFCC) features. We extract time-scale features by applying the discrete wavelet transform (DWT) and then performing additional processing (full details are given below). We show that the intra-scale relations of the energy from different wavelet subbands are useful for reflecting the acoustic properties of our phonetic classes.
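As a rough illustration of what energy-based subband features can look like, the sketch below decomposes one frame with a Haar DWT and returns the energy of each subband together with log-energy ratios between adjacent subbands. The Haar wavelet, the number of levels, and the ratio definition are assumptions made for illustration only; this excerpt does not specify the wavelet or the exact feature definition used in the paper, and the function names are hypothetical.

```python
import numpy as np

def haar_dwt_subbands(frame, levels):
    """Haar DWT: return the detail coefficients of each level and the final approximation."""
    approx = np.asarray(frame, dtype=float)
    details = []
    for _ in range(levels):
        if len(approx) % 2:  # pad to even length by repeating the last sample
            approx = np.append(approx, approx[-1])
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # high-pass (detail) branch
        approx = (even + odd) / np.sqrt(2.0)         # low-pass (approximation) branch
    return details, approx

def subband_log_energy_ratios(frame, levels=3, eps=1e-12):
    """Hypothetical feature: subband energies and log-ratios of adjacent subbands."""
    details, approx = haar_dwt_subbands(frame, levels)
    energies = np.array([np.sum(d ** 2) for d in details] + [np.sum(approx ** 2)])
    log_energies = np.log(energies + eps)  # eps avoids log(0) for silent subbands
    return energies, np.diff(log_energies)
```

Because the Haar transform is orthonormal, the subband energies of an even-length frame sum to the frame energy, which gives a quick sanity check on the decomposition.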
Numerous approaches have been proposed for classifying speech units given a set of speech features, one of the earliest being that of Atal and Rabiner (1976). In this work, by speech unit classification, we specifically mean frame-by-frame classification, where the speech signal has been segmented into overlapping fixed-length time windows, and where each window is then input to a classifier whose goal is to decide the correct category of the speech at the center of that window. This then becomes a standard pattern classification problem. Generally, there are two avenues for such classifiers, generative and discriminative (Jebara, 2001, Bilmes et al., 2001, Bahl et al., 1986, Ephraim et al., 1989, Ephraim and Rabiner, 1990, Juang and Katagiri, 1992, Juang et al., 1997, Bishop and Lasserre, 2007 and Pernkopf and Bilmes, 2008). Let $X_{1:N}$ be a set of $N$ features and $C$ be a class variable. Generative models in one way or another represent the joint distribution $p(X_{1:N}, C)$ or at least $p(X_{1:N} \mid C)$. Generative models can be trained either generatively (which means optimizing an objective function that is maximized when the joint distribution scores a data set highly, such as penalized maximum-likelihood (ML)) or discriminatively (which means using a discriminative objective to train a generative model (Pernkopf and Bilmes, 2008)). Discriminative models are those that inherently represent either the conditional distribution $p(C \mid X_{1:N})$ directly, or alternatively represent only the decision regions in $X_{1:N}$ between classes, and are specified based on some discriminant function $f(X_{1:N}, C)$ which has no normalization constraints (and thus is not guaranteed to provide a probabilistic interpretation; only the rank order is important).
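To make the link between the two views concrete: a generative model that represents the joint distribution yields the class posterior through Bayes' rule, and classification picks the most probable class. In standard notation (the lowercase values $x_{1:N}$, $c$ and the estimate $\hat{c}$ are symbols introduced here, not taken from the paper):

$$
p(c \mid x_{1:N}) = \frac{p(x_{1:N}, c)}{\sum_{c'} p(x_{1:N}, c')}, \qquad \hat{c} = \arg\max_{c} \, p(c \mid x_{1:N}) = \arg\max_{c} \, p(x_{1:N}, c).
$$

The denominator does not depend on $c$, so it leaves the arg max unchanged; this is why a discriminant function needs only the correct rank order over classes and no normalization.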
Discriminative models are trained using only discriminative objective functions, such as conditional likelihood or some form of exact or smoothed loss function (Bartlett et al., 2006). Generative approaches (such as the Gaussian mixture model (Leung et al., 1993 and Duda et al., 2001) or the hidden Markov model (Levinson et al., 1989 and Rabiner, 1989)) have in the past been used for phonetic classification as well as speech recognition. Some of the most prominent discriminative models are neural networks (NNs) (Bishop, 1995, Mitchell, 1997 and Duda et al., 2001) and support vector machines (SVMs) (Schölkopf and Smola, 2001 and Burges, 1998), which have also been widely applied to the problem of speech classification (Bourlard and Morgan, 1994, Minghu et al., 1996, Salomon et al., 2002, Smith and Gales, 2002, Pham and Kubin, 2005 and Borys and Hasegawa-Johnson, 2005), although this limited set of references does not do the field justice. Our second main contribution in this work is that we employ discriminatively learned Bayesian network classifiers. Specifically, we apply both discriminative parameter learning by optimizing conditional likelihood (CL) and generative maximum-likelihood (ML) parameter training on both discriminatively and generatively structured Bayesian networks. We use either CL or classification rate (CR) (equivalently, empirical risk) for producing discriminative structure. These classifiers are further restricted to be either naive Bayes (NB) classifiers (where all features are assumed independent given the class variable) or relaxations of such an approach (where the features are no longer presumed independent given the class, such as 1-tree or 2-tree augmented naive Bayes (TAN)). We use an algorithm for discriminative structure learning of Bayesian networks based on a computed variable order (Pernkopf and Bilmes, 2008). The proposed metric for establishing the ordering of the features is based on the conditional mutual information.
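For readers unfamiliar with the quantity, the sketch below estimates the conditional mutual information $I(C; X \mid Y)$ between a class variable and a discrete feature given another discrete feature, using empirical counts. It shows only the estimator; how the paper combines such values into a feature ordering is not stated in this excerpt, and the function name is hypothetical.

```python
from collections import Counter
from math import log

def conditional_mutual_information(c, x, y):
    """Plug-in estimate of I(C; X | Y) in nats from three equal-length discrete sequences."""
    n = len(c)
    n_cxy = Counter(zip(c, x, y))
    n_cy = Counter(zip(c, y))
    n_xy = Counter(zip(x, y))
    n_y = Counter(y)
    cmi = 0.0
    for (ci, xi, yi), count in n_cxy.items():
        # p(c,x,y) * p(y) / (p(c,y) * p(x,y)); the 1/n factors cancel
        ratio = count * n_y[yi] / (n_cy[(ci, yi)] * n_xy[(xi, yi)])
        cmi += (count / n) * log(ratio)
    return cmi
```

A feature that duplicates the class given an uninformative conditioning variable scores the class entropy, while a feature fully determined by the conditioning variable scores zero, since it adds no further information about the class.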
Given a resulting ordering, we can find the discriminative network structure with $O(N^q)$ score evaluations (the constant $q$ limits the number of parents per node). Hence, e.g., the TAN classifier can be discriminatively optimized in $O(N^2)$ queries using either CL or CR as an evaluative score function. We present results for framewise broad phonetic classification using the TIMIT database (Lamel et al., 1986). We provide classification results using Bayesian network classifiers on time-scale features and on MFCC features. Additionally, we compare our Bayesian network classifiers to GMMs, NNs, and SVMs on the joint time-scale and MFCC feature set. Gender dependent and gender independent experiments have been performed to assess the influence on the classification rate (CR). A third contribution of our work is in the case of missing features. A primary advantage of our generative Bayesian networks over standard discriminative models (such as NNs and SVMs) is that they can be applied to cases where some of the features are at times missing (or known to be highly unreliable and thus useless). This is done essentially by marginalizing over the unknown (or unreliable) variables, something that is still possible since the model is inherently generative, even if it is discriminatively trained. Spectro-temporal regions of speech which are dominated by noise can, for example, be treated as missing or unreliable (Cooke et al., 2001 and Raj and Stern, 2005). What is not known, however, is whether discriminatively trained generative models still hold a performance advantage in the broad phonetic classification domain, something which we investigate and verify in this work. In particular, we find that discriminatively trained Bayesian network classifiers still hold an advantage over generatively trained ones in the case of missing features. The paper is organized as follows: our DWT and multiresolution analysis is introduced in Section 2.1.
Section 2.2 studies intra-scale relations of the energy from different wavelet subbands with respect to the phonetic classes. This section also introduces the time-scale features used for classification. Section 3 introduces Bayesian network classifiers and different network structures. The most commonly used approaches for generative and discriminative structure learning are summarized in Section 3.2. Section 3.3 describes our OMI heuristic for efficient discriminative structure learning. Experiments on the TIMIT database and the discussion are presented in Section 4. Section 5 concludes and gives perspectives for future research. The abbreviations are summarized in Appendix B.
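The marginalization over missing features mentioned in the introduction is simplest to see for a naive Bayes classifier with discrete features. The sketch below is a minimal illustration under that assumption; the function name, argument layout, and the use of `None` to mark a missing feature are hypothetical choices, not the paper's implementation.

```python
import numpy as np

def nb_posterior(prior, cpts, x):
    """Class posterior of a discrete naive Bayes model with optional missing features.

    prior: array of shape (K,) holding p(c)
    cpts:  list of arrays, cpts[i] of shape (K, V_i) holding p(x_i = v | c)
    x:     list of observed value indices, with None marking a missing feature
    """
    log_joint = np.log(np.asarray(prior, dtype=float))
    for table, value in zip(cpts, x):
        if value is None:
            # Marginalizing x_i: sum_v p(v | c) = 1, so the factor drops out.
            continue
        log_joint = log_joint + np.log(np.asarray(table, dtype=float)[:, value])
    log_joint = log_joint - log_joint.max()  # stabilize before exponentiating
    posterior = np.exp(log_joint)
    return posterior / posterior.sum()
```

In naive Bayes a missing feature simply drops its factor. In structures with edges between features, such as TAN, a missing feature may be a parent of an observed one, so the sum over its values has to be carried out explicitly.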
English conclusion
Bayesian networks, Gaussian mixture models, neural networks, and support vector machines are used to classify speech frames into the broad phonetic classes of silence, voiced, unvoiced, and mixed sounds, plus two further categories, voiced closure and plosive release. The classification is based on time-scale features derived from the discrete wavelet transform, on MFCCs, and on the combination of both. Gender dependent/independent experiments have been performed using the TIMIT database. Discriminative and generative parameter and/or structure learning approaches are used for learning the Bayesian network classifiers. We introduce a simple order-based greedy heuristic for learning a discriminative Bayesian network structure. We show that the proposed metric for establishing the ordering performs better than simple random ordering. We observed that the time-scale features and the MFCC features complement each other. The combination of both feature sets improves the (absolute) classification accuracy by approximately 0.5%. Discriminative structure learning of Bayesian networks is superior to the generative approach. In particular, the discriminative 2-tree Bayesian network classifier significantly outperforms all other Bayesian network classifiers and the Gaussian mixture model. The best classification performances are achieved with neural networks and support vector machines. However, in contrast to neural networks and support vector machines, a Bayesian network is a generative model. A generative model has the advantage that it is easy to work with missing features, and a generative Bayesian network can still be trained and structured discriminatively without losing its generative capability. We show that discriminatively structured Bayesian network classifiers are superior to generative approaches even in the case of missing features. Future work will focus on the application of the broad phonetic classifier for speech modification such as time-scaling.
Based on the phonetic information of every speech frame, the proper time-scaling factors are assigned to achieve better quality and naturalness of the scaled speech. Additionally, we intend to investigate the influence of the broad phonetic classification on the selection of proper smoothing strategies at concatenation points for preparing databases for concatenative synthesis.