Anger recognition in speech using acoustic and linguistic cues
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Speech Communication, Volume 53, Issues 9–10, November–December 2011, Pages 1198–1209
The present study elaborates on the exploitation of both linguistic and acoustic feature modeling for anger classification. In terms of acoustic modeling we generate statistics from acoustic audio descriptors, e.g. pitch, loudness and spectral characteristics. Ranking our features, we see that loudness and MFCC features seem most promising for all databases. For the English database, pitch features are also important. In terms of linguistic modeling we apply probabilistic and entropy-based models of words and phrases, e.g. Bag-of-Words (BOW), Term Frequency (TF), Term Frequency–Inverse Document Frequency (TF.IDF) and Self-Referential Information (SRI). SRI clearly outperforms the vector space models. Modeling phrases slightly improves the scores. After classifying acoustic and linguistic information on separate levels, we fuse the information on the decision level by adding confidences. We compare the obtained scores on three different databases. Two databases are taken from the IVR customer care domain; a third stems from a WoZ data collection. All corpora reflect realistic speech conditions. We observe promising results for the IVR databases, while the WoZ database shows lower scores overall. In order to provide comparability between the results, we evaluate classification success using the f1-measurement in addition to overall accuracy figures. As a result, acoustic modeling clearly outperforms linguistic modeling. Fusion slightly improves overall scores. With a baseline of approximately 60% accuracy and .40 f1-measurement by constant majority class voting, we obtain an accuracy of 75% with a respective .70 f1 for the WoZ database. For the IVR databases we obtain approximately 79% accuracy with a respective .78 f1, over a baseline of 60% accuracy with a respective .38 f1.
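To make the linguistic models named above concrete, the following is a minimal sketch of the Self-Referential Information concept, assuming SRI is the log-ratio of a class's posterior probability given a word to the class's prior probability; the toy utterances and counts are hypothetical, not taken from the paper's corpora.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (utterance words, class label).
data = [
    (["this", "is", "useless"], "anger"),
    (["stop", "this", "now"], "anger"),
    (["thank", "you", "this", "helps"], "non-anger"),
    (["yes", "please"], "non-anger"),
]

# Prior class probabilities and per-word class counts (a Term
# Frequency / Bag-of-Words style count over utterances).
prior = {c: n / len(data) for c, n in Counter(l for _, l in data).items()}
word_class, word_total = Counter(), Counter()
for words, label in data:
    for w in set(words):
        word_class[(w, label)] += 1
        word_total[w] += 1

def sri(word, label):
    """Self-Referential Information: log2(P(class | word) / P(class)).
    Positive values mean the word raises the probability of the class."""
    posterior = word_class[(word, label)] / word_total[word]
    return math.log2(posterior / prior[label]) if posterior else float("-inf")

print(sri("useless", "anger"))  # log2(1.0 / 0.5) = 1.0
print(sri("this", "anger"))     # log2((2/3) / 0.5) ≈ 0.415
```

A word that occurs only in anger turns scores high, while a word spread evenly over both classes scores near zero, which is why SRI can discriminate better than raw term counts.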
Detecting emotions in vocal human–computer interaction (HCI) is gaining increasing attention in speech research. Moreover, classifying human emotions by means of automated speech analysis is achieving a level of performance that makes effective and reliable deployment possible. Emotion detection in interactive voice response (IVR) systems can be used to monitor quality of service or to adapt empathic dialog strategies (Yacoub et al., 2003 and Shafran et al., 2003). Anger recognition in particular can deliver useful information to both the customer and the carrier of IVR platforms. It may indicate potentially problematic turns or slots, which could in turn lead to improvements or refinements of the system. It can further serve as a trigger to switch between tailored dialog strategies for emotional conditions in order to better react to the user's behavior (Metze et al., 2008 and Burkhardt et al., 2005a), including the re-routing of customers to a human operator for assistance when problems occur. There are many ways in which a person's emotion can be conveyed. In the present voice-based scenario, however, two factors prevail: the choice of words and acoustic variation. When a speaker expresses an emotion while adhering to an inconspicuous intonation pattern, human listeners can nevertheless perceive the emotional information through the lexical content. On the other hand, words that are not generally emotionally salient can certainly be pronounced in a way that conveys the speaker's emotion in addition to the mere lexical meaning. Consequently, our task is to capture the diverse acoustic and linguistic cues that are present in the speech signal and to analyze their correlation with the speaker's emotion. Our linguistic approach analyzes the lexical information contained in the spoken words and its correlation with the emotion of anger. The level of anger connotation of a word can be estimated using various concepts.
First, we apply the concept of Emotional Salience (Lee and Narayanan, 2005 and Lee et al., 2008), which models posterior probabilities of a class given a word and combines this information with the prior probability of a class. This concept can be extended to include contextual information by modeling the salience of not just one word, but word combinations, i.e. n-grams. Further, we compare these models to traditional models from the related field of information retrieval, i.e. models that estimate term frequencies (TF) or words used (Bag-of-Words, BOW) as explained in Section 4. Our prosodic approach examines expressive patterns that are based on vocal intonation. Applying large-scale feature extraction, we capture these expressions by calculating a number of low-level acoustic and prosodic features, e.g. pitch, loudness, MFCC, spectral information, formants and intensity. We then derive statistics from these features. Mostly, the statistics encompass moments, extrema, linear regression coefficients and ranges of the respective acoustic contours. In order to gain insight into the importance of our features we rank them according to their information-gain ratio. Looking at high-ranked features we report on their distribution and numbers in total, as well as in relation to each other. Only the most promising features are retained in the final feature set for acoustic classification. In a final step, we fuse information from both linguistic and acoustic classification results to obtain a complex estimate of the emotional state of the user. We compare our features for three different corpora. One database comprises American English IVR recordings (Schmitt et al., 2010), another contains German IVR recordings (Burkhardt et al., 2009). Both databases account for mostly adult telephony conversations with customer-care hotlines and contain a high number of different speakers. 
A third database comprises recordings from a Wizard of Oz (WoZ) scenario conducted with a small number of German children (Steidl et al., 2005).

2. Related work and realistic database conditions

When comparing existing studies on anger recognition, one has to be aware of the precise conditions of the underlying database design, as many of the results published hitherto are based on acted speech data. Some of these databases include sets of prearranged sentences. Recordings are usually made in studios, minimizing background noise and recording speakers (one at a time) multiple times until a desired degree of expression is reached. Real-life speech shares none of these conditions. As much as 97% accuracy has been reported for the recognition of angry utterances in a 7-class recognition test performed by humans on the TU Berlin EMO-DB (Burkhardt et al., 2005b), which is based on speech produced by German-speaking professional actors. The lexical content is limited to 10 pre-selected sentences, all of which are conditioned to be interpretable in six different emotional contexts and one neutral context. The recordings are of wideband quality. Experiments on a subset which had received both high emotion recognition rates and high naturalness votes from human listeners resulted in 92% accuracy when Schuller (2006) classified the emotions and neutral speech automatically. A further anger recognition experiment was carried out on the DES database by Engberg and Hansen (1996), which comprises mostly read sentences but also some free text passages. A human anger recognition experiment on this database resulted in 75% accuracy for classification into 5 classes. All recordings are of wideband quality as well. Classifying this database automatically, Schuller (2006) reported an accuracy of 81%. When speakers are not acting, namely when there is no professional performance, we need to rely on the impressions of a number of independent listeners.
Since no agreed-upon common opinion exists on how a specific emotion 'sounds', it has become standard practice to take into account the opinions of several raters. To obtain a measurement of the consistency of such ratings, an inter-labeler agreement measure is often applied. It is defined as the count of labeler agreements, corrected for chance level and divided by the maximum possible count of such labeler agreements. It should be noted that the maximum agreement also depends on the task; for example, the inter-labeler agreement in a gender recognition task is expected to be higher than that in an anger rating task. We assume that low inter-labeler agreement on the different emotion categories in the training and test data predicts a low automatic classification score, since in cases where humans are uncertain about the classification, the classifier will likewise have difficulty differentiating between the classes. Batliner et al. (2000) further analyze emotion recognition performance degradations when comparing acted speech data, read speech data and spontaneous speech obtained from a WoZ scenario. Performance on acted speech data was much better in all considered experiments. Lee and Narayanan (2005) as well as Batliner et al. (2000) used realistic narrow-band IVR speech data from call centers. Both applied binary classification, with Batliner et al. (2000) discriminating angry from neutral speech and Lee and Narayanan (2005) classifying negative versus non-negative utterances. Given a two-class task, it is very important to know the prior probabilities of the class distribution. Batliner et al. (2000) reach an overall accuracy of 69% using Linear Discriminative Classification (LDC). Unfortunately, no class distribution or inter-labeler agreement is given for their corpus. Lee and Narayanan (2005) reach a gender-dependent accuracy of 81% for female and 82% for male speakers.
They measured inter-labeler agreement as 0.45 for male and 0.47 for female speakers, which can be interpreted as moderate agreement. For both gender classes, constant voting for the non-negative class would already achieve about 75% accuracy and would, without any classification, outperform the results obtained by Batliner et al. (2000). Exploiting acoustic and linguistic information, Schuller et al. (2004) and Lee and Narayanan (2005) apply late fusion strategies. Using predominantly acted emotions from the automotive domain, Schuller et al. (2004) combine acoustic and linguistic information in order to classify into seven emotional states. As they extract few acoustic features, the main difference to the present work lies in the incorporation of linguistic information: they hierarchically cluster individual words into bigger phrase and super-phrase levels using belief networks. Lee and Narayanan (2005) also use few acoustic features and combine them with linguistic information from Emotional Salience models by averaging on the decision level. They propose to calculate activations from Emotional Salience scores and calculate accuracies in a gender-dependent way. In order to compare the performance of our linguistic and acoustic models across the different databases, we calculate classification success using two evaluation scores: accuracy and the f1-measure. Given the skewed class distribution, the accuracy measure overestimates performance when the model of the majority class yields better results than the models for the non-majority classes. As reported above, such an inequality in model performance is not uncommon. We therefore focus on the f1-measurement. It is defined as the (unweighted) average of the F-measures of all classes, each of which is the harmonic mean of the precision and recall of a given class. However, in order to be comparable to other works, we also show accuracy figures.
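The f1-measurement just defined can be sketched as follows; the toy label sequences are hypothetical and merely reproduce a skewed 60/40 split like the one discussed for the corpora.

```python
def macro_f1(y_true, y_pred):
    """f1-measurement: unweighted average of the per-class F-measures,
    each the harmonic mean of that class's precision and recall."""
    f_scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f_scores) / len(f_scores)

# Constant majority-class voting on a skewed 60/40 split: accuracy
# stays at 0.60, but the f1-measurement drops because the minority
# class contributes an F-measure of zero.
y_true = ["non-anger"] * 6 + ["anger"] * 4
y_pred = ["non-anger"] * 10
print(round(macro_f1(y_true, y_pred), 3))  # 0.375
```

This illustrates why majority voting yields roughly .38–.40 f1 baselines despite ~60% accuracy: the anger class has zero recall, so its F-measure is zero and it pulls the unweighted average down.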
It should be noted that comparisons between results of studies that use different evaluation measures are thus often biased and, in some cases, may even be invalid. Publications contributed to the INTERSPEECH 2009 Emotion Challenge (Schuller et al., 2009b) give a good overview of recent developments in terms of classifier diversity and acoustic feature modeling. All participant publications are based on the same training and test corpus definitions, and the results are therefore more comparable than results from single case studies. The present study also includes the benchmark corpus, i.e. the realistic Aibo corpus as presented in Section 3. Prevailing classification algorithms applied in the benchmark are Support Vector Machines (SVM), Gaussian Mixture Models (GMM) and their combination, i.e. the GMM-SVM super-vector approach, as presented by Dumouchel et al. (2009). Dynamic GMM-HMM approaches, as widely used in speech recognition, were also proposed (Vlasenko and Wendemuth, 2009), among other methods. The best scores were generally obtained by fusing several classifiers at the decision level. All those models are based on acoustic, including prosodic, feature extraction. Polzehl et al. (2009b) proposed to also include linguistic knowledge, i.e. to model information that can be drawn from the words the speakers used. The best systems reached an average recall, the primary evaluation criterion, of approximately 70% and an accuracy of 69%, which represents only a small improvement over constant majority class voting. Overall, as most results from the different systems in the benchmark were very close, the challenge illustrated the difficulty of recognizing emotions from speech. The present study elaborates on the exploitation of both linguistic and acoustic feature modeling and the application of decision fusion.
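Decision-level (late) fusion as referred to above can be sketched as follows; the per-class confidence values and the equal weighting are hypothetical illustrations, not the settings actually used in the study.

```python
def fuse(acoustic_conf, linguistic_conf, weight=0.5):
    """Late fusion: combine per-class confidence scores of two
    independently trained classifiers by a weighted sum, then
    pick the arg-max class. The 0.5 weight is a hypothetical choice."""
    fused = {
        c: weight * acoustic_conf[c] + (1 - weight) * linguistic_conf[c]
        for c in acoustic_conf
    }
    return max(fused, key=fused.get), fused

# Hypothetical per-class confidences from the two classifiers:
acoustic = {"anger": 0.7, "non-anger": 0.3}
linguistic = {"anger": 0.4, "non-anger": 0.6}
label, scores = fuse(acoustic, linguistic)
print(label, {c: round(v, 2) for c, v in scores.items()})
# anger {'anger': 0.55, 'non-anger': 0.45}
```

Because each classifier is trained and evaluated separately, fusion only needs their output confidences, which is what allows acoustic and linguistic models of very different natures to be combined.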
In the context of human–machine interaction, analyses of emotional expressions are generally aimed at the design of Embodied Conversational Agents. This predominantly relates to applications in automated dialog systems. In addition to research on human–computer interaction, human–human interaction has also been analyzed for emotions. Although the matter of interest is identical, a closer look at the differences reveals that emotionally colored speech is more likely to be encountered in human–human interactions (Devillers et al., 2005). Intensity and form can also vary. This is due both to the user's self-restriction while interacting with a system and to the system's restricted interaction context. A comprehensive study on human–human call-center emotion analysis and machine classification can be found in Vidrascu and Devillers (2007) and Devillers et al. (2005).

3. Selected corpora

Nearly all studies on anger recognition are based on single corpora, making a generalization of the results difficult. Our aim in this study is to compare the performance of different features when trained and tested on different corpora. All of the selected databases reflect real-life conditions, i.e. they contain background noise, the recordings include cross-talk and off-talk, and the speakers are free in their choice of words and do not enunciate as clearly as trained speakers do. The German IVR database contains about 21 h of recordings from a German voice portal. Customers call in to report problems, e.g. problems with their phone connection. The callers are pre-selected by an automated voice dialog before they are passed to an agent. The data can be subdivided into 4683 dialogs, averaging 5.8 turns per dialog. For each turn, three labelers assigned one of the following labels: not angry, not sure, slightly angry, clear anger, clear rage, or marked the turn as non-applicable when encountering garbage.
The labels were mapped onto two cover classes by clustering according to a threshold over the average of all voters' labels, as described by Burkhardt et al. (2009). Following the extension of Cohen's kappa to multiple labelers by Davies and Fleiss (1982), we obtain a value of κ = 0.52, which corresponds to moderate inter-labeler agreement (Steidl et al., 2005). Our final experimental set contains 1951 anger turns and 2804 non-anger turns, corresponding approximately to a 40/60 split of the anger/non-anger distribution. The average turn length after removing initial and final pauses is 1.8 s. A more detailed description of the corpus can be found in Burkhardt et al. (2009). The English IVR database originates from a US-American portal designed to solve Internet-related problems jointly with the caller. It helps customers to recover Internet connections, reset lost passwords, cancel appointments with service employees or reset lost e-mail passwords. If the system is unable to help the customer, the call is escalated to a human operator. Three labelers divided the corpus into angry, annoyed and non-angry utterances. The final label was determined by majority voting, resulting in 90.2% neutral, 5.1% garbage, 3.4% annoyed and 0.7% angry utterances. 0.6% of the samples in the corpus were eliminated because all three raters had different opinions. While the number of angry and annoyed utterances seems very low, 429 calls (i.e. 22.4% of all dialogs) contained annoyed or angry utterances. In order to be able to compare the results for both corpora, we matched the conditions of the English database to those of the German database, i.e. we merged annoyed and angry into angry and created test and training sets according to the 40/60 split. The resulting set consists of 1560 non-anger and 1012 anger turns. The inter-labeler agreement is κ = 0.63, which also represents moderate agreement.
The average turn length after eliminating initial and final pauses is approximately 0.8 s. A more detailed description of the corpus can be found in Schmitt et al. (2010). The German WoZ AIBO database consists of children interacting with the AIBO robot dog. 51 children (aged 10–13) were recorded in a Wizard-of-Oz scenario. The children were given the task of navigating the robot through a certain course of actions using voice commands. When the robot reacted disobediently, it provoked emotional reactions from the children. The data amounts to 9.2 h of 16 bit/16 kHz speech recordings in total. Five labelers annotated the utterances with respect to 10 emotion-related target classes, which were eventually mapped to a binary division between negative (NEG) utterances, subsuming the touchy, angry, reprimanding and emphatic labels, and non-negative (IDL) utterances, subsuming all other classes, as described in Steidl et al. (2005). Recordings were split into chunks by syntactic-prosodic criteria. For the present experiments we chose a subset of 26 children, which results in 3358 NEG and 6601 IDL chunks, corresponding to a 33/66 split. The inter-labeler agreement is κ = 0.56. A more detailed description of the corpus can be found in Steidl (2009). Details of all three corpora are listed in Table 1. While the IVR databases contain different degrees of anger expression in the anger class, the WoZ database also subsumes other emotion-related states. Thus, more diverse patterns in the WoZ anger class can be expected. Further, all samples from the selected databases were presented to the labelers chronologically and independently. This way, the history of a turn as part of a dialog course was known to the labelers, i.e. the label decision takes this context into account. The labelers of the IVR databases were familiar with the respective voice portals and with linguistic emotion theory. The labelers of the WoZ database were advanced students of linguistics.
When rating the turns or chunks, the labelers processed acoustic and linguistic information simultaneously, i.e. all stimuli were presented in audible, not written, form. In order to facilitate formal comparisons, we will refer to the NEG and IDL classes in the WoZ database as the anger and non-anger classes and consider the given chunks as corresponding to turns.
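The inter-labeler agreement values quoted for the three corpora (κ = 0.52, 0.63, 0.56) follow Cohen's kappa extended to multiple labelers. A minimal sketch of such a chance-corrected agreement measure, here in the Fleiss-style multi-rater form with hypothetical rating counts, is:

```python
def multi_rater_kappa(ratings):
    """ratings: one row per item with per-category rater counts,
    e.g. [2, 1] means two raters chose class 0 and one chose class 1.
    kappa = (observed agreement - chance agreement) / (1 - chance)."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Observed agreement: average fraction of agreeing rater pairs per item.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Chance agreement from the marginal category proportions.
    totals = [sum(row[j] for row in ratings) for j in range(n_cats)]
    p_exp = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_obs - p_exp) / (1 - p_exp)

# Three labelers rating six turns as anger vs. non-anger
# (hypothetical counts, not taken from the corpora):
ratings = [[3, 0], [2, 1], [0, 3], [1, 2], [3, 0], [0, 3]]
print(round(multi_rater_kappa(ratings), 2))  # 0.56
```

Values in this range are conventionally read as moderate agreement, which matches the interpretation given for all three corpora above.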