The Role of Intonation in Emotional Expression
|Article Code||Publication Year||English Article||Persian Translation||Word Count|
|37936||2005||16-page PDF||available to order||8535 words|
Publisher : Elsevier - Science Direct
Journal : Speech Communication, Volume 46, Issues 3–4, July 2005, Pages 252–267
Abstract The influence of emotions on intonation patterns (more specifically F0/pitch contours) is addressed in this article. A number of authors have claimed that specific intonation patterns reflect specific emotions, whereas others have found little evidence supporting this claim and argued that F0/pitch and other vocal aspects are continuously, rather than categorically, affected by emotions and/or emotional arousal. In this contribution, a new coding system for the assessment of F0 contours in emotion portrayals is presented. Results obtained for actor-portrayed emotional expressions show that the mean level and range of F0 in the contours vary strongly as a function of the degree of activation of the portrayed emotions. In contrast, there was comparatively little evidence for qualitatively different contour shapes for different emotions.
1. Introduction This paper examines the contribution of intonation to the vocal expression of emotions. Over the past decades, this question has been addressed by many authors from different research backgrounds and is still a matter of sustained debate. A tradition emerging from the linguistic approach to the study of intonation contours has claimed the existence of emotion-specific intonation patterns (e.g. Fonagy and Magdics, 1963). However, the evidence offered for this notion consists mostly of selected examples rather than of empirical examination of emotional speech recordings. Efforts to describe/analyze the intonation of actual emotional expressions have been limited by the use of simplified descriptors, such as measures of overall pitch level, pitch range or overall rise/fall of pitch contours. Some authors have directly questioned the existence of emotion-specific intonation patterns. Pakosz (1983), for instance, claimed that intonation carries information only about the level of emotional arousal. In this perspective, elements of the context in which the expressions are produced and/or information carried by other channels (typically facial expressions) are required to disambiguate specific emotion categories. In the following paragraphs, more details on (1) the linguistic approach to the description/analysis of intonation and (2) some results obtained on the basis of empirical analyses of emotional speech are reviewed. The limits of those approaches for the study of the intonation of emotional speech are then discussed; finally, the approach used in the study presented in this paper is outlined. 1.1. The linguistic approach to the description/analysis of intonation Various definitions of concepts such as intonation or prosody have been proposed by authors working on the analysis and description of nonverbal features of running speech. Cruttenden (1986, pp. 
2–3), proposed the following definition: “The prosody of connected speech may be analysed and described in terms of the variation of a large number of prosodic features. There are, however, three features which are most consistently used for linguistic purposes either singly or jointly. These three features are pitch, length, and loudness. […] Pitch is the prosodic feature most centrally involved in intonation and it is with this feature that I shall be principally concerned in this book.” As in the above citation, the definition of the term intonation generally includes aspects related to pitch, length and loudness; whereas, somewhat paradoxically, most authors focusing on intonation have essentially described and analyzed perceived pitch contours. Transcriptions of pitch contours were first developed to account for linguistic functions of intonation, often with a didactic purpose. A great variety of transcription systems have been proposed over time. Thirty years ago, Léon and Martin (1970, pp. 26–32) distinguished, for instance, six forms of pitch transcription, including “musical transcriptions”, “transcriptions of patterns of intonation” and “transcriptions representing levels of intonation”. More recently, different models have been proposed for the linguistic analysis and description of intonation (perceived pitch). A broad distinction can be made between tone sequence models (such as the Tones and Break Indices system, ToBI, Silverman et al., 1992)—which describe pitch as a sequence of (high/low) tones on specified targets—and superpositional models—which define the overall pitch contour as the superposition of hierarchically ordered components. The most prominent superpositional model (Fujisaki, 1988) includes two components: a phrase component (i.e. a contour defined at the phrase level) and an accent component (i.e. local excursions superposed on the phrase contour). 
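The two-component superpositional model just described can be sketched compactly. The functional forms below (a critically damped second-order phrase response and a ceiling-limited accent response) follow the standard published formulation of Fujisaki's model; the default parameter values and all function names are illustrative assumptions, not taken from the article.

```python
import math

def phrase_curve(t, alpha=2.0):
    """Phrase control response Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0."""
    return alpha**2 * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_curve(t, beta=20.0, gamma=0.9):
    """Accent control response, ceiling-limited at gamma, for t >= 0."""
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma) if t >= 0 else 0.0

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds):
    """ln F0(t) = ln Fb + sum of phrase components + sum of accent components.
    phrase_cmds: (amplitude, onset time); accent_cmds: (amplitude, onset, offset).
    Parameter defaults above are illustrative, not fitted values."""
    ln_f0 = math.log(fb)
    for amp, t0 in phrase_cmds:
        ln_f0 += amp * phrase_curve(t - t0)
    for amp, t1, t2 in accent_cmds:
        ln_f0 += amp * (accent_curve(t - t1) - accent_curve(t - t2))
    return math.exp(ln_f0)
```

A whole contour can then be generated frame by frame, e.g. `[fujisaki_f0(k / 100.0, 120.0, [(0.5, 0.0)], [(0.4, 0.2, 0.5)]) for k in range(100)]`, which makes concrete the sense in which the phrase and accent components are "hierarchically ordered": the accent excursions ride on top of the slowly varying phrase contour.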
Superpositional models make it possible to account for phenomena such as global pitch declination or anticipation effects in the production of overall pitch contours. But, independently of the number and the quality of the components included, superpositional models remain relatively abstract. With respect to actual pitch contours, one or more components need to be fixed according to a set of more or less arbitrary rules, making it possible to define the other component(s). Furthermore, relationships between those relatively abstract components of pitch contours and linguistic or paralinguistic functions are difficult to specify. In recent years, tone sequence models have been more extensively used than superpositional models for the description and analysis of linguistic intonation (perceived pitch). ToBI (Tones and Break Indices system, Silverman et al., 1992)—a pitch transcription system originally derived from the tone sequence model developed for the intonation of English by Pierrehumbert (1980)—has been adapted and used extensively for the description of perceived pitch in several languages. In this coding system, contrastive tone values (high/low) are attributed to linguistically defined targets. Relative pitch levels (tones) are allocated to accented syllables (pitch accents) and to intonative boundaries (phrasal tones, final boundary tones). Linguistic models of intonation rely, more or less explicitly, on linguistic segmentations of the speech flow, and their primary purpose is to describe linguistic functions of intonation. To our knowledge, with one exception,1 such models have not been systematically applied to the description and analysis of actual corpora of emotional expressions. Accordingly, the possibility for those models to account for variations of intonation related to emotional expressions remains largely untested. 
Nonetheless, the strongest claims supporting the existence of emotion-specific intonation patterns originate in the linguistic approach to the description of intonation contours. Various authors (e.g. Fonagy and Magdics, 1963, Halliday, 1970 and O’Connor and Arnold, 1973)—using different transcription models—proposed descriptions of pitch contours for specific emotions. The work of Fonagy and Magdics (1963) provides a good illustration. They described perceived pitch contours using sequences of tones on a musical score for various utterances corresponding to different emotional situations. The French utterance “Comme je suis heureuse de te voir! Je ne pensais pas te rencontrer!” is for instance reported to illustrate a typical joyful contour. Frick (1985) raised several objections to this kind of approach. In particular, he pointed out that the verbal content of the utterances used as examples often carries the meaning that the pitch contour is supposed to convey (as in the example borrowed from Fonagy and Magdics) and that the emotional impression a reader derives from the example is likely to arise from the implicit addition of prosodic elements unspecified in the pitch contour transcriptions provided by the author(s). Empirical studies on the contribution of intonation to the communication of emotional meaning have been largely carried out independently of the models (or transcriptions) proposed in the research tradition described above. An overview of the evidence gathered on this issue is presented in the following section. 1.2. Empirical studies on the contribution of intonation to the communication of emotion A large number of studies have investigated vocal correlates of emotional expressions (for recent reviews see Juslin and Laukka, 2003; and Scherer, 2003). A common finding in those studies is that portrayed emotions influence global descriptors of F0, such as average F0, F0 level or F0 range. 
Reviews in this field show that portrayed emotions also have an effect on other broad descriptors of intonation, in particular measures derived from the acoustic intensity contour and measures related to the relative duration of various speech segments. Comparatively few studies have attempted to describe F0 contours for different emotional expressions. In their review, Juslin and Laukka (2003) examined 104 studies concerned with the vocal communication of emotion: 77 studies reported acoustic descriptions for various expressed and/or perceived emotions, 69 studies used overall descriptors or manipulations of F0, and only 25 studies included a description of F0 contours or an attempt to influence emotional attributions through the systematic manipulation of F0 contours. When descriptions of F0 contours are provided, they often amount to a global “rising” or “falling” of the overall contour shape. According to Juslin and Laukka’s review, rising contours were reported in 6 out of 8 studies for anger expressions, in 6 out of 6 studies for fear expressions, and in 7 out of 7 studies for joy expressions. Falling contours were reported in 11 out of 11 studies for sadness expressions and in 3 out of 4 studies for tenderness expressions. This synthesis of the results described in the literature shows that studies reporting empirical results regarding contour shapes for emotional expressions are relatively scarce. It also reflects both the over-simplification and the heterogeneity of the descriptions used to characterize emotional intonation contours. The descriptions provided are often very rudimentary (such as global rise/fall), and the more specific aspects reported in different studies can hardly be compared. 
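To make the level of description at issue concrete: the aggregated descriptors discussed above reduce an entire F0 contour to a handful of numbers. A minimal Python sketch might look as follows; the function name and the crude end-versus-start "shape" heuristic are our own illustrative assumptions, not the coding used in any of the reviewed studies.

```python
def f0_descriptors(f0_contour):
    """Compute the aggregated F0 descriptors most often reported in the
    literature: mean level, range, and a global rise/fall label.
    f0_contour is a list of F0 values in Hz; unvoiced frames, here
    conventionally coded as 0, are excluded from the computation."""
    voiced = [f for f in f0_contour if f > 0]
    mean_f0 = sum(voiced) / len(voiced)
    f0_range = max(voiced) - min(voiced)
    # The global "rising"/"falling" description is often no finer than this.
    shape = "rising" if voiced[-1] > voiced[0] else "falling"
    return {"mean": mean_f0, "range": f0_range, "shape": shape}
```

The point of the sketch is its poverty: two scalars and a binary label clearly cannot distinguish qualitatively different contour shapes, which is exactly the limitation the article sets out to overcome.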
However, evidence for the participation of intonation (pitch/F0 contours) in the communication of emotional meaning can be derived from studies that have not attempted to describe specific contours for specific emotions, but primarily tried to assess the respective importance of intonation and voice quality (voice timbre) for the communication of emotional meaning. Starting in the early sixties, several authors tried to separate the contributions of voice quality and of intonation by using various signal manipulation techniques (see for example Scherer et al., 1985). The following paragraphs describe the results obtained by two studies (Ladd et al., 1985 and Lieberman and Michaels, 1962) demonstrating that isolated aspects of intonation can contribute to the communication of emotional meaning and two further studies (Scherer et al., 1984 and Uldall, 1964) indicating that intonation patterns might influence emotional or attitudinal attributions in combination with the linguistic content of the expressions. Lieberman and Michaels (1962) used expressions corresponding to eight emotional modes;2 85% of the recorded expressions were correctly identified by a group of listeners. The authors extracted the F0 contours of the original expressions and resynthesized them on a fixed vowel. The proportion of “emotional modes” correctly identified dropped in this condition, but 44% of the expressions synthesized with the copied F0 contours were still correctly identified. When the F0 contours were smoothed with a 40 ms time constant, this proportion further dropped to 38%; 100 ms smoothing reduced the recognition rate to 25%. This study indicates that F0 fluctuations can carry emotional meaning, independently of amplitude variations and voice quality. It also shows that short-time variations of F0 contours might be of importance in this process. 
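The passage above does not specify the exact smoother Lieberman and Michaels applied; as a hedged illustration, a first-order exponential smoother parameterized by a time constant is one simple way to realize the 40 ms and 100 ms conditions. Function and parameter names are assumptions.

```python
import math

def smooth_f0(f0_contour, tau_ms, frame_ms=10.0):
    """First-order (exponential) smoothing of a sampled F0 contour.
    tau_ms is the time constant; frame_ms is the sampling period of the
    contour. Larger time constants suppress faster fluctuations, such as
    the micro-perturbations ("jitter") discussed in the text."""
    alpha = 1.0 - math.exp(-frame_ms / tau_ms)
    smoothed = [f0_contour[0]]
    for x in f0_contour[1:]:
        smoothed.append(smoothed[-1] + alpha * (x - smoothed[-1]))
    return smoothed
```

With this kind of smoother, a 100 ms time constant flattens rapid F0 fluctuations far more than a 40 ms one, which is consistent with the sharper drop in recognition reported for the 100 ms condition.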
Lieberman and Michaels (1962) attribute the drop in recognition rate introduced by the smoothing of the contours to the presence of micro-perturbations (“jitter”) in some expressions. Those micro-perturbations of the F0 contours made it possible to differentiate some expressions of joy and fear, which were confused only when the resynthesized contours were smoothed. In this study, longer-term variations of the F0 contours—comparable to those that would be captured by a linguistic transcription model—also made it possible to differentiate the “emotional modes” portrayed, but to a lesser extent (only one expression out of four was still correctly categorized by the listeners in this condition). In a series of three studies using resynthesized speech, Ladd et al. (1985) assessed the effect of a combined manipulation of F0 contour shape (“uptrend” versus “downtrend”) and stepwise increase of F0 range on emotional attributions. They found that contour shape, F0 range and also voice quality (one speaker produced expressions using two different phonation modes) independently influenced emotional ratings. They also reported that the progressive increase of F0 range affected emotional intensity ratings to a greater extent than the manipulation of the contour shape. Uldall (1964) applied 16 stylized F0 contours to five utterances. She showed that the emotional meaning attributed to different contours varies depending on the sentence carrying the contour. She found for instance that the contour featuring a weak declination and a low level is rated as ‘unpleasant’, ‘authoritative’, and corresponding to a ‘weak’ emotional intensity when applied to the two types of questions and the statement used in this study. The same contour is rated as ‘unpleasant’, ‘authoritative’ and corresponding to a ‘strong’ emotional intensity when applied to the command utterance. 
Uldall identified only a few contour features that were linked to the three dimensions (valence, power, intensity) underlying the emotional ratings of the participants in this study, independently of the three sentence types carrying the contours. Likewise, Scherer et al. (1984) found that specific combinations of prosodic features (final rise or fall of F0 contours) and linguistic categories (Wh-questions or yes/no questions) influence the attributions of affect-loaded attitudes (such as “challenging”, “agreeable” or “polite”). A final fall of the F0 contour will for example be perceived as “challenging” on a yes/no question (where a final rise is expected from the syntactic structure) but not on a Wh-question. In this study as well, perceived emotional intensity was affected mainly by the continuous variation of F0 level. The authors suggested that vocal aspects covarying with emotional attributions (such as F0 level in this study) might mainly reflect and communicate the physiological arousal associated with the emotional reaction, whereas configurations of prosodic features (such as F0 contour shapes) would be used to signal specific attitudes in association with the linguistic content of the utterances. On the whole, the literature does not provide strong evidence for the existence of emotion-specific intonation patterns. Nevertheless, intonation (or more specifically F0 fluctuations) seems to be affected to some extent by the emotional state of the speakers and appears to carry information that can be used by listeners to generate inferences about the emotional state of the speakers, more or less independently of the linguistic features of the expressions. 
The notion of emotional intonation being produced and processed independently of the linguistic aspects of speech is further supported by neuropsychological studies conducted to differentiate the structures involved in the processing of linguistic intonation, on the one hand, and emotional intonation, on the other (e.g. Heilman et al., 1984, Pell, 1998, Ross, 1981 and van Lancker and Sidtis, 1992).3 Furthermore, studies investigating the prelinguistic production and perception of intonation in infants suggest that emotional meaning can be communicated through modulations of intonation before language is acquired (Fernald, 1991, Fernald, 1992, Fernald, 1993 and Papousek et al., 1991). Altogether then, despite the relative lack of empirical evidence, there are strong claims supporting the notion that F0 contours can carry emotional meaning independently of linguistic structures. The study we present in this paper introduces a new, more resolutely quantitative approach to the description of F0 fluctuations in emotional speech, including a more elaborate description of F0 contours than the aggregated F0 descriptors mostly used in empirical studies on vocal correlates of emotional expressions.
English Conclusion
3. Conclusions The results of our quantitative prosodic analysis of a large corpus of vocal emotion portrayals indicate that the level of arousal underlying portrayed emotions essentially affected the global level and range of F0 contours. Therefore, simple summaries of F0 contours—such as F0 mean or F0 range—were sufficient to account for the most important variations observed between emotion categories. However, a more detailed examination of the contours revealed specific differences for some portrayed emotions. For some emotional expressions—especially hot anger (HA anger), cold anger (LA anger), and elation (HA joy)—the second F0 excursion in the utterances tended to be larger than for other emotions—such as sadness (LA sad) or happiness (LA joy), which showed much smaller F0 excursions in the second part of the utterances. This difference could not be explained entirely by the overall difference in F0 range for those expressions. The “shape” of the contours was only slightly affected by the portrayed emotions. Contours with an “uptrend” shape (a term borrowed from Ladd et al., 1985)—i.e., contours featuring a progressive increase of F0 and maintaining a high level of F0 until the final fall—were observed for expressions of despair (HA sad) and elation (HA joy), whereas expressions of sadness (LA sad) and happiness (LA joy) showed a “downtrend” movement of F0—an early F0 peak followed by a progressive decrease until the final fall. The final fall itself might also be affected by portrayed emotions. Emotions such as hot anger (HA anger) or elation (HA joy) might result in steeper final falls than expressions of anxiety (LA fear) or happiness (LA joy). The results regarding the relative height of local F0 excursions, contour “shape,” and final fall must be considered with caution. The variation within portrayed emotions was always large in the corpus we examined and the number of expressions analyzed was relatively small. 
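The "uptrend"/"downtrend" distinction used above can be operationalized in many ways. The following Python sketch is one crude heuristic, comparing half-contour means after discarding an assumed final-fall portion; it is not the coding system actually used in the study, and the 20% final-fall cutoff is an arbitrary assumption.

```python
def contour_trend(f0_contour):
    """Crude "uptrend"/"downtrend" classification (terms borrowed from
    Ladd et al., 1985): compare the mean F0 of the first and second halves
    of the contour, excluding an assumed final fall (the last 20% of
    frames). Purely illustrative; thresholds are not from the article."""
    body = f0_contour[: max(2, int(len(f0_contour) * 0.8))]
    mid = len(body) // 2
    first, second = body[:mid], body[mid:]
    mean = lambda xs: sum(xs) / len(xs)
    return "uptrend" if mean(second) > mean(first) else "downtrend"
```

Under this heuristic, a contour that climbs and holds a high plateau before the final drop is labeled "uptrend", while an early peak followed by a gradual decline is labeled "downtrend", mirroring the despair/elation versus sadness/happiness contrast reported above.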
Consequently, those results need to be replicated before they can be generalized. The coding of F0 contours used in this study might have cancelled out potentially important aspects of the F0 contours. Specific configurations of F0 contour features and syntactic aspects (as described in Scherer et al., 1984, or Uldall, 1964), semantic aspects, or even phonetic aspects of the expressions might contribute to the expression and the communication of emotional meaning. The expressions considered in this study were free of semantic and syntactic content that might have made it possible to communicate an emotional impression in interaction with F0 contour features. Still, the local F0 excursions (“accents”) observed in those expressions were produced with large variations in their precise location in the utterances. These variations were cancelled out by the coding of the F0 contours. Therefore, the possibility remains that a more thorough examination of the position of the F0 excursions relative to the phonetic content of the utterances might make it possible to identify emotion-specific differences. On the other hand, it should be noted that, in the absence of syntactic or semantic constraints, the actors were free to choose the contour that seemed best suited to convey a particular emotional feeling. The fact that they did not systematically produce such emotion-specific contours for those short utterances seems to indicate that emotions considered independently of linguistic context do not call for contour coding other than the general level, range, and final fall parameters described earlier. As mentioned earlier, these results certainly need to be replicated, and it would probably be useful to include a number of utterances that do have linguistic structure and meaning, to compare with the kinds of quasi-speech stimuli that we have been using. It would also be beneficial to systematically record portrayals of affect bursts (Schröder, 2003). 
As mentioned earlier, one would need to agree on an intonation coding system that respects both the needs of statistical analysis and fundamental aspects of contour shape, without getting into the subtleties of the debates between schools in linguistics and phonology. Obviously, it would be useful if such a system worked mostly automatically, with hand correction. Once we have the appropriate corpus, preferably produced with actors from different cultures and language groups, we could use some of the techniques for signal masking and feature destruction that allow us to determine which aspects of a signal need to be retained to carry recognizability. The fact that, in the past, random-splicing procedures (which destroy intonation and sequential information but keep voice quality) have worked better, in the sense of preserving recognition accuracy, than content-filtering methods (which keep intonation but mask essential aspects of voice quality; Scherer et al., 1985) suggests that intonation contours (at least in terms of shape) may be less important signatures of emotions than global F0 level and variation and spectral aspects of voice quality. Finally, emotional speech synthesis should be the method of choice to systematically test the hypotheses that have been obtained by the more exploratory methods. Although the commercial interest in affect-rich multimodal interfaces has led to a multiplication of emotion synthesis studies, to our knowledge few have advanced to a satisfactory level of ecological validity. All too often, such work either is not based on hypotheses informed by earlier work, or suffers from serious methodological shortcomings (e.g., inflated recognition rates due to a limited number of categories and failure to distinguish simple discrimination from pattern recognition). 
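As a hedged sketch of the random-splicing idea mentioned above: the waveform is cut into short fixed-length segments that are reassembled in random order, destroying intonational and sequential information while preserving short-term spectral cues to voice quality. In practice, segments would be cut at zero crossings or crossfaded to avoid clicks; the segment length and function name here are assumptions.

```python
import random

def random_splice(samples, segment_len, seed=0):
    """Cut a sampled waveform into fixed-length segments and reassemble
    them in random order. The deterministic seed makes the manipulation
    reproducible across stimuli. Every sample of the original signal is
    kept; only the sequential order of the segments is destroyed."""
    segments = [samples[i:i + segment_len] for i in range(0, len(samples), segment_len)]
    rng = random.Random(seed)
    rng.shuffle(segments)
    return [s for seg in segments for s in seg]
```

Content filtering, the contrasting manipulation, would instead low-pass filter the signal so that the F0 contour survives while the higher-frequency spectral detail carrying voice quality is masked.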
One of the major problems is that engineers and phoneticians, but unfortunately also some psychologists, tend to think that emotions are easy to understand and to manipulate and that we understand them because we experience them ourselves. Nothing could be further from the truth. The vocal expression of emotion may be one of the most complex systems of communication there is, certainly much more complex than facial expression. In consequence, advances in the field should rely, much more than in the past, on close collaboration between phoneticians, speech scientists, engineers, and psychologists.