Problem detection in human–machine interactions based on facial expressions of users
Article code | Publication year | English article pages |
---|---|---|
37666 | 2005 | 17-page PDF |

Publisher: Elsevier - Science Direct
Journal: Speech Communication, Volume 45, Issue 3, March 2005, Pages 343–359
English Abstract
This paper describes research into audiovisual cues to communication problems in interactions between users and a spoken dialogue system. The study consists of two parts. First, we describe a series of three perception experiments in which subjects are offered film fragments (without any dialogue context) of speakers interacting with a spoken dialogue system. In half of these fragments, the speaker is or becomes aware of a communication problem. Subjects have to determine by forced choice which are the problematic fragments. In all three tests, subjects are capable of performing this task to some extent, but with varying levels of correct classifications. Second, we report results of an observational analysis in which we first attempt to relate the perceptual results to visual features of the stimuli presented to subjects, and second to find out which visual features actually are potential cues for error detection. Our major finding is that more problematic contexts lead to more dynamic facial expressions, in line with earlier claims that communication errors lead to marked speaker behaviour. We conclude that visual information from a user’s face is potentially beneficial for problem detection.
English Introduction
1. Introduction

The goal of the investigation presented in this article is to explore to what extent it could be beneficial to use features of a user’s facial expression to detect communication problems in his or her interactions with a spoken dialogue system. It is well known that managing communication problems in spoken human–computer interaction is difficult. One key issue is that spoken dialogue systems are not good at determining whether the communication is going well or whether communication problems have arisen (e.g., due to poor speech recognition or false default assumptions). The occurrence of problems negatively affects user satisfaction (Walker et al., 1998), but also has an impact on the way users communicate with the system in subsequent turns, both in terms of their language and their speech. For instance, when users notice that a system has difficulties handling their prior spoken input, they tend to produce utterances with marked linguistic features (e.g., longer sentences, marked word order, more repeated information) (Krahmer et al., 2002). In addition, human speakers respond in a different vocal style to problematic system prompts than to unproblematic ones: when speech recognition errors occur, they tend to correct these in a hyperarticulate manner (which may be characterized as longer, louder and higher). This generally leads to worse recognition results (‘spiral errors’), since standard speech recognizers are trained on normal, non-hyperarticulated speech (Oviatt et al., 1998, Levow, 2002 and Hirschberg et al., 2004), although more recent studies suggest that systems are becoming less vulnerable to hyperarticulation (Goldberg et al., 2003). In a similar vein, when speakers respond to a problematic yes–no question, their denials (“no”) share many of the properties typical of hyperarticulate speech, in that they are longer, louder and higher than unproblematic negations (Krahmer et al., 2002). In other words, one could state that dialogue problems lead to a marked interaction style of users, which manifests itself partly in a set of prosodic correlates.

Based on these observations, it has been suggested that monitoring prosodic aspects of a speaker’s utterances may be useful for problem detection in spoken dialogue systems. It has indeed been found that using automatically extracted prosodic features helps for problem detection (e.g., Hirschberg et al., 2004 and Lendvai et al., 2002). While this has led to some improvements, the extent to which prosody is beneficial differs across studies. Moreover, in all these studies a sizeable number of problems remains undetected. In general, it appears that the detection of errors improves if prosodic features are used in combination with other features already available to the system, such as more traditional acoustic or semantic confidence scores, knowledge about the dialogue history, or the grammar being used in a particular dialogue state (Litman et al., 2001, Bouwman et al., 1999, Hirschberg et al., 2001, Danieli, 1996 and Ahrenberg et al., 1993).
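As an illustration of how such heterogeneous features could feed a per-turn problem detector, the following is a minimal sketch using a logistic regression; the feature names, values and classifier choice are purely hypothetical and are not taken from the studies cited above.

```python
# Hypothetical per-turn problem detector combining prosodic features with
# confidence and dialogue-history features; all names and values are
# illustrative, not from the cited studies.
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per user turn:
# [mean_f0_hz, rms_energy, duration_s, asr_confidence, prior_errors]
X_train = np.array([
    [210.0, 0.62, 1.9, 0.41, 2],   # turn following a misrecognition
    [180.0, 0.48, 1.1, 0.87, 0],   # unproblematic turn
    [225.0, 0.70, 2.3, 0.35, 1],
    [175.0, 0.45, 0.9, 0.91, 0],
])
y_train = np.array([1, 0, 1, 0])   # 1 = communication problem

detector = LogisticRegression().fit(X_train, y_train)

# Estimated probability that a new turn signals a communication problem.
new_turn = np.array([[215.0, 0.66, 2.0, 0.40, 1]])
print(detector.predict_proba(new_turn)[0, 1])
```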
The current paper explores whether it is potentially useful to include yet another set of features, i.e., visual features from the face of the user who is interacting with the computer. Indeed, it makes sense to assume that a speaker’s facial expressions may signal communication problems as well. One obvious reason is that hyperarticulation is likely to be detectable from more exaggerated movements of the articulators. Erickson et al. (1998) found that speakers’ repeated attempts to correct another person are highly correlated with more pronounced jaw movements, which are likely to be clearly visible to their addressees (see also Gagné et al., 2004, or Dohen et al., 2003, on related visual correlates of contrastive stress). In addition, in line with the earlier observation that speakers adapt their language and speech to a more marked interaction style after communication errors, there is evidence that speakers also change their facial expressions in problematic dialogue situations. Swerts et al. (2003) applied the so-called Feeling-of-Knowing paradigm (Hart, 1965, Smith and Clark, 1993 and Brennan and Williams, 1995) to investigate how speakers cue that they are certain or rather uncertain about the response they give to a general factual question. It was found that it is indeed often clearly visible when people are unsure about their answer, in that such speakers show many more deviations from “normal” facial expressions (e.g., more eyebrow movements and gaze acts). Given such observations, it is worthwhile to investigate whether speakers also exhibit special visual expressions when they are confronted with communication problems in spoken human–machine interactions.

This research fits into a recent interest in integrating functional aspects of facial expressions in multimodal systems, with the ultimate goal of making the interaction with such systems more natural and efficient. Some systems already supplement their interface with an embodied conversational agent (ECA), for instance in the form of a synthetic head, to support the communication process with users. Visual cues of such ECAs appear to be functionally relevant in more than one respect. They make the speech more intelligible (e.g., Agelfors et al., 1998; see also Jordan and Sergeant, 2000), and can give clues about the status of the information a system sends to the user, for instance to signal the difference between negative and positive feedback responses from a system (Granström et al., 2002). An additional advantage of using a synthetic face is that it can give silent cues about the internal state of the system, e.g., to signal that it is paying attention to the user or that it is looking for information, following the general best practice of making a system’s behavior and reasoning clear to the user (Sengers, 1999).

The perspective in the current paper is different from that of such earlier studies in that it does not concentrate on multimodal features of system utterances, but rather deals with analyses of the users’ facial expressions. The exploitation of users’ auditory and visual cues is becoming a real possibility in advanced multimodal spoken dialogue systems (see e.g., Benoit et al., 2000), which combine speech recognition with facial tracking. Earlier work in bimodal speech recognition has shown that using automatic lipreading in combination with more standard automatic speech recognition techniques leads to a reduction of the number of recognition errors (see e.g., Petajan, 1985).
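To make the idea of combining the two recognition streams concrete, here is a minimal late-fusion sketch in which acoustic and visual scores for competing hypotheses are merged with a weighted sum. The hypotheses, scores and weights are invented for illustration and do not come from the cited work.

```python
# Hypothetical late fusion of an acoustic recognizer and a lip-reading
# module: per-hypothesis log scores are combined with a weighted sum.
# All hypotheses, scores and weights are illustrative.
acoustic_scores = {"amsterdam": -12.1, "rotterdam": -12.4}
visual_scores = {"amsterdam": -8.9, "rotterdam": -7.2}

w_audio, w_visual = 0.7, 0.3
fused = {
    hyp: w_audio * acoustic_scores[hyp] + w_visual * visual_scores[hyp]
    for hyp in acoustic_scores
}

# Pick the hypothesis with the highest combined score; here the visual
# evidence tips the decision even though acoustics slightly favour the other.
best = max(fused, key=fused.get)
print(best, round(fused[best], 2))
```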
In addition, comparable to the silent visual cues from a system, facial expressions of a user may indicate communication problems even when the person is not speaking, for instance when he or she becomes aware of a communication problem during the system’s feedback. Such cues clearly have added value compared to the auditory and linguistic cues to errors used before, because they would enable very early detection of problems. Obviously, this would be useful from a system’s point of view, since the sooner a problem can be detected, the earlier a repair strategy can be started (e.g., a re-ranking of recognition hypotheses or a modification of the dialogue strategy).

Therefore, the general goal of the research described in this paper is to investigate the information value of a speaker’s visual cues for problem detection in spoken human–machine interaction. The study consists of two parts. First, we describe three perception experiments in which subjects were shown selected recordings of Dutch speakers engaged in a telephone conversation with a train timetable information system. The recordings constituted minimal pairs: they were very comparable in terms of their words and syntactic structure, but differed in that they were excised from a context that was either problematic or not. The recordings were presented without the original context to subjects, who had to determine whether the preceding speaker utterance had led to a communication problem or not. The first experiment focuses on subjects’ responses during verification questions of the system (i.e., when subjects listen in silence), which verify either correct or misrecognized information. The second experiment concentrates on speakers uttering “no”, either in response to a problematic or an unproblematic yes–no question from the system. The third experiment, finally, is devoted to speakers uttering a destination station (filling a slot), either for the first time (no problem) or as a correction (following a recognition error). The descriptions of these three studies are preceded by an overview of the general experimental procedure.

The second part of the paper describes the results of some observational analyses. We attempt to find visual correlates of problematic situations that could have functioned as cues to subjects in the different perception studies described in part 1. Our major finding is that more problematic contexts lead to more dynamic facial expressions, in line with earlier claims that communication errors lead to marked speaker behaviour. We conclude our paper with a general discussion and some perspectives on further research.
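As a side note on analysis, performance in a two-alternative forced-choice task of the kind summarized above is typically compared against the 50% chance level, for example with a binomial test. A minimal sketch with made-up counts (not the data reported in this paper):

```python
# Hypothetical check of forced-choice classification against chance (50%).
# The counts are illustrative, not taken from the experiments reported here.
from scipy.stats import binomtest

n_correct = 312   # fragments classified correctly (made-up number)
n_total = 480     # total forced-choice judgements (made-up number)

result = binomtest(n_correct, n_total, p=0.5, alternative="greater")
print(f"accuracy = {n_correct / n_total:.2f}, p = {result.pvalue:.3g}")
```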