The surprising power of statistical learning: When fragment knowledge leads to false memories of unheard words
|Article Code||Publication Year||English Article||Persian Translation||Word Count|
|32900||2009||16-page PDF||Available to order||14,655 words|
Publisher : Elsevier - Science Direct
Journal : Journal of Memory and Language, Volume 60, Issue 3, April 2009, Pages 351–367
Word-segmentation, that is, the extraction of words from fluent speech, is one of the first problems language learners have to master. It is generally believed that statistical processes, in particular those tracking “transitional probabilities” (TPs), are important to word-segmentation. However, there is evidence that word forms are stored in memory formats differing from those that can be constructed from TPs, i.e. in terms of the positions of phonemes and syllables within words. In line with this view, we show that TP-based processes leave learners no more familiar with items heard 600 times than with “phantom-words” not heard at all if the phantom-words have the same statistical structure as the occurring items. Moreover, participants are more familiar with phantom-words than with frequent syllable combinations. In contrast, minimal prosody-like perceptual cues allow learners to recognize actual items. TPs may well signal co-occurring syllables; this, however, does not seem to lead to the extraction of word-like units. We review other, in particular prosodic, cues to word-boundaries which may allow the construction of positional memories while not requiring language-specific knowledge, and suggest that their contributions to word-segmentation need to be reassessed.
Speech comes as a continuous signal, with no reliable cues to signal word boundaries. Thus learners have not only to map the words of their native language to their meanings (which is in itself a difficult problem), but first they have to identify the sound stretches corresponding to words. They therefore need mechanisms that allow them to memorize the phonological forms of the words they encounter in fluent speech. Here we ask what kinds of memory mechanisms they can employ for this purpose. It is generally accepted that statistical computations are well suited for segmenting words from fluent speech, and thus for memorizing phonological word-candidates (e.g., Aslin et al., 1998, Cairns et al., 1997, Elman, 1990, Goodsitt et al., 1993, Hayes and Clark, 1970, Saffran, 2001b, Saffran et al., 1996, Saffran et al., 1996 and Swingley, 2005). However, as reviewed below in more detail, there is evidence, in particular from speech errors, that memory for words in fact appeals to different kinds of memory mechanisms, namely those encoding the positions of phonemes or syllables within words. We thus ask whether learners extract word-like units from fluent speech when just the aforementioned statistical cues are given, or whether they require other, possibly prosodic, cues that allow them to construct positional memories. Specifically, we presented participants with continuous speech streams containing statistically defined “words”. These words were chosen such that there were statistically matched “phantom-words” that, despite having the same statistical structure as words, never occurred in the speech streams. If statistical cues lead to the extraction of words from fluent speech, participants should know that they have encountered words but not phantom-words during the speech streams. In contrast, if memory for words is positional, participants should fail to prefer words to phantom-words when only statistical information is given.
Rather, such a preference should arise only once cues are available that lead to the construction of positional memories.

Evidence for co-occurrence statistics as cues to word boundaries

Once they reach a certain age, learners can use many different cues to predict word boundaries (e.g., Bortfeld et al., 2005, Cutler and Norris, 1988, Dahan and Brent, 1999, Jusczyk et al., 1993, Mattys and Jusczyk, 2001, Shukla et al., 2007, Suomi et al., 1997, Thiessen and Saffran, 2003 and Vroomen et al., 1998). However, many of these cues are language-specific, and thus have to be learned. For instance, if learners assume that strong syllables are word-initial, they will be right in Hungarian but wrong in French (where strong syllables are word-final), and to learn where stress falls in a word, they have to know the words in the first place. Hence, at least initially, language learners need to use cues to word-boundaries that do not require any language-specific knowledge. Co-occurrence statistics such as transitional probabilities (TPs) among syllables are one such cue that is particularly well-attested. These statistics indicate how likely it is that two syllables will follow each other. More formally, TPs are conditional probabilities of encountering a syllable after having encountered another syllable. Conditional probabilities like P(σᵢ₊₁ = pet | σᵢ = trum) (in the word trumpet) are high within words, and low between words (σ denotes a syllable in a speech stream). Dips in TPs may give cues to word boundaries, while high-TP transitions may indicate that words continue. That is, learners may postulate word boundaries between syllables that rarely follow each other. Saffran and collaborators (e.g., Aslin et al., 1998 and Saffran et al., 1996) have shown that even young infants can deploy such statistical computations on continuous speech streams.
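The TP computation and dip-based boundary placement just described can be sketched in a few lines. This is a toy illustration with made-up syllables and a simple "local minimum" boundary rule, not the procedure used in any of the cited studies:

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Forward TPs: P(next | current) = count(current, next) / count(current)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

def segment_at_dips(syllables, tps):
    """Posit a word boundary wherever a TP is a local minimum (a 'dip')."""
    pair_tps = [tps[(a, b)] for a, b in zip(syllables, syllables[1:])]
    words, current = [], [syllables[0]]
    for i, syll in enumerate(syllables[1:]):
        prev_tp = pair_tps[i - 1] if i > 0 else 1.0
        next_tp = pair_tps[i + 1] if i + 1 < len(pair_tps) else 1.0
        if pair_tps[i] < prev_tp and pair_tps[i] < next_tp:
            words.append("".join(current))  # dip found: close the current word
            current = []
        current.append(syll)
    words.append("".join(current))
    return words
```

On a stream concatenated from three trisyllabic nonsense words, within-word TPs are 1.0 while boundary TPs stay well below 1.0, so the dips recover exactly the three words.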
After familiarization with speech streams in which dips in TPs were the only cue to word boundaries, 8-month-old infants were more familiar with items delimited by TP dips than with items that straddled such dips. Even more impressively, after such a familiarization, infants recognized the items delimited by dips in TPs in new English sentences pronounced by a new speaker (Saffran, 2001b), suggesting that TP-based segmentation procedures may lead infants to extract word-like units. Results such as these have led to the widespread agreement that co-occurrence statistics are important for segmenting words from speech. Though not thought to be the only cues used for word-segmentation, they are thought to play a particularly prominent role because, unlike other cues, they can be used by infants without any knowledge of the properties of their native language (e.g., Thiessen & Saffran, 2003). Moreover, similar computations have been observed with other auditory and visual stimuli (Fiser and Aslin, 2002, Saffran et al., 1999 and Turk-Browne et al., 2005), and with other mammals (Hauser et al., 2001 and Toro and Trobalón, 2005). Such computations may thus be domain- and species-general, stressing again the potential importance of such processes for a wide array of cognitive learning situations. Accordingly, some authors have proposed that these processes may be crucial not only for word-learning but also for other, more grammatical aspects of language acquisition (Bates and Elman, 1996, Saffran, 2001a and Thompson and Newport, 2007). Surprisingly, however, there is no evidence that TP-based computations lead to the extraction of word-candidates. The experiments above have provided numerous demonstrations that participants are more familiar with items with stronger TPs than with items with weaker TPs. This, however, does not imply that the items with stronger TPs are represented as actual word-like units, or even that they have been extracted.
For example, one may well find that a piece of cheese is more associated with a glass of wine than with a glass of beer, but this does not imply that the wine/cheese combination is represented as a unit for parsing the visual scene. Likewise, choosing items with stronger TPs (where the syllables have stronger associations) over items with weaker TPs does not imply that the items with stronger TPs have been extracted as perceptual units either. The distinction between a preference for high-TP items and representing these items as perceptual units is well illustrated in Turk-Browne and Scholl's (2009) studies of visual statistical learning. In these experiments, participants saw a continuous sequence of shapes. This sequence was composed of a concatenation of three-shape items (just as the experiments reviewed above used concatenations of three-syllable nonsense words). Following such a familiarization, participants were as good at discriminating high-TP items from low-TP items when the items were played forward (that is, in the temporal order in which they had been seen during familiarization) as when they were played backwards. If a preference for high-TP items implied that these items had been extracted and memorized, one would have to conclude that participants had also extracted the backwards items, although they had never seen them. It thus seems that a preference for high-TP items does not imply that these items have been memorized. There are also other reasons to doubt that TPs may play an important role in word-segmentation. One reason is that computational studies using TPs (or related statistics) for segmenting realistic corpora of child-directed speech have encountered mixed success at best (e.g., Swingley, 2005 and Yang, 2004). At a minimum, TPs thus have to be complemented with other cues. This seems highly plausible, given that one would certainly not expect a single cue to solve a highly complex problem such as speech-segmentation.
While the poor performance of word-segmentation mechanisms based on TPs can be improved by the inclusion of other cues, there is a second, more fundamental, reason for doubting that TPs play an important role in word-segmentation. This reason is related to the kinds of representations that are formed of acoustic word-forms. Presumably, the purpose of word-segmentation is to store phonological word-candidates in long-term memory. As these are essentially sound sequences (or sequences of articulatory gestures according to a direct realist perspective), it is reasonable to ask whether research on sequential memory can constrain the kinds of cues that can be used for word-segmentation. This issue is addressed in the next section.

Memory mechanisms for acoustic word forms

Research on sequential memory has revealed (at least) two kinds of mechanisms for remembering sequences (for a review, see e.g., Henson, 1998). One mechanism is referred to as “chaining memory.” When memorizing the sequence ABCD using such a mechanism, one would learn that A goes to B, B to C, and C to D. In other words, this mechanism is fundamentally similar to TPs. There is another mechanism, however, that appeals to the sequential positions of items. For example, people often remember the first and the last elements of a sequence, but not the intervening items. Chaining memories do not easily account for such results, because the “chain” is broken in the middle of the sequence. Positional mechanisms, in contrast, readily account for such results: people may memorize the items that occurred in the first and the last positions without remembering items in intervening positions. These (and, in fact, many other) results are thus readily explained if a distinction between positional and chaining memories is assumed (e.g., Conrad, 1960, Henson, 1998, Henson, 1999, Hicks et al., 1966, Ng and Maybery, 2002 and Schulz, 1955). This distinction has also been observed in artificial grammar learning experiments.
In such experiments, TPs and positional regularities seem to require different kinds of cues, to have different time courses, and to break down under different conditions (Endress and Bonatti, 2007, Endress and Mehler, in press and Peña et al., 2002). In these experiments, participants were familiarized with speech streams. The streams contained both chaining and positional regularities. Following familiarization, participants had to choose between items that instantiated the chaining regularity, the positional regularity, or both. Most relevant to the current experiments, participants were sensitive to the positional regularity only when the familiarization stream contained prosody-like cues such as silences between words. TPs, in contrast, were tracked also in the absence of such cues. It thus appears that both positional and chaining memories can be learned from speech streams by independent mechanisms, but that positional memories require additional, perhaps prosodic, cues. Interestingly, a similar distinction between positional and chaining information has been proposed in artificial grammar learning experiments in the tradition developed by Miller, 1958, Reber, 1967 and Reber, 1969 (although these experiments typically use simultaneously presented letter strings rather than sequences). In such experiments, participants are exposed to consonant strings governed by a finite-state grammar, and then have to judge whether new strings are grammatical. It now seems clear that participants acquire distributional information of various kinds about the consonants, including legal bigrams (which, we would argue, correspond to chaining information; see e.g., Cleeremans and McClelland, 1991, Dienes et al., 1991, Kinder, 2000 and Kinder and Assmann, 2000) and the positions of legal letters and bigrams within the strings (which may correspond to the positional information mentioned above; see e.g.
Dienes et al., 1991, Johnstone and Shanks, 1999 and Shanks et al., 1997, but see Perruchet & Pacteau, 1990). Whilst these experiments were not necessarily optimized to distinguish chaining and positional information, it is interesting to note that a similar distinction has also been proposed in this literature.

What kinds of memory mechanisms are used for words?

There is some evidence from speech errors that word memory has at least a strong positional component. During the tip-of-the-tongue experience, for instance, people often remember the first and the last phonemes of a word, but not the middle phonemes (e.g., Brown and McNeill, 1966, Brown, 1991, Kohn et al., 1987, Koriat and Lieblich, 1974, Koriat and Lieblich, 1975, Rubin, 1975 and Tweney et al., 1975). Such observations are hard to explain if memory for words relies upon chaining memories, since such chains would be broken in the middles of words. In contrast, they follow naturally if one assumes that words rely on positional memories. Likewise, spoonerisms (that is, reversals in the order of phonemes such as in “queer old dean”, from “dear old queen”) often conserve the serial position in words and syllables of the exchanged phonemes (e.g., MacKay, 1970). Again, this would be unexpected if words were remembered by virtue of chaining memories (because positions are not encoded in such memories), but it is easily explained if word memory has a positional component. If memory for acoustic word forms is positional, cues to chaining memories such as TPs may not enable participants to extract words from fluent speech. Rather, learners may require other cues, such as those that have triggered positional computations in other artificial language learning studies (Endress & Bonatti, 2007). Here, we thus return to the original motivation for TP-based processes, and examine their potential for the first step in word-learning, namely word-segmentation.
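The contrast between chaining and positional storage can be made concrete with a toy simulation. This is only a schematic sketch of the two schemes as described above (successor links vs. position-to-item slots), not a model from the memory literature; the function and variable names are our own:

```python
def chaining_recall(chain, first, length):
    """Follow successor links (A→B, B→C, ...); recall halts at a broken link."""
    out, item = [first], first
    for _ in range(length - 1):
        item = chain.get(item)
        if item is None:  # broken chain: nothing downstream is recoverable
            break
        out.append(item)
    return out

def positional_recall(slots, length):
    """Read out whatever item is still stored at each serial position."""
    return {pos: slots[pos] for pos in range(length) if pos in slots}

seq = "ABCD"
chain = dict(zip(seq, seq[1:]))  # chaining memory: item → successor
slots = dict(enumerate(seq))     # positional memory: position → item

# Simulate loss of the middle items B and C from both stores.
lost = {"B", "C"}
chain_degraded = {a: b for a, b in chain.items() if a not in lost and b not in lost}
slots_degraded = {p: s for p, s in slots.items() if s not in lost}
```

After the middle items are lost, chaining recall from A yields only A itself (the chain is broken), whereas the positional store still returns the first and last items, mirroring the tip-of-the-tongue pattern.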
(In the following, we will use word-learning and word-segmentation interchangeably. We thus hypothesize that the role of a word-segmentation mechanism is to provide candidates for phonological word forms, but are agnostic as to how such forms may become linked to meaning.) At the very least, if TP-based learning mechanisms are used for word-learning, one would expect the output of these mechanisms (that is, presumably phonological word candidates) to make learners more familiar with items they heard frequently than with items they never heard at all. After all, a word-segmentation mechanism should learn the words contained in its input, and not some syllable combination it has never encountered at all.
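To make concrete how a phantom-word can match occurring words in TPs while never being heard, here is a toy sketch with a hypothetical four-word lexicon built in the spirit of the design described above (these are not the authors' actual materials):

```python
from collections import Counter

def transition_probs(stream):
    """Forward TPs over adjacent syllable pairs."""
    pairs = Counter(zip(stream, stream[1:]))
    firsts = Counter(stream[:-1])
    return {(a, b): n / firsts[a] for (a, b), n in pairs.items()}

def n_occurrences(stream, item):
    """How often `item` appears as a contiguous run in the stream."""
    k = len(item)
    return sum(tuple(stream[i:i + k]) == item for i in range(len(stream) - k + 1))

# Hypothetical lexicon: every within-word syllable pair has TP 0.5, so a
# "phantom-word" recombining pairs from two different words has exactly the
# same TP profile as a real word, yet never occurs in the stream.
lexicon = [("ka", "ti", "mo"), ("ka", "fe", "du"),
           ("pu", "ti", "du"), ("pu", "fe", "mo")]
stream = [syll for _ in range(10) for word in lexicon for syll in word]

tp = transition_probs(stream)
word, phantom = ("ka", "ti", "mo"), ("ka", "ti", "du")
word_profile = (tp[("ka", "ti")], tp[("ti", "mo")])    # TPs inside the word
phantom_profile = (tp[("ka", "ti")], tp[("ti", "du")]) # TPs inside the phantom
```

Here the word occurs 10 times and the phantom never, yet both have the TP profile (0.5, 0.5): a purely TP-based learner has no grounds to prefer one over the other.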