Unsupervised and supervised exploitation of semantic domains in lexical disambiguation
Article code | Publication year | Length of English article |
---|---|---|
20258 | 2004 | 25 pages (PDF) |
Publisher: Elsevier - ScienceDirect
Journal: Computer Speech & Language, Volume 18, Issue 3, July 2004, Pages 275–299
Abstract
Domains are common areas of human discussion, such as economics, politics, law, science, etc., which are at the basis of lexical coherence. This paper explores the dual role of domains in word sense disambiguation (WSD). On the one hand, domain information provides generalized features at the paradigmatic level that are useful for discriminating among word senses. On the other hand, domain distinctions constitute a useful level of coarse-grained sense distinctions, which lends itself to more accurate disambiguation with less knowledge. In this paper we extend and ground the modeling of domains and the exploitation of WordNet Domains, an extension of WordNet in which each synset is labeled with domain information. We propose a novel unsupervised probabilistic method for the critical step of estimating domain relevance for contexts, and suggest using it within unsupervised domain-driven disambiguation of word senses, as well as within a traditional supervised approach. The paper presents empirical assessments of the potential use of domains in WSD across a wide range of comparative settings, supervised and unsupervised. Following the dual role of domains, we report experiments that evaluate both the extent to which domain information provides effective features for WSD and the accuracy obtained by WSD at domain-level sense granularity. Furthermore, we demonstrate the potential for either avoiding or minimizing manual annotation thanks to the generalized level of information provided by domains.
Introduction
Domains are common areas of human discussion, such as economics, politics, law, science, etc. (see Table 1), which demonstrate lexical coherence. A substantial portion of the language terminology may be characterized as domain words whose meaning refers to concepts belonging to specific domains, and which often occur in texts that discuss the corresponding domain.

Table 1. Domain distribution over WordNet synsets

Domain | Synsets | Domain | Synsets | Domain | Synsets |
---|---|---|---|---|---|
Factotum | 36820 | Biology | 21281 | Earth | 4637 |
Psychology | 3405 | Architecture | 3394 | Medicine | 3271 |
Economy | 3039 | Alimentation | 2998 | Administration | 2975 |
Chemistry | 2472 | Transport | 2443 | Art | 2365 |
Physics | 2225 | Sport | 2105 | Religion | 2055 |
Linguistics | 1771 | Military | 1491 | Law | 1340 |
History | 1264 | Industry | 1103 | Politics | 1033 |
Play | 1009 | Anthropology | 963 | Fashion | 937 |
Mathematics | 861 | Literature | 822 | Engineering | 746 |
Sociology | 679 | Commerce | 637 | Pedagogy | 612 |
Publishing | 532 | Tourism | 511 | Computer_Science | 509 |
Telecommunication | 493 | Astronomy | 477 | Philosophy | 381 |
Agriculture | 334 | Sexuality | 272 | Body_Care | 185 |
Artisanship | 149 | Archaeology | 141 | Veterinary | 92 |
Astrology | 90 | | | | |

Domains have been used with a dual role in linguistic description. One role is characterizing word senses, typically as the semantic field of a word sense in a dictionary or lexicon (e.g. crane has senses in the domains of Zoology and Construction). The WordNet Domains lexical resource is an extension of WordNet which provides such domain labels for all synsets (Magnini and Cavaglià, 2000). A second role is to characterize texts, typically as a generic level of text categorization (e.g. for classifying news and articles) (Sebastiani, 2002).

From the perspective of word sense disambiguation, domains may be considered from two points of view. First, a major portion of the information required for sense disambiguation corresponds to paradigmatic domain information. Many of the features that contribute to disambiguation identify the domains that characterize a particular sense or subset of senses.
For example, economics terms provide characteristic features for the financial senses of words like bank and interest, while legal terms characterize the judicial sense of sentence and court. Common supervised WSD methods capture such domain-related distinctions separately for each sense of each word, and may require relatively many training examples in order to obtain sufficiently many features of this kind for each sense (Yarowsky and Florian, 2002). However, domains represent an independent linguistic notion of discourse, which does not depend on a specific word sense. Therefore, it is beneficial to model a relatively small number of domains directly, as a generalized notion, and then use the same generalized information for many instances of the WSD task. A major goal of this paper is to study the extent to which domain information can contribute to WSD in this way.

Second, domains may provide a useful coarse-grained level of sense distinctions. Many applications do not benefit from fine-grained sense distinctions (such as WordNet synsets), which are often impossible to detect by WSD within practical applications (e.g. some verbs in WordNet have more than 40 sense distinctions). However, applications such as information retrieval (Gonzalo et al., 1998) and user modeling for news web sites (Magnini and Strapparava, 2001) can benefit from sense distinctions at the domain level, which are substantially easier to establish in practical WSD.

The work by Magnini et al. (2002) presented initial results in utilizing WordNet Domains information for WSD (at the WordNet synset sense granularity level). In this paper we substantially extend and ground domain modeling and the utilization of WordNet Domains in several ways. At the algorithmic level, we present a novel unsupervised method for estimating domain relevance for word contexts, which is grounded in a probabilistic framework utilizing Gaussian Mixtures and EM estimation.
The unsupervised estimation framework, which is very attractive for WSD, is made possible by the dual nature of domains as both sense and text descriptors. This enables us to use only the available lexical resource of WordNet Domains without requiring annotated examples.

The focus of this paper is not the absolute performance of a particular new WSD system, but rather an investigation and assessment of the potential utilization of domains in WSD across a wide range of comparative settings, both supervised and unsupervised. In particular we report experiments that evaluate:

• the extent to which domain information provides effective features for WSD;
• the accuracy that can be obtained by WSD at domain-level sense granularity;
• the potential for avoiding or minimizing manual annotation thanks to the generalized information provided by domains.

The paper is structured as follows. Section 2 describes the notion of semantic domains and some prior work. Section 3 presents the lexical resource WordNet Domains. Section 4 lays the grounds for the computational modeling of domains using WordNet Domains. In particular, the notion of domain relevance for both the textual and lexical levels is introduced. Section 5 presents the computational methods by which semantic domains can be exploited within WSD. Section 6 presents the experiments and evaluation, and Section 7 offers concluding remarks.
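
The introduction mentions estimating domain relevance with Gaussian Mixtures and EM, but this excerpt does not give the actual formulation. The following is only a minimal sketch of the general idea, under stated assumptions: each text has a single numeric score for one domain, and the scores are modeled as a mixture of a "not relevant" and a "relevant" Gaussian fitted by EM. The function names, the initialization, and the toy scores are hypothetical and are not taken from the paper.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_gmm(scores, iterations=50, min_var=1e-6):
    """Fit a 1-D two-component Gaussian mixture with EM.

    Returns (weights, means, variances); index 0 is the lower-mean
    ("not relevant") component, index 1 the higher-mean ("relevant") one.
    """
    n = len(scores)
    overall_mean = sum(scores) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in scores) / n, min_var)
    # Assumed initialization: place the two means at the extremes of the data.
    means = [min(scores), max(scores)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # E-step: posterior responsibility of each component for each score.
        resp = []
        for x in scores:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, scores)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, scores)) / nk
            variances[k] = max(var_k, min_var)  # floor avoids a collapsed variance

    order = sorted(range(2), key=lambda k: means[k])
    return (
        [weights[k] for k in order],
        [means[k] for k in order],
        [variances[k] for k in order],
    )


def relevance_posterior(x, params):
    """Posterior probability that score x belongs to the higher-mean component."""
    weights, means, variances = params
    joint = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
    total = sum(joint)
    return joint[1] / total if total > 0 else 0.5


if __name__ == "__main__":
    # Hypothetical per-text scores for one domain: mostly low, a few high.
    toy_scores = [0.01, 0.02, 0.03, 0.02, 0.01, 0.90, 0.95, 0.85]
    params = fit_two_component_gmm(toy_scores)
    print(relevance_posterior(0.90, params), relevance_posterior(0.02, params))
```

In this sketch a text would count as relevant to the domain when the posterior of the higher-mean component is large; where to put that threshold is a design choice that the excerpt does not specify.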
Conclusion
Domain analysis of text and lexicon can be utilized across different NLP tasks, making it possible to unify certain representations and algorithms based on the same methodology and resources. Moreover, domains may constitute a bridge between the lexicon and the text, allowing a deeper understanding of different lexical semantic phenomena at the paradigmatic level, such as ambiguity and lexical coherence. In this paper we have shown how domain information can be deduced in a principled unsupervised probabilistic manner based on the information available in WordNet Domains, yielding an effective generalized resource for word sense modeling.

In particular we considered two issues. First, domains provide informative generalized features for paradigmatic information that improve accuracy or, correspondingly, reduce the number of annotated examples needed to obtain a given level of performance. This claim was assessed for WSD at both sense (synset) and domain levels of granularity. Second, domain-level senses provide an appealing coarse granularity for WSD. Disambiguation at the domain level is substantially more accurate, while the accuracy of WSD at the fine-grained sense level may not be good enough for various applications. Disambiguation at domain granularity is accurate and can be sufficiently practical using only domain information with the unsupervised DDD method alone, even with no training examples.

In future work we will consider combining domain information with syntagmatic features in more sophisticated ways, relying on additional information from available lexical resources such as WordNet. Our eventual goal is to create a richer and largely unsupervised WSD solution that might follow some of the underlying principles presented in this paper. We also plan to refine and enrich the domain annotation in WordNet Domains by developing corpus-based techniques for automatic acquisition of domain labels for synsets.
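
To make the idea of disambiguation at domain granularity concrete, here is a small illustrative sketch. It is not the DDD method itself, whose details are not given in this excerpt: it swaps in a plain keyword-count estimate of domain relevance for the context and picks the sense whose domain label scores highest. The sense identifiers, the word lists, and the function names are hypothetical toy data; only the crane example with the Zoology and Construction domains comes from the text.

```python
from collections import Counter

# Hypothetical toy lexicon: each sense of a word carries one domain label,
# in the spirit of the domain labels that WordNet Domains attaches to synsets.
SENSE_DOMAINS = {
    "crane": {"crane#bird": "Zoology", "crane#machine": "Construction"},
}

# Hypothetical domain word lists used to score a context (assumption:
# a real system would derive these from a lexical resource, not by hand).
DOMAIN_WORDS = {
    "Zoology": {"bird", "nest", "wings", "migrate", "feathers"},
    "Construction": {"building", "site", "steel", "lift", "concrete"},
}


def domain_relevance(context_tokens):
    """Share of domain-word hits per domain in the context (simple counts)."""
    counts = Counter()
    for token in context_tokens:
        for domain, words in DOMAIN_WORDS.items():
            if token in words:
                counts[domain] += 1
    total = sum(counts.values())
    if total == 0:
        return {domain: 0.0 for domain in DOMAIN_WORDS}
    return {domain: counts[domain] / total for domain in DOMAIN_WORDS}


def disambiguate(word, context_tokens):
    """Return (sense, domain) for the sense whose domain is most relevant.

    Ties, including a context with no domain words at all, fall back to
    the first listed sense.
    """
    relevance = domain_relevance(context_tokens)
    senses = SENSE_DOMAINS[word]
    best = max(senses, key=lambda sense: relevance.get(senses[sense], 0.0))
    return best, senses[best]


if __name__ == "__main__":
    context = "the crane lifted steel beams at the building site".split()
    print(disambiguate("crane", context))
```

Because the decision depends only on domain labels and context words, no sense-annotated training examples are involved, which is the property the conclusion highlights for domain-level disambiguation.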