Term extraction from sparse, ungrammatical domain-specific documents
|Article code||Year of publication||English article||Word count|
|21081||2013||11-page PDF||8763 words|
Publisher : Elsevier - Science Direct
Journal : Expert Systems with Applications, Volume 40, Issue 7, 1 June 2013, Pages 2530–2540
Existing term extraction (TE) systems have predominantly targeted large, well-written document collections, which provide reliable statistical and linguistic evidence to support term extraction. In this article, we address the term extraction challenges posed by sparse, ungrammatical texts with domain-specific contents, such as customer complaint emails and engineers’ repair notes. To this end, we present ExtTerm, a novel term extraction system. As our core innovations, we accurately detect rare (low-frequency) terms, overcoming the issue of data sparsity. These rare terms may denote critical events, but they are often missed by extant TE systems. ExtTerm also precisely detects multi-word terms of arbitrary length, including those with more than 2 words. This is achieved by exploiting fundamental theoretical notions underlying term formation, and by developing a technique to compute the collocation strength between any number of words. We thereby address the limitation of existing TE systems, which are primarily designed to identify terms with 2 words. Furthermore, we show that open-domain (general) resources, such as Wikipedia, can be exploited to support domain-specific term extraction, and can thus compensate for the unavailability of domain-specific knowledge resources. Our experimental evaluations reveal that ExtTerm outperforms a state-of-the-art baseline in extracting terms from a domain-specific, sparse and ungrammatical real-life text collection.
Recent years have witnessed a proliferation of unstructured text data. According to several studies (Blumberg and Atre, 2003; Russom, 2007), unstructured texts, in the form of customer complaint emails or engineers’ repair notes (e.g. job-sheets), constitute an overwhelming 80% of all corporate data. Buried within these massive amounts of corporate texts are meaningful information nuggets, such as pertinent domain-specific terms. For example, the term “proximity sensor”, appearing in engineers’ repair notes, indicates that the product component “proximity sensor” experienced a malfunction and had to be serviced by an engineer. Efficiently and effectively (accurately) detecting such terms from large repositories of texts is crucial in a wide range of corporate activities. For example, in Product Development-Customer Service (PD-CS) organizations, terms that designate products are useful in “cost of non-quality” analyses for determining which products contribute most significantly to maintenance costs. Terms also provide valuable information on which products fail recurrently and require frequent servicing. This information can subsequently be exploited to improve the product development process, resulting in better quality products and more satisfied customers.
In Natural Language Processing (NLP), several techniques exist for extracting terms from large collections of general and bio-medical texts (Ahmad et al., 1994; Chung, 2003; Dagan and Church, 1994; Frantzi and Ananiadou, 1999; Frantzi et al., 2000; Uchimoto et al., 2001). These techniques often rely on external knowledge resources, such as ontologies, which benefits their accuracy (Hiekata et al., 2010; Zhang et al., 2009). However, most extant term extraction (TE) algorithms are inadequate to address the challenges posed by domain-specific texts, such as those in corporate domains like PD-CS.
A major challenge is the sparse nature of these texts, which do not offer reliable statistical evidence and severely compromise the algorithms’ performance. This difficulty is further compounded by the lack of comprehensive domain-specific knowledge resources, e.g. corporate ontologies, which are difficult to create and to maintain (Auger and Barriere, 2010; Blohm and Cimiano, 2007; Lapata and Lascarides, 2003; Maynard and Ananiadou, 2000; Pecina and Schlesinger, 2006). Another challenge is the detection of multi-word terms, especially those comprising more than 2 words, such as “frequency convertor control board”. In addition, there is the issue of ambiguous constructs, such as “device is regulating switch” (“the device is the regulating switch” vs. “the device is regulating the switch”), and that of incoherent phrases, such as “customer helpdesk collimator shutter”. We will elaborate on these difficulties in Section 2.4.
In response to the above challenges and to the growing need of organizations for extracting terms from corporate texts, we present ExtTerm, a novel framework for domain-specific TE. Our core contributions are as follows:
• ExtTerm accurately detects terms from sparse, domain-specific text collections that do not offer reliable statistical evidence, thereby overcoming the issue of data sparsity.
• We extract arbitrarily long terms, including those containing more than 2 words.
• Our term extraction approach is unsupervised, eschewing the need for domain-specific knowledge resources (e.g. ontologies), which are not always available and are expensive to construct manually. Instead, we rely only on a readily-available resource, viz. Wikipedia, as a knowledge base. This also shows that readily-available resources such as Wikipedia, despite their general contents, can be exploited for domain-specific TE and can thus compensate for the lack of domain-specific resources.
• ExtTerm accurately discriminates between valid terms and other ambiguous and incoherent expressions. Many of the latter expressions tend to exhibit some of the core properties of terms, and are thus incorrectly extracted by most existing term extraction systems.
In our experiments, we evaluate the performance of ExtTerm over a real-life, domain-specific text collection, which was provided by our industrial partners. The results reveal that ExtTerm achieves a very high accuracy level in domain-specific TE, and even outperforms the state-of-the-art algorithm of Frantzi et al. (2000).
This article is organized as follows. In Section 2, we present some basic notions associated with terms, describe existing work on automatic term extraction, and highlight the challenges posed by domain-specific, sparse, and informally-written texts. In Section 3, we develop our proposed methodology for term extraction from domain-specific documents. Experimental evaluations and performance comparisons against baselines are given in Section 4. We conclude and discuss areas of future work in Section 5.
In the remainder of this article, we refer to multi-word terms, e.g. “proximity sensor”, as complex terms, and to single-word terms, e.g. “footswitch”, as simple terms. We will also use “text collection”, “texts” and “corpus” (plural: “corpora”) interchangeably.
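
The notion of a collocation strength that applies to any number of words, mentioned among the contributions above, can be made concrete with a small sketch. The excerpt does not give ExtTerm's actual measure, so the code below substitutes a plainly different and well-known stand-in: pointwise mutual information generalised to n words, i.e. the log of the n-gram probability divided by the product of the single-word probabilities. The function names and the toy repair notes are assumptions made for illustration only.

```python
import math
from collections import Counter


def ngram_counts(sentences, n):
    """Count every contiguous n-word sequence in the tokenised sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def collocation_strength(candidate, sentences):
    """Generalised PMI: log of P(w1..wn) over the product of the P(wi).

    Illustrative stand-in, not ExtTerm's measure.
    Returns -inf when the candidate never occurs in the corpus.
    """
    words = tuple(candidate.split())
    unigrams = ngram_counts(sentences, 1)
    ngrams = ngram_counts(sentences, len(words))
    if ngrams[words] == 0:
        return float("-inf")
    joint = ngrams[words] / sum(ngrams.values())
    total_unigrams = sum(unigrams.values())
    independent = math.prod(unigrams[(w,)] / total_unigrams for w in words)
    return math.log(joint / independent)


# Toy repair notes, invented for illustration.
notes = [
    "replaced frequency convertor control board".split(),
    "frequency convertor control board defective".split(),
    "customer reported noise near control panel".split(),
]
strong = collocation_strength("frequency convertor control board", notes)
absent = collocation_strength("customer helpdesk collimator shutter", notes)
```

On this toy corpus the four-word candidate receives a positive score, while the incoherent phrase, which never occurs, receives negative infinity. A raw PMI of this kind is known to be unreliable at very low frequencies, which is precisely the sparsity problem the article sets out to address, so it should be read as a baseline illustration rather than a solution.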
Conclusion
Most term extraction (TE) techniques developed to date have predominantly focused on large, well-written corpora, such as newspaper and bio-medical texts. These texts provide reliable linguistic and statistical evidence, which facilitates the detection of terms. Furthermore, existing TE techniques often rely on readily-available knowledge resources, such as ontologies, to support their term extraction process, leading to substantial performance gains. However, the desiderata of large, well-written corpora and readily-available knowledge resources are rarely met in corporate environments. In the domain of Product Development-Customer Service (PD-CS), for example, repair notes created by engineers tend to be sparse and ungrammatical (informally-written), which makes it hard to accurately detect terms from their contents. This difficulty is further compounded by the lack of readily-available, domain-specific knowledge resources. Consequently, traditional TE techniques face a number of challenges in extracting terms from these types of domain-specific texts, and their performance is severely compromised. In this article, we addressed these difficulties by presenting ExtTerm, a novel framework for term extraction from sparse, ungrammatical domain-specific documents. Our contributions to the TE literature and main innovations are as follows. Unlike existing techniques, ExtTerm overcomes the issue of data sparsity by accurately detecting rare terms, even those appearing with very low frequency in a corpus. Thus, it does not suffer from the issue of silence (valid terms going undetected), which benefits its recall. ExtTerm also precisely rejects irrelevant expressions even if they appear frequently in the corpus, mitigating the issue of noise and improving its precision. Furthermore, we present a technique, grounded in the theoretical notion of term formation, for detecting arbitrarily long terms, including those containing more than 2 words.
In addition, we show that open-domain (general) knowledge resources, such as Wikipedia, can be exploited to support term extraction from specific domains. The main benefit of relying on such resources is that they are readily-available, comprehensive (large) and accurate. Therefore, they constitute an attractive alternative that compensates for the lack of domain-specific resources such as ontologies.
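
How a general resource might support domain-specific term extraction can be sketched as follows. The excerpt does not describe how ExtTerm actually consults Wikipedia, so this is only one plausible and deliberately simple reading: a candidate counts as supported when the whole phrase matches an article title. The helper names and the tiny `SAMPLE_TITLES` set are hypothetical; a real setup would presumably load titles from an offline dump.

```python
def normalise(phrase):
    """Lower-case a phrase and collapse runs of whitespace."""
    return " ".join(phrase.lower().split())


# Hypothetical sample of article titles, standing in for an offline title dump.
SAMPLE_TITLES = {
    normalise(title)
    for title in ("Proximity sensor", "Collimator", "Circuit breaker")
}


def has_wikipedia_support(candidate, titles=SAMPLE_TITLES):
    """Return True when the whole candidate phrase matches a known title."""
    return normalise(candidate) in titles


candidates = ["proximity sensor", "customer helpdesk collimator shutter"]
supported = [c for c in candidates if has_wikipedia_support(c)]
```

With the sample titles, only “proximity sensor” is retained. Exact title matching is a strict criterion: it will miss valid domain terms that have no article of their own, so in practice it would more likely serve as one piece of evidence among several than as a hard filter.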