A clustering study of a 7000-document EU inventory using MDS and SOM
Article code | Publication year | Length of English article |
---|---|---|
20646 | 2011 | 15 pages (PDF) |
Publisher : Elsevier - Science Direct
Journal : Expert Systems with Applications, Volume 38, Issue 7, July 2011, Pages 8835–8849
English abstract
In this article, we discuss a number of methods and tools to cluster a 7000 document inventory in order to evaluate the impact of EU funded research in social sciences and humanities on EU policies. The inventory, which is not publicly available, but provided to us by the European Union (EU) in the framework of an EU project, could be divided into three main categories: research documents, influential policy documents, and policy documents. To represent the results in a way that non-experts could make use of it, we explored and compared two visualisation techniques, multi-dimensional scaling (MDS) and the self-organising map (SOM), and one of the latter’s derivatives, the U-matrix. Contrary to most other approaches, which perform text analyses only on document titles and abstracts, we performed a full text analysis on more than 300,000 pages in total. Due to the inability of many software suites to handle text mining problems of this size, we developed our own analysis platform. We show that the combination of a U-matrix and an MDS map, which is rarely performed in the domain of text mining, reveals information that would go unnoticed otherwise. Furthermore, we show that the combination of a database, to store the data and the (intermediate) results, and a webserver, to visualise the results, offers a powerful platform to analyse the data and share the results with all participants/collaborators involved in a data- and computation intensive EU-project, thereby guaranteeing both data- and result consistency.
English introduction
Text mining is a fairly mature research field (see e.g. Erhardt et al., 2006 and Yang et al., 2008 and references therein) with many successful applications in a variety of domains such as biomedicine (Collier and Takeuchi, 2004, Duch et al., 2008, Erhardt et al., 2006 and Lourenço et al., 2009), document classification (Isa et al., 2009 and SanJuan and Ibekwe-SanJuan, 2006), document generation (Yang, 2009 and Yang and Lee, 2005), ontology mapping (Tsoi, Patel, Zhao, & Zheng, 2009), computer intrusion/fraud detection (Adeva and Atxa, 2007 and Holton, 2009), etc. Text mining is itself a multidisciplinary field that involves many subtasks such as semantic analysis, clustering, and categorisation. Almost all text mining analyses can be split into three major stages: a data preprocessing and/or formatting stage; an extensive information extraction, transformation, and selection stage; and, finally, the actual analysis stage followed by visualisation. In the preprocessing stage, the documents are formatted so that they become computer interpretable: binary and non-binary formats like Microsoft Word or Hypertext Markup Language (HTML) are converted into flat text (ASCII), with or without preservation of the document structure. In the second stage, one extracts information from the documents, reduces the existing redundancy, and finally converts the documents into a numerically usable format (Manning and Schütze, 1999 and Salton and McGill, 1983). This second stage usually relies on semantic/lexical analysis methods to discern informative words (or word groups) from non-informative ones, e.g., part-of-speech taggers (POS, Toutanova and Manning, 2000 and Toutanova et al., 2003), natural language parsers (NLP, Klein & Manning, 2002), or named-entity recognition (NER, Finkel, Grenager, & Manning, 2005) methods.
POS methods label every word according to its type (nouns, verbs, plural proper nouns, etc.), while NER methods concentrate on sequences of words in a text that are the names of things, persons, institutions, companies, etc. Natural language parsers group together (sequences of) words that act, e.g., as subject, adverb, or direct object clause in the given sentence. Sometimes, different extraction methods are combined within the same tool, and many of them are part of larger machine learning and/or text processing suites like WEKA (Witten & Frank, 2005), GATE (Cunningham, Maynard, Bontcheva, & Tablan, 2002) or Rapid-I (formerly known as YALE). For an overview of (commercial) text mining suites we refer to the literature, e.g., Yang et al. (2008). From the semantically/lexically analysed sentences, one usually selects only the most informative types of words, e.g., only the extracted nouns (Yang & Lee, 2005), and converts them into a numerically usable format by applying the vector space model (Salton & McGill, 1983). This method encodes documents as vectors, in which each vector component corresponds to a different term. The value of each component is the frequency with which the corresponding term occurs in the given document. Finally, one eliminates words that have the same meaning but a different spelling by applying techniques such as stemmers (Lovins, 1968 and Porter, 1980), which reduce a conjugated verb to its stem; case simplifiers, which make words in upper, lower or mixed case appear in the same case; synonym lists or co-word/co-occurrence analysis techniques (Erhardt et al., 2006); or metrics based on the inverse document frequency (IDF). The IDF is derived from the number of documents that contain a given term: the fewer documents contain the term, the higher its IDF. At the end of the second stage, one retains a collection of document vectors that are maximally informative with a minimum amount of redundancy.
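The vector space encoding and the IDF described above can be illustrated with a minimal sketch. It assumes the documents are already tokenised and reduced to lower-case nouns, and it uses the common form IDF = log(N / df); the exact IDF-based metric used in the article is not stated here, so this form is an assumption. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def term_frequency_vectors(docs):
    """Encode each tokenised document as a vector of raw term counts."""
    # One vector component per distinct term in the whole collection
    vocabulary = sorted({term for doc in docs for term in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

def inverse_document_frequency(docs, vocabulary):
    """IDF = log(N / df), with df the number of documents containing the term.

    This is one common definition, assumed here for illustration.
    """
    n_docs = len(docs)
    doc_sets = [set(doc) for doc in docs]
    idf = {}
    for term in vocabulary:
        df = sum(1 for terms in doc_sets if term in terms)
        idf[term] = math.log(n_docs / df)
    return idf

# Hypothetical toy corpus: three tiny "documents" of extracted nouns
docs = [
    ["policy", "research", "policy"],
    ["research", "education"],
    ["policy", "research", "employment"],
]
vocabulary, vectors = term_frequency_vectors(docs)
idf = inverse_document_frequency(docs, vocabulary)
```

A term that occurs in every document (here "research") gets an IDF of zero and carries no discriminating information, which is why IDF-based metrics can be used to discard uninformative vector components.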
In this article, we report on an EU project, the aim of which was to analyse, classify, and visualise EU funded research in social sciences and humanities in EU framework programmes FP5 and FP6. This project, called the SSH project for short, was aimed at evaluating the contributions of research to the development of EU policies (see Section 2). In this article, we focus on discourse links, which group documents together when they are semantically similar. Consequently, they cover the same topic/domain and/or refer to the same group of other items (documents). For the visualisation of the document clustering, we focus on self-organising maps (SOM), a very commonly applied method, also in text mining (Isa et al., 2009, Vesanto, 1999, Yang, 2009 and Yang and Lee, 2005), and on the multi-dimensional scaling (MDS) technique, which is rather rarely applied in text mining (Chen et al., 2008 and Duch et al., 2008), but offers great advantages when dealing with huge data sets. Since the total document inventory consists of 7038 unstructured documents, together over 300,000 pages, we had to be very selective within the pool of existing methods and software suites to be able to obtain results within an acceptable timespan. Due to the size of the inventory, we had to perform any linkage analysis in an unsupervised way: we have category labels for each document of the inventory (see Fig. 1), but these refer to the origin of the document rather than its contents, although the two are related. Therefore, supervised methods like naive Bayesian classifiers (see e.g., Isa et al., 2009) or exploratory techniques that rely on the document structure are not applicable. In addition, one of the most frequently occurring problems with the mentioned (commercial) software suites is their limited processing capability: they can handle only a limited number of documents and/or pages, or they perform poorly.
The size of our inventory also makes some interesting methods, like co-word/co-occurrence analyses (Erhardt et al., 2006), too time- and memory-consuming. Consequently, we opted to develop our own analysis platform. This platform is described in more detail in Appendix B.

Fig. 1. Overview of the different document categories present in the inventory. Note the three main categories, namely, research documents, influential policy documents, and policy documents. Source: Idea Consult, Brussels.

This article is structured as follows: we first briefly discuss the data set, and then the methodology and how it is applied: data preprocessing, information retrieval and extraction, dimensionality reduction of the document vector space, and linkage analysis and visualisation of the results. Next, we present and discuss the obtained results. Finally, we end this article with a conclusion and some perspectives for further research.
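For readers unfamiliar with the self-organising map and its U-matrix, which this introduction and the abstract name as visualisation techniques, the following is a minimal sketch of both. It is not the authors' platform: the online update rule, the Gaussian neighbourhood, the linear decay schedules, and all function names and parameter values are assumptions chosen for brevity, and the toy data are hypothetical two-dimensional vectors rather than document vectors.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid position (row, col) of the unit whose weight vector is closest to x."""
    positions = ((r, c) for r in range(len(weights)) for c in range(len(weights[0])))
    return min(positions, key=lambda rc: math.dist(weights[rc[0]][rc[1]], x))

def train_som(data, rows, cols, n_epochs=50, seed=0):
    """Train a small rectangular SOM with online updates (illustrative schedules)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    n_steps = n_epochs * len(data)
    sigma0 = max(rows, cols) / 2.0
    step = 0
    for _ in range(n_epochs):
        for x in rng.sample(data, len(data)):      # shuffled pass over the data
            frac = step / n_steps
            lr = 0.5 * (1.0 - frac)                # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.5    # shrinking neighbourhood radius
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = lr * math.exp(-grid_d2 / (2.0 * sigma * sigma))
                    w = weights[r][c]
                    for k in range(dim):
                        w[k] += h * (x[k] - w[k])  # pull the unit towards x
            step += 1
    return weights

def u_matrix(weights):
    """Mean distance from each unit to its 4-connected grid neighbours.

    Assumes the grid has at least two units, so every unit has a neighbour.
    """
    rows, cols = len(weights), len(weights[0])
    um = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nbrs = [(r + dr, c + dc)
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= r + dr < rows and 0 <= c + dc < cols]
            um[r][c] = sum(math.dist(weights[r][c], weights[i][j])
                           for i, j in nbrs) / len(nbrs)
    return um

# Hypothetical toy data: two small groups of 2-D vectors
data = [[0.1, 0.1], [0.15, 0.05], [0.05, 0.2],
        [0.9, 0.95], [0.85, 0.9], [0.95, 0.8]]
weights = train_som(data, rows=4, cols=4, n_epochs=30)
um = u_matrix(weights)
```

High U-matrix values mark units whose neighbours are far away in input space, which is how cluster borders become visible on the map.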
English conclusion
In this article, we focused on document clustering of huge document inventories using text mining. The approach we proposed combines a central database server (which contains all initial, intermediate, and final results), parallelised algorithms, a High Performance Computing (HPC) infrastructure, and a webserver to display the obtained results, and is very well suited to tackle this type of clustering problem. In addition, the webpages contain hyperlinks, which permit one to quickly inspect the documents grouped in the visualised clusters. Data consistency is guaranteed through the use of the database server, which acts as a backbone for the whole analysis process and consequently ensures that all computing units work on identical data. Result consistency is guaranteed by the combination of the database server and the webserver that generates the webpages with the results. Result consistency means here that everyone who consults the results looks at the same data and figures. The combined data and result consistency is a great advantage when working with several partners on one project. The methods used to perform the actual analysis are common to many text mining applications: data preprocessing, information extraction, selection and cleanup, and finally the application of some clustering tool. For clustering, we explored both the self-organising map (SOM) and multi-dimensional scaling (MDS). Whereas the SOM delivers the most detailed information, the MDS acts rather as a quick informer that highlights trends. The two are definitely complementary. However, considering the computational cost, the MDS has a major advantage, given that its underlying Euclidean distance matrix can be calculated in parallel, and that this matrix can also be used to calculate, e.g., the cross-similarity matrices.
As soon as parallel versions of SOM algorithms with a relatively small memory footprint become available, this disadvantage of the SOM will of course diminish. A possible middle ground could be to first reduce the number of dimensions using MDS, principal component analysis (PCA), or any other projection technique instead of the IDF-based metric used here. However, this might introduce a problem related to the fact that such dimensionality reduction techniques are based upon projections: when reducing to a small number of dimensions, each new dimension is in fact a combination of underlying terms. Consequently, when showing the results, it becomes harder to detect the basis on which the documents were grouped together. Another option could be to eliminate or combine those dimensions that are redundant in their occurrence patterns, as indicated by a correlation analysis, feature selection, or input variable selection. When we compared exact and similarity term matching, we found no major differences, neither between the obtained MDS maps nor between the obtained SOM maps. In addition, the MDS maps based on the document acronym vectors correspond slightly better with the MDS maps based on the lexical document vectors when the latter are constructed with the exact term matching algorithm instead of the similarity term matching algorithm. Given that the exact term matching algorithm is much faster, we can recommend exact term matching, based on the document inventory used and on the MDS results, thereby disregarding the (at least intuitive) advantage of an algorithm that also accounts for possible misspellings. Given our setup (database server, HPC calculation unit, and webserver-based result visualisation), we foresee that this combination can handle even larger document inventories.
Indeed, the single database server can be extended to a database cluster, thereby increasing both performance and network bandwidth with the HPC infrastructure that performs the necessary calculations. Small but straightforward adaptations to each of the three components make this combination perfectly scalable.
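The remark in this conclusion that the Euclidean distance matrix underlying the MDS can be calculated in parallel can be made concrete with a small sketch. The text does not specify which MDS variant was used, so classical (Torgerson) MDS is assumed here; power iteration with deflation stands in for a library eigensolver to keep the example dependency-free. The function names and the toy vectors are hypothetical.

```python
import math
import random

def distance_row(vectors, i):
    """Euclidean distances from vector i to all vectors.

    Each row depends only on the input vectors, so rows can be computed
    independently, e.g. one row (or block of rows) per worker.
    """
    return [math.dist(vectors[i], v) for v in vectors]

def classical_mds(dist, n_components=2, n_iter=500, seed=0):
    """Classical MDS: double-centre the squared distances, then take the
    leading eigenvectors via power iteration with deflation."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # B = -1/2 * J * D^2 * J; D is symmetric, so column means equal row means
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * n_components for _ in range(n)]
    for k in range(n_components):
        v = [rng.random() for _ in range(n)]
        for _ in range(n_iter):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue belonging to v
        eigval = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate: remove the component just found before the next iteration
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords

# Hypothetical toy input: four 2-D vectors standing in for document vectors
vectors = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]]
dist = [distance_row(vectors, i) for i in range(len(vectors))]
coords = classical_mds(dist)
```

Because `distance_row` needs nothing but the input vectors, the list comprehension that builds `dist` could be replaced by a parallel map over row indices without changing the result; the same matrix can then be reused for other analyses.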