A data mining approach to knowledge discovery from multi-dimensional cube structures
Article code | Year of publication | Pages in English article |
---|---|---|
21452 | 2013 | 14 pages (PDF) |
Publisher: Elsevier - ScienceDirect
Journal: Knowledge-Based Systems, Volume 40, March 2013, Pages 36–49
Abstract
In this research we present a novel methodology for the discovery of cubes of interest in large multi-dimensional datasets. Unlike previous research in this area, our approach does not rely on the availability of specialized domain knowledge. Instead, it uses robust data reduction methods such as Principal Component Analysis and Multiple Correspondence Analysis to identify a small subset of numeric and nominal variables that capture the greatest degree of variation in the data; these variables are then used to generate cubes of interest. Hierarchical clustering was integrated with data reduction in order to gain insight into the dynamics of relationships between variables of interest at different levels of data abstraction. Two case studies conducted on real-world datasets revealed that the methodology was able to capture regions of interest that were significant from both the application and statistical perspectives.
Introduction
Knowledge discovery aims to extract valid, novel, potentially useful, and ultimately understandable patterns from data [7]. Recently, the integrated use of data mining and Online Analytical Processing (OLAP) has received considerable attention from researchers and practitioners alike, as they are key tools used in knowledge discovery from large data cubes [8], [10], [21], [23], [24], [34], [35] and [37]. A variety of integrated approaches have been proposed in the literature to mine large data cubes for discovering knowledge. However, a number of issues remain unresolved in that previous work [26], [25], [16] and [22], especially on the intelligent data analysis front. Firstly, the prior work assumed that data analysts could identify a set of candidate data cubes for exploratory analysis based on domain knowledge. Unfortunately, there are situations where this assumption does not hold. These include high-dimensional datasets, where it may be very difficult or even impossible to predetermine which dimensions and which cubes are the most informative. In such environments it would be highly desirable to automate the process of finding the dimensions and cubes that hold the most interesting and informative content. Secondly, reliance on domain knowledge tends to confine discovery to what is already known, thus excluding unexpected but nonetheless interesting knowledge [14]. A related issue is that such reliance restricts the application of these methodologies to domains where domain knowledge is available. However, a knowledge discovery system should be able to work in ill-defined domains [20] and in other domains where no background knowledge is available [36]. This motivated us to formulate a generic methodology for data cube identification and knowledge discovery that is applicable across application domains, including environments where limited domain knowledge exists.
High-dimensional, high-volume datasets present significant challenges to domain experts in terms of identifying data cubes of interest. The presence of mixed data in the form of nominal and numeric variables presents further complications, as the interrelationships between nominal and numeric variables must also be taken into account. A methodology that assists domain experts in identifying dimensions and facts of interest is highly desirable in these types of environments. In this paper, we address these issues by proposing a knowledge discovery methodology that utilizes a combination of machine learning and statistical methods to identify interesting regions of information in large multi-dimensional data cubes. We utilize hierarchical clustering to construct data cubes at multiple levels of data abstraction. At each level of data abstraction, we apply well-known dimension reduction techniques such as Principal Component Analysis (PCA) and Multiple Correspondence Analysis (MCA) in order to identify the most informative dimensions and facts present in data cubes. PCA was designed to operate on numeric data, whereas MCA works with nominal data. Each of these techniques can operate on its own on the appropriate partition of the original dataset and produce its own set of most informative variables. The research challenge then becomes the integration of the partition containing the candidate numeric variables with the partition containing the most informative nominal variables. We make two main contributions in this paper. Firstly, we generate cubes at different levels of data abstraction and study the effect of abstraction level on information content. Secondly, at each level of data abstraction we identify the most significant interrelationships that exist between numeric and nominal variables, thus enabling the cubes of interest to be identified.
The rest of the paper is organized as follows. In the next section, we review previous work in the area of mixed data analysis and assistance towards intelligent exploration of data cubes. In Section 3 we give an overview of the two main statistical approaches used in this paper, namely PCA and MCA. Section 4 presents an overview of our proposed methodology and illustrates the methodological steps with a hypothetical example. Our real-world case studies, presented in Section 5, show that the knowledge discovered is significant from both the application and statistical perspectives. Finally, we summarize our research contributions in Section 6 and outline directions for future research.
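The introduction names PCA as the technique for finding the most informative numeric variables. As a minimal sketch of that idea, and not the authors' implementation, the following Python ranks numeric variables by the absolute value of their loading on the first principal component of the correlation matrix. The ranking criterion, the use of power iteration, the function names, and the toy data are all assumptions made for illustration; the MCA and hierarchical clustering steps are not reproduced here.

```python
import math


def standardize(rows):
    """Convert each column to z-scores (population standard deviation).

    Assumes no column is constant; a constant column would divide by zero.
    """
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n)
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)]
            for row in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized observations."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(p)]
            for i in range(p)]


def first_component(matrix, iterations=500):
    """Approximate the dominant eigenvector by power iteration.

    The uneven start vector lowers the chance of starting orthogonal
    to the dominant eigenvector, but does not rule it out.
    """
    p = len(matrix)
    v = [1.0 / (i + 1) for i in range(p)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def rank_numeric_variables(names, rows):
    """Hypothetical helper: sort variables by |loading| on the first PC."""
    loadings = first_component(correlation_matrix(standardize(rows)))
    pairs = zip(names, (abs(x) for x in loadings))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Toy data (invented): "a" and "b" move together, "c" is largely unrelated.
names = ["a", "b", "c"]
a = [1, 2, 3, 4, 5, 6]
b = [2, 4, 6, 8, 10, 12.5]
c = [3, 1, 4, 1, 5, 2]
rows = list(zip(a, b, c))

for name, loading in rank_numeric_variables(names, rows):
    print(f"{name}: {loading:.3f}")
```

In this toy example the two correlated variables dominate the first component, so the weakly related variable ranks last. A fuller treatment would consider several components and weight them by explained variance.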
Conclusion
In this research we have demonstrated that classical statistical methods for data analysis such as PCA and MCA can be used successfully in conjunction with hierarchical clustering to uncover useful information implicit in large multi-dimensional datasets. The two case studies presented evidence that interrelationships between nominal and numeric variables, which are significant from both the application and statistical perspectives, can be discovered without excessive reliance on specialized domain knowledge. Furthermore, the case studies also revealed that the application of hierarchical clustering improved the knowledge discovery process. The dimensions that define the cubes of interest vary from level to level, indicating that the interrelationships between the nominal and numeric variables change as the clusters become tighter at the lower levels of the dendrogram. This clearly illustrates the need for automated support, as domain specialists, although knowledgeable, cannot be expected to predict with high precision the dynamics of such relationships at different levels of data abstraction. There are two key directions that we envisage for future work in this area. Firstly, it would be interesting to explore the use of alternative methods for identifying variables of interest. The use of entropy to measure the information content of variables at different points in the data hierarchy should be explored, and its effectiveness compared with that of PCA and MCA as used in this research. Secondly, OLAP analysis on cubes of interest can be augmented with association rule mining methods to gain further insights into the interplay between variables at various levels of data abstraction.
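The first direction above proposes entropy as a measure of a variable's information content. A minimal sketch of what that could look like, assuming Shannon entropy over the empirical distribution of each nominal variable's values within a single cluster; the function names and the toy columns are hypothetical, and nothing here is taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def rank_by_entropy(columns):
    """Hypothetical helper: sort variables from highest to lowest entropy."""
    scored = ((name, shannon_entropy(vals)) for name, vals in columns.items())
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy cluster (invented values) with three nominal variables.
cluster = {
    "region": ["north", "south", "east", "west"],
    "segment": ["retail", "retail", "online", "online"],
    "currency": ["USD", "USD", "USD", "USD"],
}

ranking = rank_by_entropy(cluster)
for name, bits in ranking:
    print(f"{name}: {bits:.2f} bits")
```

A variable with a single value in a cluster has zero entropy, while values spread evenly over four categories give two bits. Whether high or low entropy marks a variable as interesting at a given level of the hierarchy is a design choice that the future work would need to settle.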