Linear correlation discovery in databases: a data mining approach
|Article code||Publication year||English article||Persian translation||Word count|
|21399||2005||27-page PDF||Available to order||11525 words|
Publisher: Elsevier - Science Direct
Journal: Data & Knowledge Engineering, Volume 53, Issue 3, June 2005, Pages 311–337
Very little research in knowledge discovery has studied how to incorporate statistical methods to automate linear correlation discovery (LCD). We present an automatic LCD methodology that adopts statistical measurement functions to discover correlations from databases’ attributes. Our methodology automatically pairs attribute groups having potential linear correlations, measures the linear correlation of each pair of attribute groups, and confirms the discovered correlation. The methodology is evaluated in two sets of experiments. The results demonstrate the methodology’s ability to facilitate linear correlation discovery for databases with a large amount of data.
As competition among businesses continues to increase, it is crucial for organizations to discover knowledge that could give them advantages over their competitors. In the past, such knowledge often was obtained by collecting data and testing it against some predefined hypothesis (i.e., a hypothetico-deductive approach to obtaining knowledge). Lately, greater emphasis has been given to discovering (inducing) knowledge from existing databases. The knowledge discovery approach employs various data mining algorithms, such as association rule mining algorithms, to obtain knowledge from databases. However, very little work has investigated the possibility of automating traditional data analysis using statistical methods for knowledge discovery. Because the amount of data generated and accumulated continues to outpace the number of available experienced analysts, it is imperative to develop methods to automate and expedite data analysis for knowledge discovery from existing databases.
This research establishes a novel discovery methodology to induce business knowledge (also called business intelligence) in the form of linear correlations for better decision making.
As an illustration of the usefulness of such automated knowledge discovery, consider an organization with globally distributed factories that wants to determine how factory effectiveness can be improved. Factory effectiveness includes various aspects, such as cost per unit produced, output per day, and factory downtime. Furthermore, many possible factors could influence factory effectiveness, including wages, reliability of supply, and age of the factory. In traditional data analysis, data analysts must first propose a set of hypotheses for testing. Statistics software, such as SAS/STAT and SPSS Base, can provide only the mechanisms to test these possible relationships.
Therefore, data analysts are responsible for ascertaining the appropriate analysis that will identify relationships through hypothesis testing, and must manually select the appropriate factor, outcome, and measurement function for each analysis. Often, fatigue, overhead, manpower cost, and the limits of human cognitive capability detract from a thorough understanding and complete analysis of the sheer volume of data available from existing business databases.
In this research, we demonstrate that these manual tasks of traditional data analysis (i.e., proposing hypotheses and selecting factors, outcomes, and measurement functions) can be automated for knowledge discovery in databases. Specifically, we automate linear correlation discovery (LCD), the goal of which is to determine whether two attributes or sets of attributes (i.e., attribute groups) have a relationship. A thorough discussion of LCD is presented in Section 2.1.
Some previous work has addressed related problems. Hou developed a system that determined whether a regression or a classifier was appropriate for analyzing a system. The SNOUT project derived some properties of attributes that could be leveraged for analysis. Aladwani created an expert system to select an appropriate multiple-comparison test. Some authors have developed clustering or classification algorithms based on correlation (e.g., derivatives of Principal Component Analysis), while others have developed pre-processors to select the ‘best’ algorithm for a given task. However, to our knowledge, no one has attempted specifically to automate linear correlation discovery.
1.1. Research objective and contributions
The objective of this research is to demonstrate the feasibility of automating and expediting the LCD process. To do so, we develop a methodology that contributes to knowledge discovery in databases, particularly LCD, in four ways:
• Classify attributes. We semi-automatically derive the measurement properties of attributes, such as distance and order, and employ this information to classify attributes. Attribute classification occurs through the analysis of schema information, such as attribute data type and length, and data contents (attribute values) of the target relational database. Although users and analysts may be able to perform this task manually, our goal is to automate as much of the LCD process as possible to reduce unnecessary human involvement.
• Consider attribute groups. We consider potential correlations among sets of attributes (i.e., attribute groups) as well as among individual attributes. For example, it is not sufficient to conjecture that the age of a factory or employee wage alone influences factory performance. It is instead necessary to conjecture that factory age and wages together affect factory performance.
• Determine the correlation measurement functions. We establish a set of heuristic rules to determine the most appropriate correlation measurement function to measure each potential induced linear correlation.
• Confirm the discovered correlations. The discovered linear correlations are confirmed by repeating the same measurement on subsequent data samples. Therefore, artifact discoveries can be rejected automatically.
To evaluate the proposed methodology, we have developed a prototype system, the Linear Correlation Discovery System. The prototype is written in Visual Basic and uses MS Access and SPSS Base as the underlying database and statistical analysis packages, respectively.
1.2. Paper organization
The remainder of this paper is organized as follows. Section 2 introduces LCD and provides an overview of the proposed methodology. Section 3 discusses the use of random samples to expedite the discovery process. Sections 4 and 5 discuss the induction and measurement of potential linear correlations, respectively. Section 6 presents the method to confirm discovered correlations. Section 7 then elaborates on the evaluation experiments. Section 8 concludes the paper and discusses further directions. Appendix A and Appendix B present the heuristic rules established for the proposed methodology.
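As a rough illustration of the measure-then-confirm idea described in Section 1.1, the following Python sketch measures a correlation on one random sample and then repeats the measurement on further samples. It is not the prototype's implementation: the use of Pearson's r as the measurement function, the 0.5 threshold, the sample size, the number of confirmation runs, and the function names `pearson_r` and `discover_and_confirm` are all assumptions made for this example.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def discover_and_confirm(rows, attr_a, attr_b, sample_size=200,
                         n_confirm=3, threshold=0.5, seed=0):
    """Measure |r| on one random sample, then re-measure on further samples.

    rows is a list of dicts keyed by attribute name. The threshold and the
    sample counts are hypothetical defaults, not values from the paper.
    Returns True only if every sample clears the threshold.
    """
    rng = random.Random(seed)
    for _ in range(1 + n_confirm):  # one discovery run plus the confirmations
        sample = rng.sample(rows, min(sample_size, len(rows)))
        r = pearson_r([row[attr_a] for row in sample],
                      [row[attr_b] for row in sample])
        if abs(r) < threshold:
            return False  # treated as an artifact and rejected
    return True
```

Calling `discover_and_confirm(rows, "Avg_Wage", "Defects")` on a list of row dictionaries would report whether the pair clears the threshold on every sample. Requiring agreement across independently drawn samples is one simple way to filter out correlations that appear only by chance in a single sample.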
Conclusion
We present an automatic discovery methodology to expedite LCD from relational databases. The proposed LCD methodology improves knowledge discovery by enabling researchers to:
(1) Discover linear correlations automatically. The methodology automates LCD by inferring the measurement properties of attributes, which are derived by examining schema information and the attributes’ values. Our established set of heuristic rules can determine the proper correlation measurement function for each potential linear correlation.
(2) Consider attribute groups when identifying correlations. By identifying potential correlations, the methodology considers correlations between sets of attributes (i.e., attribute groups), as well as between individual attributes.
(3) Confirm the discovered correlations. Because the discovery confirmation method repeats the measurement of the discovered correlations on a set of data samples drawn from the original relation, artifact discoveries can be rejected automatically.
Our work on LCD has opened up several avenues of further research. For example, the LCD methodology considers only correlations of attribute groups with randomly distributed errors. It would be interesting to consider attribute groups with other error distributions as well. For example, time series analysis is performed on attribute groups in which errors are distributed according to a time sequence. The automation of time series analysis remains an unsolved research problem.
The LCD methodology discovers correlations without considering the direction of their relationships. Whereas the LCD methodology can determine that Defects and Avg_Wage are related, it cannot determine whether increasing Avg_Wage leads to lower Defects or whether lower Defects leads to a better Avg_Wage. One extension might automatically establish structural equation models, or models of directional connections, that are based on the discovered correlations.
In addition, it is necessary to derive the measurement properties of attributes to automate LCD, but it is computationally intensive to derive these properties through the analysis of schema information combined with data instances. This computational effort might be reduced if the measurement properties could be captured during the database design process and then embedded into the database schema. Determining the measurement properties that might be appropriate for embedding is an interesting research topic for database design.
Our method also adopts fairly primitive methods to screen out uninteresting pairs of multi-attribute data. Other mechanisms to screen out pairs of data with low correlation could potentially enhance our method’s performance.
Finally, whereas database management systems have standard query languages and interfaces (e.g., SQL, ODBC) to facilitate easy access to data, such standardization does not exist for statistical data analysis packages. A language similar to SQL for data analysis should be developed to facilitate data analysis and knowledge discovery.
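The suggestion of capturing measurement properties at design time can be pictured with a small Python sketch. Everything in it is hypothetical and chosen only for illustration: the `AttributeInfo` record, the `scale` annotation, the three scale labels, and the crude fallback rule based on declared type and number of distinct values. The heuristic rules actually used by the methodology are those in the paper's appendices and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Declared SQL types treated as numeric in this sketch (an assumption).
NUMERIC_TYPES = {"INTEGER", "REAL", "FLOAT", "DOUBLE", "DECIMAL"}


@dataclass
class AttributeInfo:
    name: str
    sql_type: str                # declared type, e.g. "INTEGER" or "VARCHAR"
    scale: Optional[str] = None  # "nominal", "ordinal" or "interval" if recorded at design time


def infer_scale(attr: AttributeInfo, values: Sequence[object],
                max_codes: int = 10) -> str:
    """Return the measurement scale of an attribute.

    If the designer embedded the scale in the schema, it is used directly and
    the data values are never scanned. Otherwise a deliberately simple rule is
    applied: non-numeric types are nominal, numeric columns with few distinct
    values are treated as ordinal codes, and the rest as interval.
    """
    if attr.scale is not None:
        return attr.scale
    if attr.sql_type.upper() not in NUMERIC_TYPES:
        return "nominal"
    if len(set(values)) <= max_codes:
        return "ordinal"
    return "interval"
```

With an annotated schema, `infer_scale` returns immediately without scanning the values, which is the saving the paragraph above anticipates. Without the annotation, every attribute requires a pass over its values.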