کشف از ارتباط مخفی در یک پایگاه داده معامله محلی مبتنی بر تفاوت همبستگی
کد مقاله | سال انتشار | تعداد صفحات مقاله انگلیسی |
---|---|---|
9116 | 2006 | 10 صفحه PDF |
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Engineering Applications of Artificial Intelligence, Volume 19, Issue 4, June 2006, Pages 419–428
چکیده انگلیسی
Given a transaction database as a global set of transactions and its local database obtained by some conditioning of the global database, we consider pairs of itemsets whose degrees of correlation are higher in the local database than in the global one. A problem of finding paired itemsets with high correlation in one database is already known as discovery of correlation, and has been studied as the highly correlated itemsets are characteristic in the database. However, even noncharacteristic paired itemsets are also meaningful provided the degree of correlation increases significantly in the local database compared with the global one. They can be implicit and hidden evidences showing that something particular to the local database occurs, even though they were not previously realized to be characteristic. From this viewpoint, we have proposed measurement of the significance of paired itemsets by the difference of two correlations before and after the conditioning of the global database, and have defined a notion of DC pairs, whose degrees of difference of correlation are high. In this paper, we develop an algorithm for mining DC pairs and apply it to a transaction database with time stamp data. The problem of finding DC pairs for large databases is computationally hard in general, as the algorithm has to check even noncharacteristic paired itemsets. However, we show that our algorithm equipped with some pruning rules works successfully to find DC pairs that may be significant.
مقدمه انگلیسی
In studies of data mining from transaction databases, many researchers have concentrated on finding itemsets with high support, i.e., paired itemsets appearing in association rules with high confidence (Agrawal and Srikant, 1994), or paired itemsets with strong correlation (Aggarwal and Yu, 1998; Brin et al., 1997a and Brin et al., 1997b; Morishita and Sese, 2000). These concepts are considered useful for distinguishing characteristic paired itemsets with strong correlation in a single transaction database. A similar strategy, based on the concept of change of support, known as emerging patterns (Dong and Li, 1999), is successful for finding itemsets characterizing either of two databases. All of these concepts regarding itemsets are therefore proposed to extract paired itemsets with required characteristics in a given database or in one of two or more databases. Some characteristic paired itemsets with high correlation are useful for finding a general tendency. However, there may exist such paired itemsets with high correlation independently of time stamps and multiple times. Some users may regard this as trivial because they already know the relation. On the other hand, as is indicated in the study of chance discovery (Ohsawa and Nara, 2002), some itemset pairs that are not characteristic may also be useful because they are potentially significant under some conditions. For example, a supermarket manager who tries to set up shop in an urban area may wish to know a general tendency for the area. In this case, a general knowledge about the standard goods in the area or about a specified customer is useful for the success of his or her shop. However, when the number of customers begins to increase, the manager cannot meet the needs of the customers by using only the previous knowledge gained. That is, the manager has to react responsively to the changing needs of their customers. However, the needs are not always characterized in the process at the beginning of a change in tendency; rather, in many cases, the needs may not appear as a characteristic one. To detect such uncharacteristic itemset pairs as implicit evidence, we consider an itemset pair as soon as it is seen that its correlation becomes high. If it continues to show high correlation after the change, it can be found by some existing method. On the other hand, it may not always show high correlation if the relation between the itemsets begins to change. As a result, the itemset pair with a large change of correlation but still with a lower correlation is hidden by itemset pairs with high correlation. It is worth marking such itemset pairs as implicit evidences showing that something occurred in recent time because their changes of correlation are very high. Therefore, in this paper, we wish to find itemset pairs with large changes of correlation by conditioning some time stamp compared with a global database with all time stamps. To find such itemset pairs, a notion of DC pairs, which we have already defined in (Taniguchi et al., 2005), can be used. A DC pair is a pair of itemsets such that their degree of difference in correlation before and after some conditioning is very large. We have also improved the algorithm in (Taniguchi and Haraguchi, 2005). Using these algorithms, we try to find useful DC pairs from a family of databases with time stamp data. In the experiment described in the latter section, we examine census data (Ruggles et al., 2004) for 1980, 1990, 2000. In order to simplify our problem, we consider a database containing two time stamps. That is, the 1980 census is combined with the 1990 census, and the global database is the combined databases, the 1980–1990 census. Similarly, the 1990 census is combined with the 2000 census as a global database, the 1990–2000 census. A condition is given the latter time stamp; i.e., the 1990 census in the combined 1980–1990 census and the 2000 census in the combined 1990–2000 census. We show that useful itemset pairs, which may be potentially significant, can be found using the algorithm for finding DC pairs in each combined database. Generally, it is computationally a hard task to find DC pairs, because the degrees of difference of correlations are nonmonotonic with the expansion of the itemsets. For this reason, we consider a restricted problem, given two parameters, ζζ and εε. More precisely, we evaluate the degrees of difference between correlations by a function defined with ζζ and εε, and restrict the DC pairs we are endeavoring to find. Note that a DC pair is syntactically regarded as a compound itemset consisting of two component itemsets. Although various means of mining DC pairs may be considered, we first find candidates for component itemsets, then combine one component with another component, based on our algorithm in (Taniguchi and Haraguchi, 2005). In the first phase, the search space for finding the candidates can be reduced by using a pruning rule depending on εε. On the other hand, in the second phase, we generate a space of paired itemsets by combining candidate component itemsets found in the first phase. As the size of database increases, the number of possible components and their possible combinations becomes larger. So, we need to have some pruning rules to cut off useless combination. For this purpose, we use a simple fact that all the combined itemsets as well as their component itemsets must have non-zero support in the local database. We present a sufficient condition, stated in terms of cooccurrences of items, for combinations violating the support condition to be excluded as useless ones. This paper is an extended version of our work presented at the International Conference on Data Mining and Machine Learning in Pattern Recognition MLDM 2005 (Taniguchi et al., 2005). In Section 2 we describe related work. In Section 3 we define some of the terminology used throughout this paper. In Section 4, we introduce the concept of DC pairs and define our problem of mining DC pairs. An algorithm for finding DC pairs is described in Section 5. Section 6 presents our experimental results. In the final section, we summarize our study.
نتیجه گیری انگلیسی
In this paper, we seek potentially significant itemset pairs from a database with time stamps by using the concept of the DC pair. Given a transaction database DD and its subdatabase DCDC, a pair of itemsets X and Y is called a DC pair if the degree of difference of correlations between X and Y before and after the conditioning DCDC is large to some degree. It should be noted that the correlation is not always high in DCDC, although we can observe some degree of correlation change for DD and DCDC. In this sense, such a pair may not be characteristic in DCDC. Therefore, some DC pairs are considered to have potential characteristics in DCDC. Although the measure of a DC pair is not specifically designed for time series data, we believe that useful DC pairs, which may be potentially significant, can be found from a database with a time stamp. If we formulate the measure of change of correlation for a database with a time stamp, we may be able to find more interesting itemset pairs than those found in this paper. In order to realize this, we have to consider the formulation and an algorithm for finding such itemset pairs efficiently in the future. In order to efficiently find DC pairs, we investigated several pruning mechanisms which can prune useless search nodes (branches) and designed an algorithm adopted them. The computation is divided into two phases. In Phase 1, we can efficiently extract the set of candidates for component itemsets with a look ahead strategy. In Phase 2, then, a restricted pairs of obtained candidates are examined whether they can be DC pairs or not. Our experimental results have also shown effectiveness of the pruning rules in our search.