روش داده کاوی بهبود یافته با استفاده از موارد پیش بینی شده
کد مقاله | سال انتشار | تعداد صفحات مقاله انگلیسی |
---|---|---|
21427 | 2009 | 9 صفحه PDF |
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Expert Systems with Applications, Volume 36, Issue 1, January 2009, Pages 72–80
چکیده انگلیسی
In this paper, we present a mining algorithm to improve the efficiency of finding large itemsets. Based on the concept of prediction proposed in the (n, p) algorithm, our method considers the data dependency in the given transactions to predict promising and non-promising candidate itemsets. Our method estimates for each level a different support threshold that is derived from a data dependency parameter and determines whether an item should be included in a promising candidate itemset directly. In this way, we maintain the efficiency of finding large itemsets by reducing the number of scanning the input dataset and the number candidate items. Experimental results show our method has a better efficiency than the apriori and the (n, p) algorithms when the minimum support value is small.
مقدمه انگلیسی
Years of effort in data mining have produced a variety of efficient techniques (Chen, Han, & Yu, 1996). Depending on the types of datasets processed, mining approaches may be classified as working on transaction datasets, temporal datasets, relational datasets, or multimedia datasets, among others. On the other hand, depending on the classes of knowledge derived, mining approaches may be classified as finding association rules, classification rules, clustering rules, or sequential patterns (Agrawal & Srikant, 1995), etc. Among these techniques, finding association rules from transaction datasets is usually an essential task (Agrawal et al., 1993b, Agrawal and Srikant, 1994, Agrawal et al., 1997, Ezeife, 2002, Han and Fu, 1995, Mannila et al., 1994, Park et al., 1997, Srikant and Agrawal, 1995, Srikant and Agrawal, 1996 and Wojciechowski and Zakrzewicz, 2002). Many algorithms for mining association rules from transactions are proposed, most of which are executed in level-wise processes. That is, itemsets containing single items are processed first, then itemsets with two items are processed. The process was repeated, continuously adding one more item each time, until some criteria are met. The famous apriori mining algorithm was proposed by Agrawal et al., 1993a and Agrawal et al., 1993b. The apriori iterates two phases, the phase of candidate generation and the phase of verification. Possible large itemsets are produced in the first phase and verified in the second phase by scanning the input dataset. Since itemsets are processed level by level and datasets had to be scanned in each level, the verification phase thus dominates the performance. Han, Pei, and Yin (2000) then proposed the Frequent-Pattern-tree (FP-tree) structure for efficiently mining association rules without generation of candidate itemsets. The FP-tree was used to compress a database into a tree structure which stored only large items. Several other algorithms based on the FP-tree structure have also been proposed. For example, Qiu, Lan, and Xie (2004) proposed the QFP-growth mining approach to mine association rules. Zaiane and Mohammed (2003) proposed the COFI-tree structure to replace the conditional FP-tree. Ezeife (2002) constructed a generalized FP-tree, which stored all the large and non-large items, for incremental mining without rescanning databases. Koh and Shieh (2004) adjusted FP-trees also based on two support thresholds, but with a more complex adjusting procedure and spending more computation time than the one proposed in this paper. Denwattana and Getta (2001) proposed an algorithm (referred to as the (n, p) algorithm) to reduce the numbers of scanning input datasets for finding large itemsets. The (n, p) algorithm also iterates two phases, the phase of prediction and the phase of verification. Unlike the apriori, the (n, p) algorithm predicts large itemsets for p-levels in the first phase and verifies all these p-level itemsets in the second phase. A heuristic estimation method is presented to predict the possibly large itemsets. If the prediction was valid, then the approach is efficient in finding the actually large itemsets. In this paper, we propose a mining algorithm to improve the efficiency of finding large itemsets. Our approach is based on the concept of prediction presented in the (n, p) algorithm and considers the data dependency among transactions. As the (n, p) algorithm does, our method iterates the same two phases but uses a new estimation method to predict promising and non-promising candidate itemsets flexibly. The estimation mechanism computes for each level a different support threshold derived from a data dependency parameter and determines whether an item should be included in a promising candidate itemset directly by the support values of items. Since we reduce the number of candidate itemsets to be verified by the new estimation mechanism and the number of scanning of the input dataset by the concept of prediction of the (n, p) algorithm, the performance of finding large itemsets can be improved. The rest of this paper is organized as follows. Section 2 presents the related works of finding large itemsets. The apriori algorithm and the (n, p) algorithm are reviewed. In Section 3, we describe our motivation and the theoretical foundation of our method. Detailed description of our algorithm is given in Section 4 together with a simple example. Experimental results and the comparison on the performance of the apriori, the (n, p) algorithm, and our method are shown in Section 5. Conclusions are given in Section 6.
نتیجه گیری انگلیسی
From the experimental results, different values of data dependency will cause the same large itemsets, but different predictive effects. When w = 1, the non-promising candidate sets are predicted very well, but the promising candidate sets are predicted badly; and vice versa for w = 0. By default, we set w = 0.5. If the data dependency relationships in transactions can be well utilized, our method can improve the overall performance of finding large itemsets. In our experiments, both the (n , p ) algorithm and ours suffer from the inefficiency of generating View the MathML sourceCi+1+ from View the MathML sourceCi+. When there are many items in the dataset, e.g., 25 items in D1 ∼ D6, and more levels of transactions to be considered, more computation is needed in both algorithms. However, our method provides a more accurate approach for predicting itemsets and obtains a better performance than the (n, p) algorithm, especially when p > 2. In this paper, we have presented a mining algorithm that combines the advantages of the apriori and the (n, p) algorithm in finding large itemsets. As the (n, p) algorithm does, our algorithm reduces the number of scanning datasets for finding p levels of large itemsets. A new parameter that considers data dependency is included in our method for early filtering out the itemsets that are possibly of lower supports and thus improves the computational efficiency. We also conclude that the three algorithms can compete with each other and gain the best performance on different types of datasets. There need more studies on how to tune the parameters, such as n, p, and transaction threshold in the (n, p) algorithm and w, t in ours, before the mining task is performed.