یک الگوریتم کارامد برای استخراج مجموعه آیتم درون معامله ای بسته
کد مقاله | سال انتشار | تعداد صفحات مقاله انگلیسی |
---|---|---|
9188 | 2008 | 24 صفحه PDF |
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Data & Knowledge Engineering, Volume 66, Issue 1, July 2008, Pages 68–91
چکیده انگلیسی
In this paper, we propose an efficient algorithm, called ICMiner (Inter-transaction Closed patterns Miner), for mining closed inter-transaction itemsets. Our proposed algorithm consists of two phases. First, we scan the database once to find the frequent items. For each frequent item found, the ICMiner converts the original transaction database into a set of domain attributes, called a dataset. Then, it enumerates closed inter-transaction itemsets using an itemset–dataset tree, called an ID-tree. By using the ID-tree and datasets to mine closed inter-transaction itemsets, the ICMiner can embed effective pruning strategies to avoid costly candidate generation and repeated support counting. The experiment results show that the proposed algorithm outperforms the EH-Apriori, FITI, ClosedPROWL, and ITP-Miner algorithms in most cases.
مقدمه انگلیسی
Mining association rules, which is a fundamental problem in the area of data mining, has been extensively studied in recent years [1], [4], [9], [16], [18], [20], [21], [23], [29], [31], [32], [33], [38], [40], [46] and [49]. Traditional association rule mining algorithms focus on association rules among itemsets within a transaction. Taking stock market databases as an example, association rule mining can be used to analyze the share price movements. Suppose a database records the price of every stock at the end of each trading day, an association rule might be: “if the stock prices of Microsoft and IBM go up, the price of Apple is likely to go up on the same day.” This classical association rule expresses the associations among items within the same transaction, thus we call it intra-transactional association rule. However, the traditional approaches cannot capture a rule like: “if the stock prices of Microsoft and IBM go up, the price of Apple is likely to go up two days later.” This rule represents some association relationships among the itemsets from different transactions, thus we call it inter-transaction association rule. A number of methods have been proposed for mining inter-transaction association rules [8], [10], [12], [22], [25], [26], [28] and [41]. Lu et al. [25] first used inter-transaction association rules to predict stock market movements. Subsequently, two algorithms, E-Apriori and EH-Apriori, for finding frequent inter-transaction itemsets were proposed in Ref. [26], where EH-Apriori adopts an additional pruning technique used in Ref. [34]. Feng et al. [10] used several optimization techniques, namely, joining, converging, and speeding, to enhance the EH-Apriori algorithm for mining inter-transactional association rules under rule templates. Then, Tung et al. [41] proposed the FITI algorithm, which is implemented in two phases. First, the frequent intra-transaction itemsets are discovered. Then, these itemsets are used to form the frequent inter-transaction itemsets. More recently, Lee et al. [22] proposed an algorithm, called ITP-Miner, which uses an ITP-tree to mine all frequent inter-transaction itemsets in a depth-first search manner. It has been shown that the ITP-Miner algorithm outperforms the previous inter-transaction mining algorithms. Besides, Li et al. [24] extended the inter-transaction association rules to a more general form of association rules, called generalized multidimensional inter-transactional association rules, which expand rule contexts from point-wise (e.g. two days latter) to scope-wise (e.g. within three days). For example, such generalized inter-transaction association rules could be: “after McDonald and Burger King open branches, KFC will open a branch within two months and between one and three miles away.” In inter-transaction itemsets mining, there are a large number of frequent itemsets and the mining process could be extremely time-consuming. Thus, we incorporate the concept of closed itemsets into inter-transaction itemsets mining. That is, we only mine closed inter-transaction itemsets, instead of all frequent itemsets. A frequent itemset X is said to be closed if the database does not contain a superset of X with equal support [15], [35], [37], [39], [43] and [47], where the support of an itemset is defined as the number of transactions containing the itemset in the database. Generally speaking, mining closed itemsets is more efficient than mining a complete set of frequent itemsets. Therefore, we will mine closed ones in our proposed method. In the last decade, many algorithms have been proposed for mining closed frequent itemsets [11], [15], [27], [29], [35], [36], [37], [39], [42], [43] and [47]. These algorithms can be classified into four categories [44], namely “test-and-generate”, “divide-and-conquer”, “hybrid”, and “hybrid without duplication”. The “test-and-generate” category includes CLOSE [36], and A-close [35] algorithms, which stress on the optimization of a level-wise process to discover the closed itemsets. The “divide-and-conquer” category includes CLOSET [37], TFP [15], CLOSET+ [43], and FP-CLOSE [11] algorithms, which use the highly compact data structure FP-tree (Frequent-Pattern tree) [14] within the closed itemset discovery process. Pei et al. [37] devised an algorithm, called CLOSET, which uses an FP-tree and a partitioning technique to mine frequent patterns. Subsequently, Han et al. [15] developed the TFP algorithm, for mining top-k closed frequent patterns without a minimum support constraint. Wang et al. [43] integrated more effective strategies and proposed an FP-tree-based method called CLOSET+ that builds conditional projected databases in two ways: bottom-up and top-down. FP-CLOSE [11] is a variant of CLOSET+ for mining closed itemsets. The “hybrid” category includes CHARM [47], and CloseMiner [39] algorithms, which use properties of both “test-and-generate” and “divide-and-conquer” techniques. Zaki et al. [47] proposed an algorithm, called CHARM, which uses an itemset-tidset tree and four pruning properties to enumerate all closed itemsets. Based on the CHARM algorithm, Singh et al. [39] proposed an algorithm, called CloseMiner, which transforms the problem of discovering closed itemsets into a problem of clustering the itemsets with closed tidsets. The “hybrid without duplication” category includes DCI-CLOSED [27], LCM [42], and PGMiner[30] algorithms, which can avoid the main drawback of the “hybrid” algorithms. These algorithms do not need to store, in main memory, the closed itemsets previously mined since they do not require performing the subsumption checking. The DCI-CLOSED [27] and LCM [42] algorithms traverse the search space in a depth-first search manner. The differences between DCI-CLOSED and LCM are the strategies for finding the closure and the data structures for storing the context extracted. The PGMiner [30] constructs a prefix graph structure and decomposes the database to bit vectors of variable lengths, which are assigned to the nodes of the prefix graph. Then, it uses both inter-node and intra-node pruning strategies to mine closed patterns. To mine closed inter-transaction itemsets, Huang et al. [17] proposed the ClosedPROWL algorithm, which uses closed intra-transaction itemsets to form closed inter-transaction itemsets. The algorithm is implemented in two phases. First, the CHARM algorithm [47] is used to discover closed intra-transaction itemsets. Then, the itemsets discovered are used to form closed inter-transaction itemsets. However, ClosedPROWL can not efficiently prune non-closed itemsets during the mining process. To resolve this problem, we propose an approach called ICMiner that efficiently mines closed inter-transaction itemsets. Our proposed algorithm consists of two phases. First, we scan the database to find all frequent items. For each frequent item found, we scan the database to find a set of domain attributes (called a dataset) containing the frequent item. Next, the ICMiner constructs an ID-tree to generate the closed inter-transaction patterns in a depth-first search (DFS) manner. By using the ID-tree and datasets, the effective pruning strategies can be embedded to avoid costly candidate generation and repeated support counting. Therefore, the ICMiner algorithm can efficiently mine the closed inter-transaction patterns. The remainder of the paper is organized as follows. We present the problem definition in Section 2, and describe the proposed algorithm in Section 3. Then, we analyze the time and space complexities of the ICMiner and the comparing algorithms in Section 4, and discuss the performance analysis in Section 5. Finally, we provide some concluding remarks in Section 6.
نتیجه گیری انگلیسی
We have proposed an efficient algorithm, called ICMiner, for mining closed inter-transaction itemsets. By using the ID-tree, we can efficiently mine the closed inter-transaction patterns in a depth-first search manner. Moreover, we can avoid generating many frequent but non-closed patterns by using pruning strategies for closed patterns. Thus, our proposed algorithm can efficiently mine closed inter-transaction patterns. The performance study on the synthetic, real, and “worst case” datasets shows that the ICMiner algorithm is more efficient than the EH-Apriori, FITI, ClosedPROWL, and ITP-Miner algorithms in most cases. In the future work, we will address a number of research issues related to the ICMiner algorithm. First, we can extend our algorithms to mine the generalized inter-transaction association rules [24], which can expand rule contexts from point-wise (e.g. two days latter) to scope-wise (e.g. within three days). Second, in the future, we may develop efficient method to mine other concise representations of frequent patterns, such as non-derivable itemsets [6], essential itemsets [7], etc. Third, in this paper, we have only discussed the problem of 1-dimensional closed inter-transaction itemsets, so we will investigate the problem of n-dimensional closed inter-transaction itemsets. Fourth, since the closed inter-transaction itemsets mining may draw an overwhelming number of rules, we can put the focus on the generic bases of association rules [2], which consist of non-redundant association rules by using closed frequent itemsets and their generators. Therefore, we may extend the ICMiner algorithm to keep track of the set of closed patterns and their associated minimal generators during the mining process. Fifth, we will apply our proposed algorithm to other domains, such as marketing data and web access logs, to assess its usability. Finally, without generalization, too many patterns may be mined and they may be too detailed. However, by generalizing with a concept hierarchy, we may be able to obtain patterns or rules that are more abstract and meaningful.