به دنبال به دانه هایی از زمان: کشف الگوهای زمانی در مجموعه های معامله بزرگ
کد مقاله | سال انتشار | تعداد صفحات مقاله انگلیسی |
---|---|---|
9113 | 2006 | 29 صفحه PDF |
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Information Sciences, Volume 176, Issue 8, 22 April 2006, Pages 1003–1031
چکیده انگلیسی
This paper studies the problem of mining frequent itemsets along with their temporal patterns from large transaction sets. A model is proposed in which users define a large set of temporal patterns that are interesting or meaningful to them. A temporal pattern defines the set of time points where the user expects a discovered itemset to be frequent. The model is general in that (i) no constraints are placed on the interesting patterns given by the users, and (ii) two measures—inclusiveness and exclusiveness—are used to capture how well the temporal patterns match the time points given by the discovered itemsets. Intuitively, these measures indicate to what extent a discovered itemset is frequent at time points included in a temporal pattern p, but not at time points not in p. Using these two measures, one is able to model many temporal data mining problems appeared in the literature, as well as those that have not been studied. By exploiting the relationship within and between itemset space and pattern space simultaneously, a series of pruning techniques are developed to speed up the mining process. Experiments show that these pruning techniques allow one to obtain performance benefits up to 100 times over a direct extension of non-temporal data mining algorithms.
مقدمه انگلیسی
A large amount of data is collected every day in the form of event time sequences. Common examples are customer transactions in supermarkets, logs of network connections, bank transactions, or events related to manufacturing in industry. These sequences are valuable sources to analyze not only the frequency of certain events, but also the temporal patterns with which the events happen. For example, from data consisting of web clicks, one may discover that a large number of web browsers that visit washingtonpost.com in morning hours also visit cnn.com. As another example, in market basket analysis, a supermarket manager may discover that turkey and pumpkin pie are frequently sold together in November. Discovering those patterns may reveal interesting information that can be used for understanding of the behavior of customers, market, or the monitored process in different time periods. However, these types of patterns cannot be discovered by traditional non-temporal data mining approaches that treat all the data as one large segment, with no attention paid to utilizing the time information of the transactions. Returning to the supermarket example, suppose the data set consists of sales data over several years. If look into the entire data set rather than the transactions that occur in November, it is likely that one will not be able to discover the pattern of turkey and pumpkin pie since the overall support for them (i.e., the percentage of all transactions that contain the items turkey and pumpkin pie) will be evidently small. Mining temporal patterns together with frequent itemsets (or events) is challenging since the sizes of itemset space and temporal pattern space are extremely large in practice. In the supermarket example, the itemset space consists of all combinations of items (i.e., itemsets) and the pattern space all subsets of time points which are considered in data mining. Given 1000 items and 365 time points (e.g., days), the size of itemset space and pattern space are 21000 − 1 and 2365 − 1, respectively. Although mining frequent itemsets alone has been extensively studied in the past ten years (see, e.g., [1], [18], [5], [20], [7], [9], [14], [11] and [3]), mining temporal patterns together with frequent itemsets has not attracted much effort until recently (see Section 6 for details). One focus in this area is to consider specific types of temporal patterns instead of arbitrary patterns. Indeed, arbitrary patterns do not give much intuitive meaning in practice. However, the techniques that have been developed for mining one type of temporal patterns cannot be directly applied to mining other types of patterns, which significantly restricts the application of such techniques. For example, the approach to mining periodic patterns cannot be used for mining other types of patterns with no required periodicity. Another focus of previous work has been to consider user-defined patterns and discover frequent itemsets for each of those patterns. Since the temporal patterns that are defined by users are treated individually in data mining, the corresponding algorithms may not be equally efficient if a fairly large number of temporal patterns are considered and when the frequent itemsets and temporal patterns are to be mined at the same time. This paper precisely formulates the problem of mining temporal patterns for frequent itemsets in large transaction sets. In the formulation, it is up to the users to define a set of temporal patterns that are interesting or meaningful to them for data mining. The set of interesting patterns may include any pattern that has been studied in the literature. Section 2.3 will give some examples of such patterns including periodically-repetitive patterns, calendar-based patterns, and the patterns that are recursively defined by some algebraic operations. In the problem formulation, two measures are defined to capture how well a temporal pattern matches the time points given by a discovered itemset. Intuitively, these measures indicate to what extent a discovered itemset is frequent at time points included in a temporal pattern p, but not at time points not in p. Based on these two measures, different notions of match are used to describe the relationship between an itemset e and an interesting pattern p. A full match means that e is frequent at each time point defined by p; a relaxed match requires that e be frequent at most of the time points defined by p. Notice that these two types of match do not describe whether e is frequent at the time points that are not defined by p. In real-life, for example, when people say that the turkey and pumpkin pie are frequently sold together in November, they usually mean that the items are not frequently sold in other months. Exclusive match is used to model this phenomena. Informally, the exclusive match means that e is frequent at most of the time points defined by p but at few time points not defined by p. To deal with a large number of interesting patterns, it is crucial to explore both itemset space and pattern space in order to quickly identify which itemsets are frequent along with which temporal patterns, without having to enumerate many unrelated itemsets and patterns. We generalize the optimization techniques that have been used in previous work and develop some new ones for mining a set of arbitrary patterns for frequent itemsets in large transaction sets. The main contribution of this paper is that we generalize the previous work in several directions. First, we formulate the problem of mining a set of arbitrary temporal patterns along with frequent itemsets in large transaction sets. To be general, no restrictions are placed on the temporal patterns. This is important since many real applications involve mining a mix of many types of patterns, some of which may be arbitrary. Second, we define two new measures for improved capturing of the relationship between the actual pattern of an itemset and an interesting temporal pattern. By these two measures, we are able to model many temporal data mining problems appeared in the literature, as well as those that have not been studied. Third, we develop a series of pruning techniques to reduce the amount of work in data mining. We generalize existing techniques to explore the itemset space and the pattern space simultaneously. We prove the correctness of our techniques and demonstrate their effectiveness through extensive experiments. The rest of the paper is organized as follows. In Section 2, the problem of mining temporal patterns is formally formulated. In Sections 3 and 4, a series of pruning techniques and algorithms are developed to solve the data mining problem. In Section 5, experimental results are presented to evaluate our algorithms. In Section 6, related work is discussed and compared with our work. Finally, in section 7, the paper is concluded with discussions about some future directions.
نتیجه گیری انگلیسی
This paper presented a general model for mining of temporal patterns along with frequent itemsets in large transaction sets. The model allows users to define any types of temporal patterns and choose any of the four types of matches. By exploiting the relationship between itemset space and pattern space, a series of pruning techniques are devised so as to speed up the mining process. Our experiments show performance benefits up to 100 times over a less sophisticated approach. The proposed early-pruning technique is designed to prune out those itemset-pattern pairs that do not satisfy the inclusiveness requirement at the early stage. In the future, we plan to develop similar technique for pruning those that do not satisfy the exclusiveness requirement. The incorporation of such early pruning techniques may further improve the performance of our algorithm. Mining temporal patterns involves investigating not only large itemset space and pattern space, but also a large amount of data collected in a long history. It is thus crucial to develop parallel or distributed algorithms for large scale data mining. It would also be interesting to devise online and incremental algorithms for this problem.