استخراج سودمندی مجموعه اقلام از پایگاه داده های معامله
کد مقاله | سال انتشار | تعداد صفحات مقاله انگلیسی |
---|---|---|
9133 | 2006 | 24 صفحه PDF |
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Data & Knowledge Engineering, Volume 59, Issue 3, December 2006, Pages 603–626
چکیده انگلیسی
The rationale behind mining frequent itemsets is that only itemsets with high frequency are of interest to users. However, the practical usefulness of frequent itemsets is limited by the significance of the discovered itemsets. A frequent itemset only reflects the statistical correlation between items, and it does not reflect the semantic significance of the items. In this paper, we propose a utility based itemset mining approach to overcome this limitation. The proposed approach permits users to quantify their preferences concerning the usefulness of itemsets using utility values. The usefulness of an itemset is characterized as a utility constraint. That is, an itemset is interesting to the user only if it satisfies a given utility constraint. We show that the pruning strategies used in previous itemset mining approaches cannot be applied to utility constraints. In response, we identify several mathematical properties of utility constraints. Then, two novel pruning strategies are designed. Two algorithms for utility based itemset mining are developed by incorporating these pruning strategies. The algorithms are evaluated by applying them to synthetic and real world databases. Experimental results show that the proposed algorithms are effective on the databases tested.
مقدمه انگلیسی
Frequent itemset mining plays an essential role in the theory and practice of many important data mining tasks, such as mining association rules [1], [2] and [17], long patterns [4], emerging patterns [10] and dependency rules [22]. It has been applied in fields such as telecommunications [3], census analysis [5], and text analysis [22]. An itemset is a set (i.e., a group) of items. The goal of frequent itemset mining is to identify all frequent itemsets, i.e., itemsets that have at least a specified minimum support, the percentage of transactions containing the itemset. The rationale behind using support is that only itemsets with high frequency are of interest to users. The practical usefulness of the frequent itemset mining is limited by the significance of the discovered itemsets. There are two principal limitations. First, a huge number of frequent itemsets that are not interesting to the user are often generated when the minimum support is low. For example, there may be thousands of combinations of products that occur in 1% of the transactions. If too many uninteresting frequent itemsets are found, the user is forced to do additional work to select the itemsets that are indeed interesting. Secondly, support, as defined based on the frequency of itemsets, is not an adequate measure of a typical user’s interest. Suppose that the goal of a sales manager is to find the itemsets that can generate a profit higher than a threshold. The following example shows that support based itemset mining may lead to some most profitable itemsets not being discovered due to their low support. Example 1. Consider the small transaction database shown in Table 1 and the unit profit for the items shown in Table 2. Each value in the transaction database indicates the quantity sold of an item. Using Table 1 and Table 2, the support and profit for all itemsets can be calculated (see Table 3). For example, since for the 10 transactions in Table 1, only two transactions, t8 and t9, include both items B and D, the support of the itemset BD is 2/10 = 20%. Since t8 includes one B and one D, and t9 includes one B and 10 Ds, a total of two Bs and 11 Ds appear in transactions containing the itemset BD. Using Table 2, the profit for each item B is 100 and the profit for each item D is 1. Thus, the profit of the itemsets BD could be considered to be 2 × 100 + 11 × 1 = 211. The profit of the other itemsets in Table 3 can be obtained in a similar fashion. Supposing that the minimum support is 40%, the frequent itemsets in Table 3 are D, A, AD, and C, but the four most profitable itemsets are BD, B, AC, and CD, all of which are infrequent itemsets.Example 1 reveals that the frequent itemset mining approach may not satisfy a sales manager’s goal. In this case, the support measure reflects the statistical correlation of items, but it does not reflect their semantic significance. In other words, statistical correlation may not measure how useful an itemset is in accordance with a user’s preferences (i.e., profit). In example, the profit of an itemset depends not only on the support of the itemset, but also on the prices of the items in that itemset. The above limitations motivated us to develop a utility based itemset mining approach, which allows a user to conveniently express his or her perspectives concerning the usefulness of itemsets as utility values and then find itemsets with utility values higher than a threshold. In utility based itemset mining, utility is a quantitative representation of user preference, and the usefulness of an itemset is quantified in terms of its utility value. Formally, an itemset S is useful to a user if it satisfies a utility constraint, i.e., a constraint in the form u(S) ⩾ minutil, where u(S) is the utility value of itemset S and minutil is a threshold defined by the user. In practice, the utility value of an itemset can be measured in terms of cost, profit, aesthetic value, or other measures of user preference. For Example 1, the utility values of itemsets can be represented by their profits as shown in Table 3. For instance, the utility value u(ABCD) = 144 indicates that the supermarket earns $144 by selling the items A, B, C, and D together. Suppose that the utility constraint is u(S) ⩾ 140, which indicates that only itemsets producing a profit of at least $140 are significant to a supermarket manager. The itemset ABCD is interesting, since it satisfies u(ABCD) ⩾ 140. Thus, a utility constraint measures the significance of an itemset from two sides. One side is the statistical correlation of the itemset measured by support; the other side is the semantic significance of the itemset measured by the user. This combination captures the significance of the itemset in the intended application, and reflects not only the statistical significance but also the semantic significance of the itemset. Recent work [8], [9], [15], [18] and [19] has highlighted the importance of constraint based itemset mining, in which the user is allowed to specify his or her focus by means of constraints that capture the semantic significance of the itemset in the intended application. Various semantics, such as the significance of items [8] and [15] or the significance of transactions [9], are expressed as constraints. However, the constraints used in these approaches are convertible [19]. As explained in more detail in Section 2, a constraint is convertible if whenever an itemset violates a property, so do all prefixes of the itemsets w.r.t. a specified ordering of the items in the itemset. While previous studies of convertible constraints have captured useful aspects of the semantic significance of itemset in the intended applications, other natural constraints that may not be convertible are also useful for expressing more complex aspects of semantic significance, such as significant profit. Due to the efficiency of existing methods for frequent itemset mining and convertible constraint based itemset mining, it may seem worthwhile to investigate whether any of their pruning strategies can also be applied to the utility constraint. Unfortunately, they cannot, as we show in Section 2. The main reason is that as more items are included in an itemset, fewer transactions may happen to include the itemset. Thus, utility constraints may not be convertible. In response, we have developed two efficient pruning strategies for utility constraints. In this paper, utility constraints are characterized by a property giving the upper bound of the utility value of an itemset. Two novel pruning strategies based on this property are presented. Then, we develop an algorithm, called UMining, to find all itemsets with utility values higher than minutil from a database. We also develop a heuristic algorithm, called UMining_H, which typically finds most itemsets with utility values higher than minutil based on a heuristic pruning strategy. The algorithms are evaluated by applying them to synthetic and real world databases. Experimental results show that the proposed algorithms are effective on these databases. The main contribution of this paper is that we propose efficient algorithms to handle utility constraints, a kind of inconvertible constraints that can express types of semantic significance that cannot be handled with existing theories and techniques in itemset mining. More precisely, we propose a utility based itemset mining approach, which allows a user to express his or her preferences concerning itemsets using a utility function, a function that relates specific values in a domain to user preferences. By combining a utility function and a transaction database together, itemsets are given utility values that reflect their importance to the user. Then the significant itemsets, the itemsets with utility values that satisfy the utility constraint, are discovered. As a result, the utility based itemset mining approach can discover a group of itemsets that neither frequent itemset mining nor existing convertible constraint based mining techniques can discover. Our utility based itemset mining approach is beneficial for finding significant itemsets in many applications, including web mining and information retrieval. For example, Table 1 can be regarded as representing a set of webpages for web mining, with each column representing a keyword, each row representing a webpage, and the value in each cell indicating the number of occurrence of that keyword in that webpage. Table 1 can also be considered as a set of documents used in information retrieval [20], where each column represents a term, each row represents a document, and the value in each cell indicates the frequency of a term appearing in a document. Table 2 can be looked at as the user’s preference among these keywords or terms. Using the proposed UMining algorithm, the webpages or documents matching a user’s interest could be discovered. Overall, our utility based itemset mining provides a general framework for weighted itemset mining, where the utility value of each item in an itemset represents the weight. The remaining parts of the paper are organized as follows. In Section 2, related work is described. We describe the utility based itemset mining problem in Section 3. In Section 4, the theoretical foundations of utility constraints are analyzed. Two pruning strategies are presented in Section 5. Section 6 describes the UMining and UMining_H algorithms for utility based itemset mining. In Section 7, our experimental results are presented and analyzed. Finally, in Section 8, conclusions are drawn.
نتیجه گیری انگلیسی
The problem of utility based itemset mining is to discover the itemsets that are significant according to their utility values. In this paper, we first showed that the Apriori property and convertible constraint property are not directly applicable to the problem of utility based itemset mining. As a result, the mathematical properties of the utility value of an itemset were analyzed. Next, two novel pruning strategies were presented to reduce the cost of finding high utility itemsets. With these pruning strategies, a k-itemset with a utility upper bound less than minutil can be pruned immediately without accessing the database to calculate its actual utility value. By exploiting these pruning strategies, the UMining and UMining_H algorithms were developed to provide efficient solutions to the utility based itemset mining problem. The effectiveness of the algorithms was demonstrated by applying them to synthetic and real world databases. The experimental results indicate that the proposed algorithms can discover high utility itemsets efficiently. UMining may be preferable to UMining_H, because it guarantees the discovery of all high utility itemsets. Utility based itemset mining was compared to frequent itemset mining, convertible constraint based mining, and share based mining. Utility constraints are capable of expressing more complex semantics than the support measure, convertible constraints, or the share measure. Previously developed algorithms are more efficient for support based or convertible constraint based mining problems. For utility based mining, the proposed UMining and UMining_H algorithms are more efficient than any previous algorithms. As well, UMining guarantees that all high utility itemsets are found, while the heuristic share based methods may miss many relevant itemsets. Overall, when a utility constraint is required to describe more complex user preferences, the UMining algorithm can efficiently find all useful itemsets from a database, while methods for frequent itemset mining, convertible constraint based mining, and share based mining cannot. Finally, we offer a few suggestions for future research. As given, the UMining algorithm is based on a level-wise approach. Since depth-first search approaches, such as MAFIA [7] and FP-growth [11], have several advantages over level-wise approaches, future research could consider whether our pruning strategies can be incorporated into an approach such as FP-growth. In addition, the problem of how to identify high utility rules from high utility itemsets could be investigated. It would be valuable to identify mathematical properties that would allow our results to be extended from nonnegative utility functions to more general utility functions. Furthermore, following the suggestion of Kleinberg et al. that a discovered pattern is interesting only to the extent that it can be used in the decision-making process of the user to increase utility [12], the actionability of high utility itemsets could be investigated. Actionability is the ability of the selected itemsets to suggest a highly profitable action for the user. We believe that our utility based itemset mining is appropriate for modelling user preference regarding itemsets in many applications and that identifying potential applications would be valuable.