A new method for discovering retail knowledge with price information from transaction databases
Article code | Publication year | Pages in English article |
---|---|---|
9169 | 2008 | 10 pages (PDF) |
Publisher : Elsevier - ScienceDirect
Journal : Expert Systems with Applications, Volume 34, Issue 4, May 2008, Pages 2350–2359
English abstract
With the advances in information technology and the emergence of Internet commerce, analysis of transaction data has become a crucial technique for effective decision-making and strategy formation in business operations. It is especially critical for retail management, in both online and brick-and-mortar stores. Traditional research in mining retail knowledge, however, does not take into account the products’ prices and how such settings can affect potential demand. This paper opens a new research dimension by treating products’ prices as an important decision variable in mining retail knowledge. To the best of our knowledge, the problem addressed in this paper has never been dealt with in existing research papers. We propose a representation scheme to incorporate price information into historical transaction data. An efficient algorithm is developed to “dig” out implicit, yet meaningful, patterns with price information. In addition, an extensive and well-designed experiment is executed, showing that the algorithm is computationally efficient and that the proposed analysis is significant and useful.
English introduction
Data mining extracts implicit, previously unknown, and potentially useful information from databases. According to the classification scheme proposed in Chen, Han, and Yu (1996), major approaches to data mining include mining association patterns, clustering, classification, mining sequential patterns, data generalization and summarization, and traversal pattern analysis. Among them, mining association patterns is probably the most popular because of its widespread applications. This approach was first introduced by Agrawal et al. (1993) and Agrawal and Srikant (1994), and can be stated as follows. Given a database of sales transactions, an association pattern, denoted as X, is a set of items that frequently co-occur in the database. To find association patterns, we first calculate the support of itemset X, defined as the percentage of transactions in the database that contain X. If its support is higher than the user-specified minimum support (minsup), itemset X is frequent; otherwise, it is infrequent. Since association patterns are useful and easy to understand, they have been used in many successful business applications, including finance, telecommunications, marketing, recommendation, retailing, and web analysis (Bose and Mahapatra, 2001, Changchien and Lu, 2001, Chen et al., 2005, Lee et al., 2001, Lin et al., 2003 and Wang and Shao, 2004).
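The support computation described above can be sketched in a few lines of Python. This is only an illustration of the definition, not the algorithm proposed in the paper; the function names `support` and `is_frequent` and the toy transactions are assumptions made for the example.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    target = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if target <= t)
    return hits / len(transactions)


def is_frequent(itemset: Iterable[str],
                transactions: List[FrozenSet[str]],
                minsup: float) -> bool:
    """An itemset is frequent when its support is higher than minsup."""
    return support(itemset, transactions) > minsup


# Toy database, invented for illustration only.
transactions = [
    frozenset({"milk", "cheese", "bread"}),
    frozenset({"milk", "cheese"}),
    frozenset({"milk"}),
    frozenset({"bread"}),
]

# {milk, cheese} appears in 2 of the 4 transactions.
print(support({"milk", "cheese"}, transactions))
print(is_frequent({"milk", "cheese"}, transactions, minsup=0.4))
```

The strict comparison follows the wording "higher than the user-specified minimum support"; some implementations use "greater than or equal" instead.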
The method has also attracted increased research interest, and many extensions have been proposed in recent years, including (1) algorithm improvements (Brin et al., 1997, Chen and Ho, 2005, Han et al., 2000 and Rastogi and Shim, 2002), (2) fuzzy patterns (Chen and Huang, 2005 and Kuok et al., 1998), (3) multi-level patterns (Clementini et al., 2000 and Han and Fu, 1999), (4) quantitative association patterns (Park et al., 1997, Srikant and Agrawal, 1996 and Hsu et al., 2004), (5) spatial association patterns (Clementini et al., 2000 and Koperski and Han, 1995), (6) inter-transaction patterns (Lu, Feng, & Han, 2000), (7) interesting association patterns (Bayardo and Agrawal, 1999 and Freitas, 1999), and (8) temporal association patterns (Ale and Rossi, 2000, Chen et al., 2003, Li et al., 2001 and Roddick and Spiliopoulou, 2002). Chen et al. (1996) and Han and Kamber (2006) give brief literature reviews of association patterns.

Previous research on mining association patterns in transaction databases usually assumed that a transaction is formed from a set of items bought in that transaction (Agrawal et al., 1993 and Agrawal and Srikant, 1994). In other words, the research ignored items’ quantities and prices. Although this assumption is widely used, two difficulties may arise. First, in a practical situation, a transaction records not only the purchased items but also their quantities and prices. Therefore, if we view a transaction as only a set of items, a large portion of the stored data is unused. Second, the association patterns found in conventional transaction databases only indicate whether items are related; they do not tell us their quantity and/or price relationships. Without the quantity and price information, it is difficult to design a competitive package for sales promotions because we do not know how the prices and quantities of items influence one another. For example, we may have an association pattern, such as {milk, cheese}.
This pattern would be more informative if it were more specific, such as {milk with high price, cheese with medium price}. The former only indicates that these two items are frequently bought together, but the latter tells us that this association happens when milk is at a high price and cheese at a medium price.

Some readers may wonder why we did not simply view price and quantity as numerical attributes and use methods for mining quantitative association patterns to deal with them (Hsu et al., 2004, Park et al., 1997 and Srikant and Agrawal, 1996). Using this method, we can partition the prices or quantities of milk and cheese into multiple intervals. For example, we can partition price into five levels, where these five levels represent the list price, 0–10% off the list price, 10–20% off the list price, 20–30% off the list price, and more than 30% off the list price. Consequently, we can have a pattern like {milk with price level 2, cheese with price level 3}. Although this idea seems reasonable, it may result in the following problem. Suppose the time span of the database is 12 months, and the prices of milk and cheese are at levels 2 and 3, respectively, only in June and July. Further assume that there are 200,000 total transactions in the database, and 50,000 of those transactions occurred in June and July. If there were 2,000 transactions in June and July containing milk and cheese, what is the support of the pattern {milk with price level 2, cheese with price level 3}? Obviously, the answer should be 2,000/50,000 = 4%, rather than 2,000/200,000 = 1%. This is because [June, July] is the only period when this pattern can possibly occur, so the support computation should be based on these two months rather than the entire year. The above discussion reveals that we need a new method of defining the supports of patterns with price information.
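The June/July arithmetic can be reproduced with a small Python sketch that restricts the denominator to the period in which a price combination was in effect. The paper's formal definition is not reproduced in this excerpt, so the `Transaction` layout, the function names, and the idea that the caller supplies the active months are all assumptions. The data is the example above scaled down by a factor of 1,000.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Transaction:
    month: int                    # calendar month of the sale, 1..12
    price_levels: Dict[str, int]  # item -> price level in effect for that sale


def matches(pattern: Dict[str, int], t: Transaction) -> bool:
    """True when the transaction contains every item at the stated price level."""
    return all(t.price_levels.get(item) == level for item, level in pattern.items())


def global_support(pattern: Dict[str, int], db: List[Transaction]) -> float:
    """Conventional support: the denominator is the whole database."""
    if not db:
        return 0.0
    return sum(matches(pattern, t) for t in db) / len(db)


def period_support(pattern: Dict[str, int],
                   db: List[Transaction],
                   active_months: Set[int]) -> float:
    """Support restricted to the months in which the price combination applied."""
    in_period = [t for t in db if t.month in active_months]
    if not in_period:
        return 0.0
    return sum(matches(pattern, t) for t in in_period) / len(in_period)


# Scaled-down example: 200 transactions, 50 in June/July, 2 matching the pattern.
pattern = {"milk": 2, "cheese": 3}
db = (
    [Transaction(6, {"milk": 2, "cheese": 3}) for _ in range(2)]
    + [Transaction(7, {"bread": 1}) for _ in range(48)]
    + [Transaction(1, {"bread": 1}) for _ in range(150)]
)

print(global_support(pattern, db))          # denominator: all 200 transactions
print(period_support(pattern, db, {6, 7}))  # denominator: the 50 June/July transactions
```

Passing the active months explicitly keeps the sketch short; a real implementation would presumably derive that period from the price history of each item.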
Accordingly, a new type of support, called local support, is proposed to measure the frequency of itemsets with price labels. There are still other problems, however, that we may encounter. Suppose we have three combinations of milk and cheese prices as follows: {milk: p-level 2, cheese: p-level 3}, {milk: p-level 2, cheese: p-level 1}, and {milk: p-level 3, cheese: p-level 3}. Suppose their local supports are 4%, 8%, and 2%, respectively. It is a reasonable conjecture that {milk: p-level 2, cheese: p-level 1} may be good for sales and {milk: p-level 3, cheese: p-level 3} may not, because the former seems to increase sales while the latter decreases sales. For good or bad, they both represent important information that deserves further analysis. This simple example illustrates that we need a new way to define important patterns. In the past, a pattern with high frequency was deemed important. What we are interested in now, however, are those patterns with frequencies deviating substantially from the average, either larger or smaller. In addition to price information, we have to include other important information in the patterns, such as quantity information. Therefore, we define the new patterns with price and quantity information, such as {milk: p-level 2, average-qty 2.3; cheese: p-level 1, average-qty 3.5}. This means that when the price levels of milk and cheese are 2 and 1, respectively, the average quantities of milk and cheese in transactions are 2.3 and 3.5, respectively. By comparing all the patterns with the same product combination, we can understand how items’ prices influence one another and how items’ prices influence quantities. 
For example, assume that we have the following patterns:

{milk: p-level 2, average-qty 2.4; cheese: p-level 3, average-qty 3.3}, sup = 4%
{milk: p-level 2, average-qty 2.3; cheese: p-level 1, average-qty 3.5}, sup = 8%
{milk: p-level 1, average-qty 1.4; cheese: p-level 1, average-qty 2.0}, sup = 2%

Using the first pattern as the basis for comparison, we find that the price levels set in the second pattern may increase the frequency of purchase (the support increases from 4% to 8%) but leave the average purchasing quantity nearly unchanged. On the other hand, the third pattern decreases not only the purchasing frequency but also the average purchasing quantity.

This paper defines two new metrics to measure a pattern’s level of interest. The frequency strength (FS) measures whether a pattern’s price levels affect its frequency significantly when compared to the average frequency of the same product combination. When FS is greater than 1, the larger the value, the more likely it is that the pattern’s price levels increase the frequency. When FS is smaller than 1, the smaller the value, the more likely it is that the price levels decrease the frequency. The second metric is the quantity strength (QS), which measures whether a pattern’s price levels affect its quantity significantly when compared to the average purchasing quantity. If QS is greater than 1, the average quantity in transactions of that price combination is greater than the average quantity of the same product combination. If QS is smaller than 1, the average quantity is lower. With these two metrics, the interesting patterns we want to find are those whose price levels result in either a large or a small value of FS, QS, or both.
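One plausible reading of FS and QS is a simple ratio of a pattern's value to a baseline for the same product combination. The sketch below follows that reading using the three {milk, cheese} patterns from the text. The formal definitions are not reproduced in this excerpt, so both the ratio form and the choice of baseline (an unweighted mean over the listed patterns) are assumptions, and `strength` is a hypothetical helper name.

```python
from statistics import mean


def strength(value: float, baseline: float) -> float:
    """Ratio of a pattern's value to the baseline for the same product combination."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return value / baseline


# Values taken from the three {milk, cheese} patterns in the text.
supports = [0.04, 0.08, 0.02]
milk_avg_qty = [2.4, 2.3, 1.4]

# Assumption: the baseline is the unweighted mean over the listed patterns.
fs = [strength(s, mean(supports)) for s in supports]
qs_milk = [strength(q, mean(milk_avg_qty)) for q in milk_avg_qty]

for i, (f, q) in enumerate(zip(fs, qs_milk), start=1):
    print(f"pattern {i}: FS={f:.2f}, QS(milk)={q:.2f}")
```

Under these assumptions the second pattern has FS above 1 and the third has both FS and QS below 1, which matches the qualitative comparison in the text.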
After finding the interesting patterns, we then asked domain experts to help explain the patterns and find the true reasons underlying the phenomenon.

The rest of the paper is organized as follows. We formally define the problem in Section 2 and propose an algorithm in Section 3. In Section 4, we discuss how to conduct a series of analyses based on the two proposed metrics, FS and QS. Section 5 evaluates performance using a real dataset. Conclusions are drawn in Section 6.
English conclusion
With advances in information technology and Internet commerce, data-driven analysis techniques have become essential for decision-making and strategy formation in business operations. They are especially critical for retail management, in both online and brick-and-mortar stores. Traditional research on mining retail knowledge focuses on assortment planning, demand correlation analysis, and customers’ shopping behavior analysis. It does not take into account the prices of products or how price setting can affect potential demand. This paper opens a new research dimension by treating prices of products as an important factor in mining retail knowledge. We proposed a representation scheme to represent the price levels of products. Based on this, a novel algorithm was applied to “dig” out good and/or bad child patterns. Next, we defined two new metrics, FS and QS, to measure a pattern’s relevancy. Based on these two measures, all discovered child patterns can be classified into nine categories, some of which are positive for sales, some neutral, and some negative. Furthermore, we proposed a statistical method to determine whether parent patterns are sensitive to price changes. Finally, we used a real dataset from a retail chain store to perform an extensive experiment. The results indicated that the algorithm is computationally efficient and that the discovered patterns can produce satisfactory results. To the best of our knowledge, the problem addressed in this paper has never been dealt with in existing research papers on data mining; no other research considers price information in the historical transaction data. This paper has pioneered research in this area, but there are some critical limitations. First, how the obtained patterns can be used to determine the prices of products remains unclear and needs further study. Second, if a product is new or its price seldom changes, how can we determine whether it is price-sensitive? These questions should be further investigated in future research.