Segmentation approaches in data mining: A comparison of RFM, CHAID, and logistic regression
Article code | Publication year | Pages in English article |
---|---|---|
22096 | 2007 | 7 pages (PDF) |
Publisher : Elsevier - ScienceDirect
Journal : Journal of Business Research, Volume 60, Issue 6, June 2007, Pages 656–662
Abstract
Direct marketing has become more efficient in recent years because of the use of data-mining techniques that allow marketers to better segment their customer databases. RFM (recency, frequency, and monetary value) has been available for many years as an analytical technique. In recent years, more sophisticated methods have been developed; however, RFM continues to be used because of its simplicity. This study investigates RFM, CHAID, and logistic regression as analytical methods for direct marketing segmentation, using two different datasets. It is found that CHAID tends to be superior to RFM when the response rate to a mailing is low and the mailing would be to a relatively small portion of the database; however, RFM is an acceptable procedure in other circumstances. The present article addresses the broader issue that RFM may focus too much attention on transaction information and ignore individual difference information (e.g., values, motivations, lifestyles) that may help a firm to better market to its customers.
Introduction
Segmentation in direct marketing has become more efficient in recent years because of the development of database marketing techniques. These data-mining approaches provide direct marketers with better ways to segment their current customers and develop marketing strategies tailored to particular segments and/or individuals. In recent years, database marketing techniques have evolved from simple RFM models (models involving recency of customer purchases, frequency of their purchases, and the amount of money they have spent with the firm) to statistical techniques such as chi-square automatic interaction detection (CHAID) and logistic regression. More recently, neural network models have been employed in the database marketing arena (Yang, 2004). In spite of recent statistical advances in data-mining, marketers continue to employ RFM models. A study by Verhoef et al. (2002) shows that RFM is the second most common method used by direct marketers, after cross tabulations, in spite of the availability of more statistically sophisticated methods. There are a couple of related reasons for the popularity of RFM. As Kahan (1998) notes, RFM is easy to use and can generally be implemented very quickly. Furthermore, it is a method that managers and decision makers can understand (Marcus, 1998). This is an important consideration in that a successful technique for a direct marketer is one that differentiates likely responders to a particular mailing from those who are unlikely to respond, yet does so in a way that is easy to explain to decision makers. However, it has been argued that the simplicity of RFM has been overemphasized, while its ability to differentiate, relative to statistical techniques, has not been considered to the extent that it should be (Yang, 2004). Although the efficiency of RFM has been questioned, little research documents its ability relative to newer statistical techniques.
This paucity of research is partly because RFM refers to a general approach to data-mining; there are a variety of ways of applying recency, frequency, and monetary value. Research that has been conducted on the efficacy of RFM generally focuses on proprietary or judgmental models of RFM (e.g., Levin and Zavari, 2001; Magidson, 1988) and not on empirically based RFM models. More recently, research has moved away from RFM and has focused instead on newer, more sophisticated approaches to data-mining (cf. Deichmann et al., 2002; Linder et al., 2004). The current study evaluates one popular, empirically based (as opposed to judgmental) approach to RFM. This RFM approach is compared to CHAID and logistic regression, in an effort to understand its capabilities as a database marketing analytical tool.
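To make the idea of an empirically based RFM model concrete, the following is a minimal Python sketch of one way to assign customers to RFM cells. It rests on assumptions not stated in the text above: each of the three variables is coded independently into quintiles (1 to 5, with 5 the most attractive), which yields up to 5 × 5 × 5 = 125 cells. The tuple layout, the function names `quintile_codes` and `rfm_cells`, and the sample customers are all hypothetical, and the sketch is not presented as the exact procedure evaluated in the study.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (id, days since last purchase, purchase count, total spent)
Customer = Tuple[str, int, int, float]

def quintile_codes(values: List[float], higher_is_better: bool) -> List[int]:
    """Return a code from 1 to 5 per value; 5 marks the most attractive fifth."""
    n = len(values)
    # Order positions from least attractive to most attractive.
    order = sorted(range(n), key=lambda i: values[i], reverse=not higher_is_better)
    codes = [0] * n
    for rank, i in enumerate(order):
        codes[i] = rank * 5 // n + 1  # ranks 0..n-1 map onto codes 1..5
    return codes

def rfm_cells(customers: List[Customer]) -> Dict[str, str]:
    """Map each customer id to a three-digit cell label such as '545'."""
    # Fewer days since the last purchase is better, so recency is reversed.
    r = quintile_codes([c[1] for c in customers], higher_is_better=False)
    f = quintile_codes([c[2] for c in customers], higher_is_better=True)
    m = quintile_codes([c[3] for c in customers], higher_is_better=True)
    return {c[0]: f"{r[i]}{f[i]}{m[i]}" for i, c in enumerate(customers)}

# Illustrative data only.
customers = [
    ("a", 10, 12, 900.0),
    ("b", 400, 1, 20.0),
    ("c", 35, 6, 310.0),
    ("d", 120, 3, 150.0),
    ("e", 200, 2, 60.0),
]
cells = rfm_cells(customers)
print(cells)
```

In a test mailing, the response rate observed in each cell would then be used to decide which cells to mail in a later rollout. The cell boundaries here are fixed in advance rather than chosen by a statistical algorithm.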
Conclusion
The two studies present some intriguing findings with respect to the performance of the three segmentation methods. The two datasets represent very different circumstances that database marketers may face, and the results for these datasets are somewhat different. The dataset in study 2 comes from a non-profit organization that solicits contributions from its house file; the recent mailing that is modeled yielded a response rate of roughly one quarter of the entire file. Given these features of the dataset and mailing, RFM is as successful as CHAID and logistic regression in capturing likely responders to the solicitation at all tested levels of depth of the file (20% to 50%). Furthermore, the parameters of the test for RFM appear to be as reliable as those of CHAID and logistic regression when applied to the hold out sample. Therefore, if one were to consider only the results of study 2, it would be concluded that RFM is generally a robust procedure that is similar to the other two segmentation procedures in its ability to segment likely respondents. It appears that RFM may be successful when the overall response rate is fairly high. The characteristics of study 1, however, present a fairly common scenario for database marketers. This dataset is for a multi-division mail order company; the response rate for the offer is under 5%. Given these relatively common characteristics of a direct marketing situation, the results suggest that Hughes' RFM may not perform as well as CHAID when a marketer mails to only a small portion of the file (i.e., 30% or less). In these instances, CHAID outperforms RFM in terms of reliability and ability to capture likely responders. The actual performance of CHAID in the hold out sample is quite similar to its predicted performance in the test sample. By contrast, the actual performance of RFM in the hold out sample is significantly worse than its predicted performance in the test sample.
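The notion of capturing responders at a given depth of the file can be sketched in a few lines of Python. This is an illustrative reading of the evaluation rather than the study's own code: the function name `capture_at_depth`, the scores, and the response flags are hypothetical, and the scores could come from any of the three methods (for example, the test-sample response rate of a customer's RFM cell or CHAID segment, or a logistic regression probability).

```python
from typing import Sequence

def capture_at_depth(scores: Sequence[float], responded: Sequence[int], depth: float) -> float:
    """Share of all responders found in the top `depth` fraction of the file,
    with the file sorted from highest to lowest score."""
    if not 0 < depth <= 1:
        raise ValueError("depth must be in (0, 1]")
    total = sum(responded)
    if total == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_mailed = int(round(depth * len(scores)))
    return sum(responded[i] for i in order[:n_mailed]) / total

# Hypothetical hold out sample: scores from a model built on the test sample,
# and 1/0 flags for whether each customer actually responded.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

for depth in (0.2, 0.3, 0.4, 0.5):
    print(depth, capture_at_depth(scores, responded, depth))
```

Running the same function on the test sample and on the hold out sample, and comparing the two numbers at each depth, gives one simple way to examine the kind of reliability gap described above.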
Also, CHAID captures more respondents than RFM at both the 20% and 30% depths of the file. The superiority of CHAID in these instances suggests that the grouping of dataset members by a statistical algorithm, as in the case of CHAID, may be superior to the arbitrary and a priori groupings of RFM. In this study, CHAID creates fewer cells than the fixed and large number used in RFM. Given the low response rate to the offer and the large number of cells specified by RFM, there is a greater likelihood that chance fluctuation rather than systematic differences plays a role in the outcome for RFM compared with CHAID. Thus, the predicted level of response in the test does not hold up when the parameters of the test are applied to the hold out sample. The results across the two studies allow the researchers to consider the circumstances where RFM underperforms relative to CHAID. The findings suggest that RFM may have difficulties when the response rate is low (as in study 1) and the database marketer desires to send an offering to a relatively small portion of the entire file (30% or less). Under these circumstances, RFM may be less reliable than CHAID. Alternatively, when the response rate is relatively high (as in study 2) or the database marketer desires to mail to a relatively large portion of the file, RFM may provide results similar to CHAID and logistic regression. Overall, the study can conclude that Hughes' approach to RFM can perform at an acceptable level in many database marketing situations when a direct marketer is limited to using basic transaction variables. Given that statistical modeling can be more costly than RFM because of the need for highly trained personnel (Drozdenko and Drake, 2002), RFM can be considered an inexpensive and generally reliable procedure. Two caveats or limitations should be considered with respect to the findings.
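The chance-fluctuation argument in this paragraph can be made concrete with a small calculation. The Python sketch below uses assumed figures throughout: a test file of 20,000 customers, a 5% response rate (the text says only that study 1 is under 5%), 125 equal-sized cells for RFM, and 10 segments for a tree-based method. None of these counts come from the study, and `cell_noise` is a hypothetical helper.

```python
import math
from typing import Tuple

def cell_noise(n_customers: int, n_cells: int, response_rate: float) -> Tuple[float, float]:
    """Expected responders per cell, and the binomial standard error of a
    cell's observed response rate, assuming equal-sized cells."""
    per_cell = n_customers / n_cells
    expected_responders = per_cell * response_rate
    std_error = math.sqrt(response_rate * (1 - response_rate) / per_cell)
    return expected_responders, std_error

# Assumed figures for illustration only.
many_cells = cell_noise(20_000, 125, 0.05)  # many small cells
few_cells = cell_noise(20_000, 10, 0.05)    # a handful of large segments

print(many_cells)
print(few_cells)
```

Under these assumptions each of the 125 cells expects only about 8 responders, and the standard error of a cell's observed rate is roughly 3.5 times larger than with 10 segments. A ranking of cells built on such noisy rates is therefore more likely to reorder when it is applied to a hold out sample.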
First, the two datasets represent different sets of circumstances, both of which are relatively common in database marketing. The offers that are modeled in these data likely represent fairly common mailings in direct marketing. Therefore, the researchers assume that the characteristics of these files are not unusual circumstances in direct marketing. Having said this, the study must concede that these two examples may not generalize to all other database files. House files of different organizations may have their own peculiar characteristics and different offers may vary on a variety of dimensions. Therefore, the conclusions with respect to the performance of the three segmentation methods must be tempered with the understanding that they may not hold for all database marketing circumstances. For example, if there is a curvilinear relationship between a predictor (e.g., recency) and response, this would likely impact the performance of logistic regression, as this method models a monotonic relationship between predictors and response. Future research that tests these segmentation procedures under a variety of circumstances using simulated data would be useful. Simulating different possible relationships between predictors and response will allow the researchers to further understand the sensitivities of the three segmentation procedures to conditions that may arise in a variety of direct marketing situations. Second, the analyses compared RFM to CHAID and logistic regression where each method is constrained to use the same independent variables of recency, frequency, and monetary value. This constraint is enforced to achieve a fair test of the analytical algorithms of the three methods. Therefore, the conclusions about the relative performance are made with the understanding that the researchers are considering them only in the context of these transaction variables. 
In practice, however, CHAID and logistic regression are not constrained with respect to the variables that can be used as predictors. Response to a mailing can be modeled with a variety of variables using these two methods. One would assume that more precise modeling could be achieved using other variables. This second caveat raises a broader, and perhaps more important, issue. The analysis of recency, frequency, and monetary value, whether by an RFM model or by a statistical technique such as CHAID, focuses entirely on the past behavior of individuals. Although social scientists recognize the power of past behavior as a useful predictor of future behavior, such a narrow focus likely limits the direct marketer in their ability to understand their customers. Zahay et al. (2004) raise this point in their discussion of transactional and relational data. They argue that an emphasis on transactional information takes a very sales-oriented approach to customers. Such an emphasis may aid sales in the short run; however, it does not add to the long-term relationship with customers. A consideration of relational data such as information about motivations, attitudes, values, and lifestyles takes more of a marketing approach to customers. Although these variables may be less useful than transaction information in their ability to predict a response to an immediate marketing activity (i.e., a mailing), they may be enormously useful in understanding the underlying tendencies in customers. This consideration would favor analytical techniques such as CHAID and logistic regression that can accommodate a variety of personality and individual difference information.
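The first caveat above notes that logistic regression models a monotonic relationship between a predictor and response, so a curvilinear relationship could hurt its performance. The following Python sketch illustrates that point on invented data: five recency bands (centered at zero) whose response rates peak in the middle band. The helper names, the data, and the plain gradient-ascent fitting routine are all hypothetical. Adding a squared term is shown only as one common workaround and is not something the study itself proposes.

```python
import math
from typing import List, Sequence

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows: Sequence[Sequence[float]], rates: Sequence[float],
                 steps: int = 20000, lr: float = 0.1) -> List[float]:
    """Fit logistic weights by gradient ascent on the mean log-likelihood.
    Each row is a feature vector with a leading 1.0 for the intercept;
    `rates` is the observed response rate per row (equal-sized groups assumed)."""
    w = [0.0] * len(rows[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(rows, rates):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += (y - p) * xj
        w = [wj + lr * g / len(rows) for wj, g in zip(w, grad)]
    return w

def predict(w: Sequence[float], rows: Sequence[Sequence[float]]) -> List[float]:
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, x))) for x in rows]

def is_monotone(v: Sequence[float]) -> bool:
    pairs = list(zip(v, v[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

# Invented data: response peaks in the middle recency band.
recency = [-2.0, -1.0, 0.0, 1.0, 2.0]
rates = [0.10, 0.30, 0.50, 0.30, 0.10]

linear_rows = [[1.0, x] for x in recency]
quad_rows = [[1.0, x, x * x] for x in recency]

linear_pred = predict(fit_logistic(linear_rows, rates), linear_rows)
quad_pred = predict(fit_logistic(quad_rows, rates), quad_rows)

print("linear model monotone:", is_monotone(linear_pred))
print("with squared term monotone:", is_monotone(quad_pred))
```

With only the linear term, the fitted probabilities cannot rise and then fall, so the model misses the peak in the middle band. With the squared term added, the fitted curve can follow the inverted-U shape.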