The impact of data accuracy on segmentation performance: Benchmarking RFM analysis, logistic regression, and decision trees
Article code | Publication year | Pages in English article |
---|---|---|
1361 | 2012 | 8 pages (PDF) |
Publisher: Elsevier - ScienceDirect
Journal: Journal of Business Research, Available online 5 October 2012
English abstract
Companies greatly benefit from knowing how problems with data quality influence the performance of segmentation techniques and which techniques are more robust to these problems than others. This study investigates the influence of problems with data accuracy, an important dimension of data quality, on three prominent segmentation techniques for direct marketing: RFM (recency, frequency, and monetary value) analysis, logistic regression, and decision trees. For the two real-life direct marketing data sets analyzed, the results demonstrate that (1) under optimal data accuracy, decision trees are preferred over RFM analysis and logistic regression; (2) the introduction of data accuracy problems deteriorates the performance of all three segmentation techniques; and (3) as data become less accurate, decision trees remain superior to logistic regression and RFM analysis. Overall, this study recommends the use of decision trees in the context of customer segmentation for direct marketing, even under the suspicion of data accuracy problems.
English introduction
The increasing digitization of transactions has boosted the amount of customer information stored in large transactional databases (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). This evolution has led to the emergence of database marketing as a popular discipline in academic research and business practice (Ko, Kim, Kim, & Woo, 2008). A prominent database marketing application is customer segmentation for direct marketing, where the analyst tries to find homogeneous groups of customers with respect to their response behavior by means of so-called data-mining tools (Akaah et al., 1995; Cortinas et al., 2010; McCarty & Hastak, 2007; Merrilees & Miller, 2010; Morganosky & Fernie, 1999). The usage of data-mining tools in direct marketing is subject to the knowledge discovery in databases (KDD) process, whose growing importance is reflected by the large number of publications and applications in both academia and business (e.g., Bose & Mahapatra, 2001).
KDD prescribes a multi-level process to derive valuable top-level strategic insights from low-level raw data (Fayyad et al., 1996). A typical KDD process consists of the following five consecutive steps: (1) problem identification, in which the application domain is defined and objectives are formulated; (2) data preparation, or selecting, preprocessing, reducing, and transforming the data; (3) data mining, or choosing and applying an appropriate analysis technique; (4) the analysis, evaluation, and interpretation of results; and (5) presentation, assimilation, and use of knowledge (Han & Kamber, 2006; Martínez-López & Casillas, 2009). Although the success of implementing a KDD process depends on the value of each of its five constituent steps (Crone et al., 2006; Fayyad et al., 1996), a significant proportion of recent research in direct marketing has focused exclusively on the data-mining phase and its tools for segmenting customers (e.g., RFM [recency, frequency, and monetary value] analysis [McCarty & Hastak, 2007], logistic regression [McCarty & Hastak, 2007], decision trees [Haughton & Oulabi, 1993], and more advanced techniques, such as artificial neural networks [Zahavi & Levin, 1995, 1997], support vector machines [Viaene et al., 2001], and genetic fuzzy systems [Martínez-López & Casillas, 2009]).
In addition to the choice of the best segmentation tool, data quality (DQ) is an equally important concept in customer analytics (Feelders et al., 2000; Ko et al., 2008). Prior research shows that bad data yield bad analytical results, often referring to this as the “garbage in, garbage out” principle (Baesens, Mues, Martens, & Vanthienen, 2009). Consider the marketing department of a direct marketing company that wants to profile its segmented customers according to their monetary value, i.e., the total amount they have spent at the company in the past. If some of the monetary value figures are incorrect, the uncertainty in calculating the correct average monetary value per segment increases, and consequently the information quality and the segmentation performance decrease.
DQ is often considered a multi-dimensional construct with four subcategories (Wang & Strong, 1996): (1) intrinsic DQ, denoting that data have quality in their own right; (2) contextual DQ, referring to the fact that DQ should be considered within the context of the task at hand; and (3) representational DQ and (4) accessibility DQ, both linked to the importance of the information system(s). Although many DQ attributes in each of these four subcategories have been introduced in the literature, this study focuses on how segmentation performance is affected by the intrinsic DQ attribute accuracy, which is defined as conformity with the real world (Wand & Wang, 1996). Three arguments motivate the need for investigating the impact of data accuracy on segmentation performance.
First, data accuracy is one of the well-documented attributes in the DQ literature. Second, data inaccuracy can be simulated and its impact measured objectively, which is impossible for other, more subjective dimensions of data quality. Third, no research is available that investigates the impact of data accuracy problems on segmentation performance in a direct marketing setting.
In the KDD process, DQ results from choices made in the data-preparation phase (Fayyad et al., 1996). The data-preparation phase consists of the following sub-steps: (i) data selection, aimed at selecting relevant information while minimizing noise; (ii) data preprocessing; and (iii) data reduction and transformation. Previous research mainly focuses on strategies to improve DQ within the preprocessing and transformation phases and discusses topics such as feature selection (Kim, Street, Russell, & Menczer, 2005), re-sampling (Chawla, Bowyer, Hall, & Kegelmeyer, 2002), outlier detection (Van Gestel et al., 2005), the discretization of continuous attributes (Berka & Bruha, 1998), and the mapping and scaling of categorical variables (e.g., Zhang, Zhang, & Yang, 2003). However, the data selection phase also has a significant impact on DQ, and thus on segmentation performance. Problems that may arise here are missing values, outdated data values, and inaccurate data (Even, Shankaranarayanan, & Berger, 2010). Several authors investigate the merits of missing value imputation in the KDD process (e.g., Batista & Monard, 2003; Brown & Kros, 2007), but research on the direct impact of inaccurate and outdated data on the performance of segmentation models for direct marketing is not available.
In summary, the objectives of this study are to assess the impact of data accuracy problems on the quality of customer segmentation approaches and to uncover whether some segmentation techniques are more resistant to these problems than others.
The paper is organized as follows. Section 2 describes the segmentation approaches and the evaluation metric. Section 3 describes the experimental setup of this study, and Section 4 describes the impact of data accuracy problems on the segmentation performance for real-life direct marketing data. Section 5 reviews the managerial implications of the impact of poor data accuracy, and finally Section 6 summarizes the results and offers suggestions for further research.
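The RFM analysis referred to in the introduction can be illustrated with a minimal sketch. The version below assumes the common quintile-based scoring, in which each customer receives a score from 1 to 5 on recency, frequency, and monetary value. The excerpt does not state which RFM variant the study uses, and the function names, the field names (`recency_days`, `frequency`, `monetary`), and the sample customers are hypothetical.

```python
def quintile_scores(values, higher_is_better=True):
    """Score each value from 1 (worst) to 5 (best) by its rank quintile.

    Ties are broken by input order, which is a simplification.
    """
    n = len(values)
    # Sort indices so that the worst value comes first (rank 0).
    order = sorted(range(n), key=lambda i: values[i],
                   reverse=not higher_is_better)
    scores = [0] * n
    for rank, idx in enumerate(order):
        scores[idx] = 1 + (rank * 5) // n
    return scores


def rfm_cells(customers):
    """Attach R, F, M scores and a combined cell code to each customer."""
    # Fewer days since the last purchase is better, so recency is reversed.
    r = quintile_scores([c["recency_days"] for c in customers],
                        higher_is_better=False)
    f = quintile_scores([c["frequency"] for c in customers])
    m = quintile_scores([c["monetary"] for c in customers])
    return [
        {"id": c["id"], "R": ri, "F": fi, "M": mi, "cell": f"{ri}{fi}{mi}"}
        for c, ri, fi, mi in zip(customers, r, f, m)
    ]


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the study.
    sample = [
        {"id": "a", "recency_days": 5, "frequency": 12, "monetary": 900.0},
        {"id": "b", "recency_days": 40, "frequency": 7, "monetary": 450.0},
        {"id": "c", "recency_days": 90, "frequency": 4, "monetary": 200.0},
        {"id": "d", "recency_days": 200, "frequency": 2, "monetary": 80.0},
        {"id": "e", "recency_days": 400, "frequency": 1, "monetary": 25.0},
    ]
    for row in rfm_cells(sample):
        print(row["id"], row["cell"])
```

Customers in high-scoring cells would typically be targeted first. Because the scores depend only on the ranks of the three variables, an inaccurate value matters to the extent that it moves a customer into a different quintile.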
English conclusion
The goal of this research is to investigate the impact of the level of data accuracy on the performance of three segmentation algorithms. Using two real-life direct marketing data sets, this study treats a specific number of observations as inaccurate by replacing a fraction of the values of a variable with randomly generated values from the variable's range. The general linear model demonstrates that in the absence of (additional) data accuracy issues, CHAID (chi-squared automatic interaction detection), the decision tree technique, is the superior choice compared to RFM and logistic regression, while RFM is preferred over logistic regression for the low response rate data set, and for small file depth levels having a high response incidence. The introduction of data accuracy problems deteriorates the performance of all three techniques.
Overall, CHAID is considered the best choice for both low and high response data sets under data accuracy problems, because either (1) CHAID is less sensitive to increasing problems in data accuracy than logistic regression or RFM, which reinforces the superiority of CHAID, or (2) where CHAID is more sensitive to data inaccuracies than RFM, its performance still remains much stronger than that of RFM, which makes CHAID a safe choice for direct marketing analysts under all conditions. In addition, the relative sensitivity of RFM and logistic regression to data accuracy issues depends on the level of the response rate: logistic regression seems less or equally sensitive to data inaccuracies in a low response data set, while the inverse is true for settings with high response rates. This leads to a minor change relative to the preference observed under optimal data accuracy: logistic regression becomes a competitor of RFM at medium to high file depth levels for low response data sets.
Several avenues for further research are identified. As this study compares the three most common segmentation approaches in a direct marketing setting, further research should investigate the sensitivity of other segmentation models. In addition, this study uses the same information for the different algorithms (i.e., the RFM variables). This limitation enables a fair comparison among the different segmentation approaches and helps inform the conclusions. However, in practice, direct marketers are not constrained to these three variables. Thus, research should incorporate other transactional and socio-demographic information in CHAID and logistic regression to increase the quality of discrimination between responders and non-responders. Finally, we truly believe that considerable research opportunities exist to establish research methods that mimic the problem of data accuracy. The merits and drawbacks of these new methods could be compared with this study's implementation of the data inaccuracy simulation.
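The data inaccuracy simulation described in the conclusion, replacing a fraction of a variable's values with randomly generated values from the variable's range, might be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the uniform distribution, the rounding of the number of affected observations, and the name `inject_inaccuracy` are all assumptions made here.

```python
import random


def inject_inaccuracy(values, fraction, rng=None):
    """Return a copy of `values` with a fraction of entries replaced.

    Replacements are drawn uniformly from [min(values), max(values)].
    The uniform distribution is an assumption; the source only says
    "randomly generated values from the variable range".
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    rng = rng or random.Random()
    corrupted = list(values)  # leave the caller's data untouched
    if not corrupted:
        return corrupted
    lo, hi = min(corrupted), max(corrupted)
    # Number of observations to treat as inaccurate (rounded).
    k = round(fraction * len(corrupted))
    # Pick k distinct positions and overwrite them with random values.
    for i in rng.sample(range(len(corrupted)), k):
        corrupted[i] = rng.uniform(lo, hi)
    return corrupted


if __name__ == "__main__":
    # Hypothetical monetary values, not taken from the study.
    monetary = [25.0, 80.0, 200.0, 450.0, 900.0, 120.0, 60.0, 310.0]
    noisy = inject_inaccuracy(monetary, 0.25, random.Random(7))
    print("mean before:", sum(monetary) / len(monetary))
    print("mean after: ", sum(noisy) / len(noisy))
```

Repeating the call for increasing values of `fraction` and refitting each segmentation technique on the corrupted variable would give a sensitivity curve per technique. Integer-valued variables such as frequency would need the replacement values rounded.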