Assessing the performance of direct marketing scoring models
Article code | Publication year | Length of English article |
---|---|---|
23569 | 2001 | 14 pages (PDF) |
Publisher : Elsevier - ScienceDirect
Journal : Journal of Interactive Marketing, Volume 15, Issue 1, 2001, Pages 49–62
Abstract
Direct marketers commonly assess their scoring models with a single-split, gains chart method: They split the available data into “training” and “test” sets, estimate their models on the training set, apply them to the test set, and generate gains charts. They use the results to compare models (which model should be used), assess overfitting, and estimate how well the mailing will do. It is well known that the results from this approach are highly dependent on the particular split of the data used, due to sampling variation across splits. This paper examines the single-split method. Does the sampling variation across splits affect one's ability to distinguish between superior and inferior models? How can one estimate the overall performance of a mailing accurately? I consider two ways of reducing the variation across splits: Winsorization and stratified sampling. The paper gives an empirical study of these questions and variance-reduction methods using the DMEF data sets.
Introduction
Direct marketing scoring models are used to predict the future behavior of a group of customers. Consider an example. Suppose that a catalog company plans to circulate a back-to-school catalog to its customers and that it must decide which of its customers should receive the book. Sending a catalog to someone who is not interested in purchasing from the book is usually not profitable; therefore the catalog company would like to know who is likely to make a purchase. Scoring models can help the catalog company with this task, as well as many related tasks such as determining which prospects should receive a book.

Scoring models are usually built using historical data. In the back-to-school example, the company probably circulated a similar book during the previous year and observed who responded and who didn’t. The company could use data from the previous year to make decisions about this year’s mailing. It could use a predictive modeling technique such as regression to estimate how much each customer spent during the previous year, provided that the customer received the offer last year. Next, it would predict this quantity using purchase history it had on the customers prior to the mailing, starting with versions of recency, frequency, and monetary value. After estimating the model, it would apply the model to the current purchase history, called scoring the database, and have a better idea of who will respond to this year’s offer. The functional form and estimation of scoring models have been the focus of much recent research (see, e.g., Bult, 1993; Bult & Wansbeek, 1995; Colombo & Jiang, 1999; Zahavi & Levin, 1997; Magidson, 1988; Hansotia & Wang, 1997; Malthouse, 1999).

This paper evaluates an important question that direct marketers fitting scoring models face: how can I assess the performance of a model? There are two reasons to ask this question:

Model selection: the relative question.
When a company builds a scoring model it usually ends up with several possible models and must choose one for implementation. To do this, it must know how one model performs relative to another. For example, the company may have used stepwise regression to select a subset of predictor variables for the final model; after using stepwise regression it must choose one of the resulting models. Alternatively, it may have tried different modeling techniques. In addition to using a regression model, perhaps it tried CHAID and neural network models as well; which is better? Also, the company could be evaluating whether or not to use overlaid variables, which it must pay to use; for example, the company could use zip-level Census demographics for free, or could buy more accurate demographic information. In deciding whether or not to purchase the more accurate demographics, it must evaluate whether or not they improve the performance of its models relative to those using zip-level data only.

The absolute question. A second reason to evaluate a model is to estimate the performance of a mailing for planning purposes. How much demand will a particular circulation plan generate? In this case the objective is to understand how the model performs in absolute terms; for model selection the emphasis is on assessing the performance of one model relative to another. For example, the business plan for a company might specify that a certain number of customers must be “active” at the end of a time period. The gains chart for a scoring model will help predict how many customers will activate. Another example is assessing how a model will do on the margin. If one more book is mailed, what is the chance that this customer will activate? Such information is important in planning circulation across different campaigns, e.g., new customer acquisition, current customer retention, and former customer re-activation.
In both examples there is a need to know how a model will perform in absolute terms.

This distinction is important because some of the methods discussed below will help modelers decide between models, but will give biased estimates of the model performance in absolute terms.

The performance of a model is evaluated with a gains table (Jackson & Wang, 1994, pp. 174–177), sometimes called a decile analysis (David Shepard Associates, 1998, chap. 24). An example gains table is shown in Table 1. Gains charts are computed by sorting the data in descending order of scores (the predicted demand), assigning the observations to quantiles, and then computing average and cumulative average dollar amounts or response rates by quantile. In the example gains chart, the average amount spent by those in the top quintile was $4.91 (Mean column); the average amount spent by those in the second quintile was $2.75. The average amount spent by the top two quintiles combined was $3.83 (Cumulative average amount column). The cumulative average column tells us that the average amount of money we would make per customer if we mailed to 40% of the file is $3.83. This column, or some function of it such as cumulative lift, is usually used for comparing models. If we knew that we wanted to send the mailing to 40% of our customers, we would usually want to use the model that generated the most money for us, which can be assessed by examining the cumulative columns.

One problem with this approach is that we are using the same data that we used to estimate our model to evaluate it. It is possible that our model was too flexible and overfitted the data, i.e., it captures idiosyncrasies of the particular data set used for estimation. In this case the gains chart will suggest that the model will perform much better than it will in practice. Overfitting is particularly a problem with very flexible functional forms such as those of CHAID and neural network models.
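The gains table computation described in this section (sort by score, assign quantiles, then average and cumulate) can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `gains_table` and the ten-customer data set are hypothetical, and the toy amounts are made up so that the first two rows reproduce the $4.91, $2.75 and $3.83 figures quoted above. They are not the values behind Table 1.

```python
def gains_table(scores, amounts, n_groups=5):
    """Return one (group, mean amount, cumulative mean amount) row per quantile."""
    if len(scores) != len(amounts) or len(scores) < n_groups:
        raise ValueError("need matching inputs with at least one row per group")
    # Sort observations in descending order of score (the predicted demand).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [amounts[i] for i in order]
    n = len(ranked)
    rows = []
    for g in range(n_groups):
        # Quantile boundaries; group sizes differ by at most one observation.
        lo, hi = g * n // n_groups, (g + 1) * n // n_groups
        group_mean = sum(ranked[lo:hi]) / (hi - lo)
        cumulative_mean = sum(ranked[:hi]) / hi
        rows.append((g + 1, group_mean, cumulative_mean))
    return rows


# Hypothetical toy data: model scores and the dollars each customer spent.
scores = [0.9, 0.1, 0.8, 0.7, 0.3, 0.6, 0.2, 0.5, 0.4, 0.05]
amounts = [5.82, 0.0, 4.00, 3.00, 0.0, 2.50, 0.0, 1.00, 0.50, 0.0]

rows = gains_table(scores, amounts)
for group, group_mean, cumulative_mean in rows:
    print(f"quintile {group}: mean {group_mean:.2f}, cumulative {cumulative_mean:.2f}")
```

The cumulative value in row 2 is the average over the top 40% of the file, which is the quantity one would compare across models when mailing to that depth.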
A problem closely related to overfitting is estimating the prediction error of a regression. It is well known that estimates of prediction error made from the same data used to estimate the model are biased downwards (Efron & Tibshirani, 1993, p. 248; Breiman and Spector, 1992, section 2.4), even if the model does not overfit the data. Efron calls this bias the optimism of the estimate.

Little has been published on assessing scoring model performance. The solution that direct marketers often follow is to use a test set. The data are split into two parts, one used to estimate the model and the other to validate it. This procedure will be called the single-split method in this paper. The single-split method is problematic because the results are highly dependent on the particular split of the data. David Shepard Associates (1998, chap. 26) gives an example where a model is fitted on 181,100 observations and evaluated on two test sets; the cumulative columns vary by an alarming amount, despite the large sample sizes. They propose a way to construct confidence intervals for these estimates. The focus of their work seems to be on the absolute question.

This paper focuses on the question of model selection, although it also makes recommendations pertaining to the absolute question. It examines how sampling variation affects our ability to choose among candidate models. By using the single-split validation procedure, how often do we choose an inferior model? This question is of critical importance to practitioners, because using an inferior model means that we are increasing the number of offers we send to people who are not interested; this can damage our brand and reduce the profitability of a particular mailing. This paper also considers ways of improving the basic test-set approach. In particular, it evaluates methods of handling outliers and sampling procedures for splitting the data.
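To make the dependence on the split concrete, the sketch below repeats the single-split procedure many times on synthetic data and records the test-set cumulative average at 40% depth for two competing one-variable regressions. Everything here is an assumption for illustration: the data are simulated rather than taken from the DMEF sets, and the names `fit_line`, `cum_mean_top`, `single_split`, `strong` and `weak` are hypothetical. Both models are fitted on the same training part and compared on the same test part of each split.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b * x (one predictor)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def cum_mean_top(scores, amounts, depth=0.4):
    """Cumulative average amount among the top `depth` fraction by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = max(1, int(depth * len(order)))
    return sum(amounts[i] for i in order[:k]) / k


def single_split(data, predictors, rng, test_frac=0.5, depth=0.4):
    """One random split: fit each model on train, evaluate all on the same test set."""
    rows = data[:]
    rng.shuffle(rows)
    n_test = int(test_frac * len(rows))
    test, train = rows[:n_test], rows[n_test:]
    y_train = [r["amount"] for r in train]
    y_test = [r["amount"] for r in test]
    result = {}
    for p in predictors:
        a, b = fit_line([r[p] for r in train], y_train)
        result[p] = cum_mean_top([a + b * r[p] for r in test], y_test, depth)
    return result


# Synthetic customers (an assumption): few buyers, skewed dollar amounts.
rng = random.Random(0)
data = []
for _ in range(2000):
    signal = rng.random()
    buys = rng.random() < 0.1 + 0.2 * signal
    amount = rng.expovariate(1 / 50) if buys else 0.0
    data.append({
        "amount": amount,
        "strong": signal + rng.gauss(0, 0.2),  # informative predictor
        "weak": signal + rng.gauss(0, 1.0),    # noisier predictor
    })

results = [single_split(data, ["strong", "weak"], rng) for _ in range(100)]
strong_wins = sum(r["strong"] > r["weak"] for r in results)
spread = max(r["strong"] for r in results) - min(r["strong"] for r in results)
print(f"'strong' beats 'weak' in {strong_wins} of {len(results)} splits")
print(f"range of the 'strong' cumulative average across splits: {spread:.2f}")
```

The count of wins speaks to the relative question (how often one split picks the better model), while the range across splits speaks to the absolute question (how far a single estimate can be from another).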
Conclusion
This paper examines the single-split approach of assessing the performance of a scoring model. This method is popular with practitioners because it is easy to implement in commercial software and the amount of computational time required to implement it is small, but the results are highly dependent on the particular split used. The alternative is to use a resampling method, which is far less dependent on the set of splits used, but is more difficult to implement in commercial software and is computationally much more expensive. We examine the quality of the answers produced using the single-split method, and look for ways to improve it without increasing the computational burden. The conclusions depend on the reason why model performance is being assessed.

Model selection and the “relative” question. Despite the alarming amount of sampling variation across splits, the relative performance of models using the single-split method is usually consistent for the models considered here. This is good news to practitioners because it means that they can continue to use the single-split approach during this phase of the analysis, provided that the results here apply to other data sets as well. The procedure of making a decision based on a single split breaks down when the average performances of the models considered are nearly equal. When the difference in performance between two models is small, the results in this paper suggest that Winsorizing and stratified sampling improve our ability to make a decision. When the difference is large, these variance-reduction methods do not seem to be necessary, although they don’t seem to do any harm either. Therefore, I suggest Winsorizing the data and stratifying when selecting a model. Most importantly, when deciding between two models, practitioners should estimate the models using the same “training” data and compare them using the same “test” data.

The absolute question.
When estimating the overall performance of a final model, Winsorization should not be used because it produces biased results and most likely increases the mean squared error of the estimates when the size of the test set is large. Stratification reduces the variance of our estimate without introducing any bias. If the practitioner chooses to split the data when evaluating absolute performance, stratification should be used.

The results from this analysis may or may not apply to other scoring models and data sets. Further empirical studies under other conditions (e.g., response models where there are no outliers, zip models where the signal-to-noise ratio is much smaller) and using other data sets are certainly necessary.

This paper makes several contributions. First, it identifies two distinct reasons to examine a gains chart. The method used to evaluate the gains chart depends on this reason. Second, it discusses two variance-reduction methods, Winsorization and stratified sampling, and explains why these methods could reduce the variation across splits. Third, it proposes a methodology to evaluate the results from the single-split method with and without the variance-reduction methods. Fourth, it applies this methodology to a real direct marketing data set and draws useful conclusions that can be easily implemented by practitioners.
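The two variance-reduction methods can be sketched in a few lines. This excerpt does not say at which percentile the paper Winsorizes or how it forms strata, so both choices below are assumptions: amounts are capped from above at the 99th percentile, and strata are consecutive blocks of ten customers after sorting by amount spent. The function names `winsorize_upper` and `stratified_split` and the toy data are hypothetical.

```python
import random


def winsorize_upper(values, pct=99):
    """Cap every value above the empirical pct-th percentile at that percentile."""
    if not values:
        return []
    ranked = sorted(values)
    # Integer form of ceil(pct * n / 100) - 1, avoiding float rounding issues.
    idx = max(0, (pct * len(ranked) + 99) // 100 - 1)
    cap = ranked[idx]
    return [min(v, cap) for v in values]


def stratified_split(amounts, rng, test_frac=0.5, block=10):
    """Return (train_idx, test_idx), stratifying on the amount spent.

    Customers are sorted by amount and cut into consecutive blocks; each
    block sends the same fraction of its members to the test set, so large
    spenders cannot all land on one side of the split by chance.
    """
    order = sorted(range(len(amounts)), key=lambda i: amounts[i])
    train, test = [], []
    for start in range(0, len(order), block):
        stratum = order[start:start + block]
        rng.shuffle(stratum)
        k = round(test_frac * len(stratum))
        test.extend(stratum[:k])
        train.extend(stratum[k:])
    return train, test


# Hypothetical file: 90 non-buyers, 9 ordinary buyers, one extreme order.
rng = random.Random(1)
amounts = [0.0] * 90 + [rng.expovariate(1 / 40) for _ in range(9)] + [5000.0]

capped = winsorize_upper(amounts, pct=99)
train_idx, test_idx = stratified_split(amounts, rng)

print(f"mean before capping: {sum(amounts) / len(amounts):.2f}")
print(f"mean after capping:  {sum(capped) / len(capped):.2f}")
print(f"buyers in test set:  {sum(amounts[i] > 0 for i in test_idx)}")
```

Capping lowers the mean, which illustrates the bias that leads the conclusion to advise against Winsorizing when estimating absolute performance, while the stratified split leaves the amounts untouched and only controls how they are distributed between the two parts.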