Decision support for selecting an optimal logistic regression model
Publisher: Elsevier - Science Direct
Journal : Expert Systems with Applications, Volume 39, Issue 10, August 2012, Pages 8573–8583
This study concerns itself with providing user support for a decision problem in logistic regression analysis: given a set of metric variables and one binary dependent variable, select the optimal subset of variables that can best predict this dependent variable. The problem requires an evaluation of competing models based on heuristic selection criteria such as goodness-of-fit and prediction accuracy. This paper documents the heuristics, formalizes the algorithms, and eventually presents an interactive decision support system to facilitate the selection of such an optimal model. This study adds to the sparsely studied domain of expert systems for social science researchers, and makes three contributions to the literature. First, the study formalizes a number of heuristics to arrive at optimal logistic regression models. Second, the study presents two computational algorithms that incorporate these formalized heuristics. Third, the paper documents an implementation of these algorithms through an interactive decision support system. The study concludes with a discussion on the risks of relying too heavily on the system and with future opportunities for research.
This study concerns itself with the problem of model specification in logistic regression analysis. Logistic regression is a special form of regression in which the dependent variable (DV) represents the success or failure of a certain event. Unlike traditional regression, where the DV has an interval or ratio scale, the logistic DV has only two outcomes: true (the event occurred) or false (the event did not occur). Logistic regression analysis is a widely used research method in many areas of the social sciences (see, e.g., Peng, Lee, & Ingersoll, 2002). For example, in business and management, logistic regression can predict whether potential customers will convert (Akinci, Kaynak, Atilgan, & Aksoy, 2007) or whether companies will go bankrupt (Chen, 2011; Ge & Whitmore, 2009).
To establish the likelihood of success or failure of a certain event, the researcher typically collects data on a number of variables. After the data are collected, the researcher must formulate a logistic regression model, consisting of a subset of those variables, and use this model to predict whether or not the event happens. The specification of this model is a crucial activity in logistic regression analysis, because a number of different models frequently compete for suitability, and the outcomes of these models may vary significantly. This specific activity, the selection of an optimal logistic model, is the focus of the present study.
The study documents the development of a decision support system (DSS) that helps the researcher to specify logistic regression models and analyze these models on statistical performance measures. It does so by scrutinizing all possible variables, calculating performance measures for all possible models, and highlighting the aspects on which these models compete with each other.
These aspects include the suitability of the independent variables selected, the goodness-of-fit of each model with the underlying data values, and the classification accuracy of the model in correctly predicting the success or failure of a certain event.
The study makes three contributions to the literature. First, the study reviews and formalizes a number of heuristics to arrive at optimal logistic regression models. Second, the study presents two computational algorithms that incorporate these formalized heuristics. Third, the paper documents an implementation of these algorithms in an interactive decision support system. Viewed from a wider angle, the study also adds to the application domain of expert systems for social science research. This application domain is notably understudied compared to other domains such as healthcare and finance, and applications in the social sciences are rare. A recent exception is Chu, Tseng, Tsai, and Luo (2009), who develop a system to explore the statistical significance of group mean differences in a questionnaire data set.
The rest of this paper is organized as follows. First, the paper analyzes and reviews the heuristics that lead to optimal model selection. This analysis results in the formulation of two algorithms. The paper then presents a demonstration of the decision support system, in which each of these algorithms is illustrated with a real data set. A discussion of the risks of using the system and of opportunities for further research concludes the paper.
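The exhaustive evaluation outlined in this introduction can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the function names are hypothetical, and the choice of the binomial log-likelihood as the goodness-of-fit measure and of a 0.5 classification cutoff are assumptions, since this section does not say which statistics the system computes. The sketch only shows how every non-empty subset of candidate variables can be listed, and how fit and classification accuracy could be scored once a model has produced predicted probabilities.

```python
from itertools import combinations
import math


def candidate_models(variables):
    """Return every non-empty subset of the candidate variables, largest first."""
    models = []
    for size in range(len(variables), 0, -1):
        models.extend(combinations(variables, size))
    return models


def log_likelihood(y_true, p_hat):
    """Binomial log-likelihood of 0/1 outcomes given predicted probabilities.

    Assumed stand-in for a goodness-of-fit measure; values closer to zero
    indicate a better fit.
    """
    eps = 1e-12  # keeps log() away from 0 when a probability is exactly 0 or 1
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def classification_accuracy(y_true, p_hat, cutoff=0.5):
    """Share of cases whose predicted class (p >= cutoff) matches the outcome."""
    hits = sum(1 for y, p in zip(y_true, p_hat) if (p >= cutoff) == (y == 1))
    return hits / len(y_true)


if __name__ == "__main__":
    for model in candidate_models(["v1", "v2", "v3"]):
        print(model)
```

With k candidate variables the list holds 2^k - 1 models, so the enumeration grows quickly as variables are added. Fitting each model, which is the step that yields the predicted probabilities, is deliberately left out of the sketch.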
Conclusion
This paper has formalized the selection of an optimal logistic regression model by introducing two algorithms and a decision support system. One advantage of the DSS is that it exhausts all possible combinations of the independent variables specified by the user. This ensures that no potentially suitable combination is overlooked. A second advantage of the decision support system is that it highlights the superiority of models on several independent model success criteria: goodness-of-fit, predictive accuracy, and number of variables. Because the system allows the user to iteratively compare and rank logistic regression models on these criteria, the researcher can conveniently and almost instantly examine which models are superior.
There are also disadvantages to using the system. One of the most important risks is the temptation to use the system for the purposes of data fishing. When this happens, the researcher is looking for the model that fits the data best, regardless of whether the model makes any theoretical sense. It is important to point out that the system should only be used with variables that make a priori statistical sense. The DSS is designed to statistically explore different versions of logistic models that have been formulated using previously postulated relationships. For example, if independent variables v1, v2, and v3 are specified as possible predictors for the outcome of the binary variable, the system will seek to find the best results from the models (v1, v2, v3), (v1, v2), (v1, v3), (v2, v3), and finally (v1), (v2), and (v3) separately. In all cases, the model involved is a reflection of previously hypothesized relationships.
Another risk, related to the previous one but not necessarily accompanying it, is the risk of over-fitting. Over-fitting occurs when too much error variance from the independent variables is used to predict the value of the dependent variable.
If this occurs, the information value of the specific data set is optimized, and although the best model is obtained in terms of goodness-of-fit or predictive accuracy, this model is only good for the specific data set with which it was developed, and the results will break down when the model is applied to a new data set. The decision support system will not detect over-fitting as such, and consequently it is up to the researcher to spot the danger and mitigate the risks. To diagnose over-fitting, the researcher will need to examine suspect models in particular: those that have very high levels of model fit and a substantial number of independent variables. For example, if the system includes independent variables that would intuitively be nonsensical (even though they contribute to “better” models), then this may be an indication of over-fitting. An appropriate strategy to counterbalance the risk of over-fitting is to split the sample in half, with one half providing the basis for the model and the other half (the hold-out sample) used to test the model. The system can easily accommodate this in its data set management module. It goes without saying that the researcher should also make sure that no nonsensical independent variables are included in the model (even though the system might highlight these as being informative).
There are a few areas for further development. The first area is the inclusion of nominal independent variables. The second is the automatic investigation of the interaction of two variables. Both areas of extension have one element in common: they both need the generation of new variables. To include independent variables with nominal values, a special technique called “dummy” coding creates a set of binary variables, with the binary coding representing the nominal values. To model the interaction, an auxiliary variable must be created that multiplies the values of the two variables involved.
At the moment, the system does not automatically create those auxiliary variables, and the user must create them outside of the system before the data set is loaded into the system.
The decision support system described in this paper facilitates an important activity in logistic regression analysis: the evaluation of logistic regression models according to statistical performance measures. It is a powerful tool for comparing models systematically and for supporting this modeling exercise. As with other powerful tools, it should be used with caution. It is hoped that use of the algorithms and the system by other researchers will lead to a more widespread adoption of logistic regression, and that a greater understanding of the strengths and pitfalls of logistic regression will ensue.
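The data preparation steps that this conclusion leaves to the researcher (a half-and-half hold-out split, dummy coding of a nominal variable, and a product term for an interaction) might look like the following sketch. All function names are hypothetical and nothing here reproduces the system's data set management module. Dropping one reference level in the dummy coding is an assumption based on common practice, not something the text specifies.

```python
import random


def split_holdout(rows, seed=0):
    """Shuffle the cases and split them into an estimation half and a hold-out half."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    half = len(indices) // 2
    estimation = [rows[i] for i in indices[:half]]
    holdout = [rows[i] for i in indices[half:]]
    return estimation, holdout


def dummy_code(values, reference=None):
    """Turn one nominal variable into 0/1 indicator columns.

    One level (by default the first in sorted order) is treated as the
    reference category and gets no column; this is an assumption.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in levels
        if level != reference
    }


def interaction(x1, x2):
    """Auxiliary variable holding the case-by-case product of two variables."""
    return [a * b for a, b in zip(x1, x2)]


if __name__ == "__main__":
    print(dummy_code(["red", "blue", "green", "blue"]))
    print(interaction([1.0, 2.0, 3.0], [2.0, 0.5, 4.0]))
```

The resulting columns would be added to the data set before it is loaded, after which they can be offered to the system as ordinary metric candidates.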