Download English ISI Article No. 24195
Persian Translation of the Article Title

انتخاب زیرمجموعه در رگرسیون خطی چندگانه: یک رویکرد جدید برنامه‌ریزی ریاضی

English Title
Subset selection in multiple linear regression: a new mathematical programming approach
Article Code: 24195
Publication Year: 2005
Length: 13 pages (PDF)
Source

Publisher: Elsevier - Science Direct

Journal: Computers & Industrial Engineering, Volume 49, Issue 1, August 2005, Pages 155–167

Persian Translation of Keywords
برنامه‌ریزی ریاضی - روش‌های ابتکاری - آمار چند متغیره - رگرسیون - آزادسازی لاگرانژی - GRASP
English Keywords
Mathematical programming, Heuristics, Multivariate statistics, Regression, Lagrangian relaxation, GRASP

English Abstract

A new mathematical programming model is proposed to address the subset selection problem in multiple linear regression, where the objective is to select a minimal subset of predictor variables without sacrificing any explanatory power. A parametric solution of this model yields a number of efficient subsets. To obtain this solution, either an optimal algorithm or one of two heuristic algorithms is applied repeatedly. The subsets generated are compared to those generated by several standard procedures. The results suggest that the new approach finds subsets that compare favorably against the standard procedures in terms of a generally accepted measure: adjusted R².
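For reference, the adjusted R² used as the comparison measure is the standard size-penalized fit statistic: for $n$ observations and a model with $p$ predictors,

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1},$$

so a smaller subset can outrank a larger one even when its raw $R^2$ is slightly lower.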

English Introduction

A regression analyst is commonly challenged to select the best subset from a set of predictor variables using some specified criterion. Historically, when there are many predictor variables, one or more subsets with fewer predictor variables are generated using a method of the analyst's choice. Given data of the form $\{(y_i, x_{1i}, \dots, x_{ki}),\ i = 1, \dots, n\}$, the subset selection problem involves selecting a subset $M \subseteq N$, where $N = \{1, \dots, k\}$ is the index set of the predictor variables $\{X_1, \dots, X_k\}$, such that some measure of the model's explanatory power is maximized. The main motivation for subset selection seems to be parsimony: "if 3 regressors can 'explain' or 'satisfactorily fit' a response Y, why use 4?" as Mandel (1989) notes. Some of the reasons for using only a subset of the available predictor variables (given by Miller, 1984) are:

• to estimate or predict at a lower cost by reducing the number of variables on which data are to be collected;
• to predict more accurately by eliminating uninformative variables;
• to describe multivariate data sets parsimoniously; and
• to estimate regression coefficients with smaller standard errors (particularly when some of the predictors are highly correlated).

A number of studies in the statistical literature discuss the problem of selecting the best subset of predictor variables in regression. Such studies focus on subset selection methodologies, selection criteria, or a combination of both. The traditional selection methodologies can be enumerative (e.g. all subsets and best subsets procedures), sequential (e.g. forward selection, backward elimination, stepwise regression, and stagewise regression procedures), or screening-based (e.g. ridge regression and principal components analysis). Standard texts like Draper and Smith (1998) and Montgomery and Peck (1992) provide clear descriptions of these methodologies. Newer methodologies, such as Broersen's (1986) stepwise directed search and Breiman's (1995) nonnegative garrote, have been developed more recently. Mitchell and Beauchamp (1988) develop a parallel approach to the subset selection problem from a Bayesian perspective. With respect to selection criteria, a number of measures have been proposed, such as adjusted R², Mallows's Cp, and Akaike's AIC. Once again, Draper and Smith (1998) and Montgomery and Peck (1992) offer adequate explanations of this topic.

Several papers can help the reader understand the state of the art in subset selection research. Hocking's (1976) early work provides a detailed overview of the field up to the mid-1970s. At about the same time, Berk (1978) reports a computational comparison of various selection procedures, and Thompson (1978a, 1978b) provides both a review and an evaluation of selection procedures and criteria. Subsequently, Miller (1984) offers a comprehensive survey of selection methods and criteria and discusses the potential pitfalls an analyst faces in using subset selection. Grechanovsky (1987) provides a somewhat similar account, though in a more limited way. Sparks, Zucchini, and Coutsourides (1985) examine the same issues, but for the case in which there are multiple Y variables. Hoerl, Schuenemeyer, and Hoerl (1986) report a computational study involving ridge regression and sequential and screening-based subset selection. Cavalier and Melloy (1991) use a mathematical programming approach to solve the n-dimensional linear Euclidean regression problem. More recently, Kashid and Kulkarni (2002) propose a new criterion, the Sp-criterion, for subset selection in multiple linear regression.
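To make the problem statement concrete, the brute-force "all subsets" baseline mentioned above can be sketched in a few lines: enumerate every nonempty subset $M \subseteq N$ and score it by adjusted R². This is only an illustrative sketch (feasible for small $k$, since there are $2^k - 1$ nonempty subsets), and the function and variable names are ours, not the paper's.

```python
import itertools
import numpy as np

def adjusted_r2(y, X):
    """Fit y on X by ordinary least squares (with intercept) and
    return the adjusted R^2 of the fit."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])           # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # OLS coefficients
    resid = y - A @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def all_subsets(y, X):
    """Score every nonempty subset M of {0, ..., k-1}; O(2^k) fits,
    so this is only sensible for small k."""
    k = X.shape[1]
    return {
        M: adjusted_r2(y, X[:, list(M)])
        for r in range(1, k + 1)
        for M in itertools.combinations(range(k), r)
    }

# Toy usage: column 2 is pure noise, so the subset (0, 1) should
# typically win on adjusted R^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)
scores = all_subsets(y, X)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The sequential procedures discussed above exist precisely because this enumeration becomes intractable as $k$ grows.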
Opinions regarding the advantages and disadvantages of the various procedures clearly differ, and no final word seems to be forthcoming. We propose a new mathematical programming approach to subset selection that is similar to the "all subsets" and "best subsets" procedures in that it concerns itself with the selection of good subsets. However, unlike the "all subsets" procedure, it identifies only a limited number of subsets, and, unlike the "best subsets" procedure, it uses a non-traditional selection criterion. The criterion is based on the intuition that, in a good model, the correlations between the Y variable and the X variables (Y–X correlations) should be high and those among the X variables (X–X correlations) should be low. The mathematical programming model developed is briefly described in Section 2. This model is solved parametrically to obtain a collection of efficient subsets, as described in Section 3. The parametric solution requires repeatedly solving a mathematical program, either optimally or using one of two heuristic algorithms: one based on a Lagrangian relaxation and the other on a greedy randomized adaptive search procedure (GRASP), both described in Section 4. Section 5 provides computational results in which the subsets generated by the proposed methods are compared with those generated by the standard sequential procedures found in most statistical packages, viz. forward selection, backward elimination, and stepwise regression. Finally, Section 6 offers some concluding remarks.
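Since Sections 2–4 of the paper are not reproduced on this page, the exact model and algorithms are not shown here. Purely as a hedged sketch of the stated intuition, one could score a subset by its mean absolute Y–X correlation minus its mean absolute pairwise X–X correlation, and search with a GRASP-style randomized greedy construction. The scoring rule, parameter values, and names below are our assumptions, not the authors' formulation, and the local-search phase of a full GRASP is omitted for brevity.

```python
import numpy as np

def corr_score(M, r_yx, r_xx):
    """Assumed criterion (not the paper's): mean |Y-X| correlation of the
    chosen variables minus the mean |X-X| correlation among them.
    r_yx: length-k vector of corr(Y, X_j); r_xx: k-by-k correlation matrix."""
    M = list(M)
    yx = np.abs(r_yx[M]).mean()
    if len(M) < 2:
        return yx
    sub = np.abs(r_xx[np.ix_(M, M)])
    xx = (sub.sum() - len(M)) / (len(M) * (len(M) - 1))  # drop the 1s on the diagonal
    return yx - xx

def grasp_construct(r_yx, r_xx, size, alpha=0.3, iters=200, seed=0):
    """GRASP-style search: repeated randomized greedy construction of a
    subset of the given size, keeping the best subset found."""
    rng = np.random.default_rng(seed)
    k = len(r_yx)
    best, best_val = None, -np.inf
    for _ in range(iters):
        M = []
        while len(M) < size:
            cand = [j for j in range(k) if j not in M]
            gains = np.array([corr_score(M + [j], r_yx, r_xx) for j in cand])
            # restricted candidate list: gains within an alpha fraction of the best
            cut = gains.max() - alpha * (gains.max() - gains.min())
            rcl = [c for c, g in zip(cand, gains) if g >= cut]
            M.append(int(rng.choice(rcl)))
        val = corr_score(M, r_yx, r_xx)
        if val > best_val:
            best, best_val = sorted(M), val
    return best, best_val
```

Here alpha = 0 reduces the construction to a pure greedy pass, while alpha = 1 makes it fully random; a complete GRASP would follow each construction with a local search (e.g. swap moves) before updating the incumbent.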

English Conclusion

Subset selection in multiple linear regression is a problem of great practical importance. There are various methods for subset selection and various selection criteria. While there is no clear consensus regarding which method is best or which criterion is most appropriate, there is general agreement that an effective method is needed. This paper proposes such a method, based on an intuitive mathematical programming model. The model, solved optimally or heuristically, generates a number of efficient subsets that compare favorably with those generated by standard methods.

Clearly, this paper does not put to rest the question of which subset selection method is best. However, the proposed approach has certain advantages. First, it quickly produces a reasonable number of high-quality subsets. Compared to the standard sequential procedures, which produce a single "best" model, the proposed approach provides the analyst with a set of "best" models lying on the efficient frontier. The analyst has the option of comparing these solutions with respect to his or her own experience in the specific context and also with respect to other statistical criteria. Thus, the proposed approach gives the analyst the flexibility to pick the best among the best. Second, the proposed mathematical model is quite flexible and can easily accommodate practical constraints, such as limiting the number of predictor variables or restricting the desired correlation structure; for example, the model can be adjusted so that two particular predictor variables are never included at the same time, or so that a fixed number of variables is included in the solutions generated (a sketch of such constraints follows below). Finally, the method may help the analyst decide which variables should be sampled before any data are collected: the examples solved in the paper use sample correlations calculated from the original data, but the model also allows the analyst to supply correlation values based on personal judgment before any data are collected.
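To illustrate the kinds of side constraints mentioned above: with binary variables $z_j$ indicating whether predictor $X_j$ enters the subset, such requirements take the familiar 0-1 programming form (a generic sketch, not the paper's exact formulation from Section 2):

$$z_i + z_j \le 1 \;\; \text{(never select } X_i \text{ and } X_j \text{ together)}, \qquad \sum_{j=1}^{k} z_j = p \;\; \text{(exactly } p \text{ predictors)}, \qquad z_j \in \{0, 1\}.$$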