Influential data cases when the Cp criterion is used for variable selection in multiple linear regression
|Article code||Publication year||English article||Persian translation||Word count|
|24198||2006||15-page PDF||available to order||7067 words|
Publisher: Elsevier - Science Direct
Journal: Computational Statistics & Data Analysis, Volume 50, Issue 7, 1 April 2006, Pages 1840–1854
The influence of data cases when the Cp criterion is used for variable selection in multiple linear regression analysis is studied in terms of the predictive power and the predictor variables included in the resulting model when variable selection is applied. In particular, the focus is on the importance of identifying and dealing with these so-called selection influential data cases before model selection and fitting are performed. A new selection influence measure based on the Cp criterion to identify selection influential data cases is developed. The success with which this influence measure identifies selection influential data cases is evaluated in two example data sets.
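For concreteness, the Cp criterion referred to throughout is Mallows' statistic, Cp = RSS_p / σ̂² − n + 2p, where RSS_p is the residual sum of squares of the candidate subset model, σ̂² is the residual variance estimate from the full model, and p counts the fitted coefficients. The sketch below is illustrative only (the function name and intercept handling are choices of this note, not the paper's):

```python
import numpy as np

def mallows_cp(X_full, y, subset):
    """Mallows' Cp for the model using the columns in `subset`.

    Cp = RSS_p / sigma2_full - n + 2p, with sigma2_full the residual
    variance estimate from the full model and p the number of fitted
    coefficients (an intercept is added here).
    """
    n = len(y)

    def rss_and_p(X):
        Xd = np.column_stack([np.ones(n), X])        # add intercept column
        beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        resid = y - Xd @ beta
        return resid @ resid, Xd.shape[1]

    rss_full, p_full = rss_and_p(X_full)
    sigma2 = rss_full / (n - p_full)                 # full-model variance estimate
    rss_sub, p_sub = rss_and_p(X_full[:, subset])
    return rss_sub / sigma2 - n + 2 * p_sub
```

A useful sanity check: for the full model itself, Cp reduces algebraically to p, the number of fitted coefficients.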
Multiple linear regression analysis is a widely used and well documented statistical procedure. Two aspects of regression analysis which have been particularly well investigated are identifying and dealing with influential data cases, and selecting a subset of the explanatory variables for use in the regression function. Standard references on the first issue include Cook (1977), Belsley et al. (1980) and Atkinson and Riani (2000), while Burnham and Anderson (2002) and Miller (2002) provide recent overviews of the second issue. Although influential data cases and variable selection have separately been extensively dealt with in the literature, relatively little has been published on investigations into a combination of these two problems. We briefly refer to some of the relevant references. Chatterjee and Hadi (1988) propose measuring the effect of simultaneous omission of a variable and an observation from the data set in terms of changes in the values of the least squares regression coefficients, the residual sum of squares, the fitted values, and the predicted value of the omitted observation. Peixoto and Lamotte (1989) investigate a procedure which adds a dummy variable for each observation to the explanatory variables. Variable selection is then performed, and observations corresponding to selected dummy variables are pronounced to be influential. Léger and Altman (1993) identify conditional and unconditional approaches to the problem of identifying influential data cases in a variable selection context. In the conditional approach the full data set is used to select a set of explanatory variables, and case diagnostics are then calculated conditional on this model, i.e., the set of selected variables remains fixed.
In the unconditional approach we apply variable selection to the full data set and calculate a vector of fitted values; we then omit the data case under consideration from the data set and repeat the variable selection as well as calculation of the vector of fitted values; finally, a standardised distance between the two vectors of fitted values is calculated to measure the influence of the omitted case. Léger and Altman (1993) argue that the unconditional approach is preferable since it explicitly takes the variable selection into account when trying to quantify the influence of a given data case. Arguing along similar lines, Hoeting et al. (1996) point out that the model which is selected can depend upon the order in which variable selection and outlier identification are carried out. They therefore propose a Bayesian method which can be used to simultaneously select variables and identify outliers. In this paper we restrict attention to variable selection using the Cp statistic proposed by Mallows (1973). Our contribution is the introduction of a new p-value-based procedure for identifying influential data cases in this context. Weisberg (1981) shows how the Cp statistic can be written as a sum of n terms (where n is the number of data cases), with each term in the sum corresponding to one of the n cases. In Section 2 of this paper we provide a brief exposition of the coordinate free approach to linear model selection, and in Section 3 we will see that the breakup of the Cp statistic described by Weisberg (1981) can also be formulated within the coordinate free approach. Section 4 of the paper is devoted to a discussion of the p-value based procedure for identification of influential data cases in a variable selection context, and Section 5 contains two examples illustrating application of the procedure. We close in Section 6 with conclusions and open questions.
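The unconditional leave-one-out scheme described above can be sketched as follows. This is an illustrative implementation only: it assumes an exhaustive Cp search over subsets (feasible for few predictors), and the particular residual-based standardisation of the distance is an assumption of this sketch, not Léger and Altman's definition.

```python
import numpy as np
from itertools import combinations

def fit_ols(X, y):
    # least-squares fit with an intercept column added
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def predict_ols(X, beta):
    return np.column_stack([np.ones(X.shape[0]), X]) @ beta

def best_subset_cp(X, y):
    # exhaustive Cp search over all non-empty column subsets
    n, k = X.shape
    rss_full = np.sum((y - predict_ols(X, fit_ols(X, y))) ** 2)
    sigma2 = rss_full / (n - (k + 1))          # full-model variance estimate
    best, best_cp = None, np.inf
    for r in range(1, k + 1):
        for cols in combinations(range(k), r):
            cols = list(cols)
            rss = np.sum((y - predict_ols(X[:, cols], fit_ols(X[:, cols], y))) ** 2)
            cp = rss / sigma2 - n + 2 * (r + 1)
            if cp < best_cp:
                best, best_cp = cols, cp
    return best

def unconditional_influence(X, y):
    # for each case i: reselect and refit without case i, then take a
    # standardised distance between the two vectors of fitted values
    n = X.shape[0]
    cols_all = best_subset_cp(X, y)
    fitted_all = predict_ols(X[:, cols_all], fit_ols(X[:, cols_all], y))
    scale = np.sum((y - fitted_all) ** 2) / n  # crude standardisation (an assumption)
    influence = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        cols_i = best_subset_cp(X[keep], y[keep])
        beta_i = fit_ols(X[keep][:, cols_i], y[keep])
        fitted_i = predict_ols(X[:, cols_i], beta_i)
        influence[i] = np.sum((fitted_all - fitted_i) ** 2) / (n * scale)
    return influence
```

A case whose omission changes which variables are selected will typically show a much larger influence value than one that merely perturbs the coefficients of a fixed model.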
Conclusion (English)
In this paper we indicated how Mallows' Cp statistic for a given subset of predictor variables can be expressed as a sum of n terms, each term corresponding to one of the data cases. A basic problem arising from this representation of the Cp statistic is how to decide whether a specific term in such a representation is significantly small or large, which would serve as an indication that the data case concerned is selection influential with respect to the subset concerned. Our proposal for making such a decision is based on estimating p-values corresponding to given data cases and given subsets of predictor variables. Two approaches for obtaining such p-values were described, leading to p_ij(C) in (17) and p_ij(D) in (20), respectively. Although the p_ij(C)-values would seem to be based on a firmer theoretical basis than the p_ij(D)-values, we found in the examples described in Section 5 (and also in other examples not reported here) that the p_ij(D)-values provide a much clearer indication of potentially selection influential data cases. A definite avenue of further research is an investigation into other approaches for deciding whether a given value in a representation of the Cp statistic as a sum of n terms is significantly small or large. Several other questions also deserve further investigation. (a) Should data cases which have been identified as selection influential be omitted from the data set, or should these cases rather be down-weighted before a model is selected and fitted? (b) In this paper attention was restricted to variable selection making use of the Cp statistic. This choice can be justified from the fact that in practical applications the Cp criterion is frequently used for variable selection. There are of course many other selection criteria that can be used, and investigating selection influence for these other criteria will be a worthwhile exercise.
(c) Using the p-value approach to decide whether data points are selection influential requires comparing calculated (or estimated) p-values to a given significance level. Specifying the latter is a problem that deserves further attention. For example, should we take into account the fact that n × t p-values are compared to the significance level, and adjust the p-value for this multiplicity of comparisons? One possibility would be to use a Bonferroni type of adjustment, but this requires further investigation.
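As a concrete illustration of the Bonferroni-type adjustment raised in (c): the significance level α is divided by the total number of comparisons (one p-value per case–subset combination), so each individual test is held to a stricter threshold. A minimal sketch; the helper name is illustrative, not from the paper:

```python
import numpy as np

def bonferroni_flags(p_values, alpha=0.05):
    # flag p-values that remain significant after dividing the
    # significance level by the total number of comparisons made
    p = np.asarray(p_values, dtype=float)
    return p < alpha / p.size
```

With three comparisons and α = 0.05, each p-value is compared against 0.05 / 3 ≈ 0.0167, so a p-value of 0.04 that would be "significant" on its own is no longer flagged.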