Effects of measurement errors in predictor selection of the linear regression model
Paper code | Publication year | English paper pages |
---|---|---|
24238 | 2007 | 13 (PDF) |
Publisher: Elsevier - Science Direct
Journal : Computational Statistics & Data Analysis, Volume 52, Issue 2, 15 October 2007, Pages 1183–1195
Abstract
Measurement errors may affect the predictor selection of the linear regression model. These effects are studied using a measurement framework, where the variances of the measurement errors can be estimated without setting too restrictive assumptions about the measurement model. In this approach, the problem of measurement is solved in a reduced true score space, where the latent true score is multidimensional, but its dimension is smaller than the number of the measurable variables. Various measurement scales are then created to be used as predictors in the regression model. The stability of the predictor selection, as well as the estimated predictive validity and the reliability of the prediction scales, is examined by Monte Carlo simulations. Varying the magnitude of the measurement error variance, four sets of predictors are compared: all variables, a stepwise selection, factor sums, and factor scores. The results indicate that the factor scores offer a stable method for predictor selection, whereas the other alternatives tend to give biased results, leading more or less to capitalizing on chance.
Introduction
The predictor selection of the linear regression model is affected not only by the sampling variation, but also by the measurement errors. Let us assume that a predictor, say, x is measured with error. We can express this as x = τ + ε, where τ is the true value of the predictor and ε is the random measurement error. It is reasonable to assume that ε is uncorrelated with τ, and hence we can write the variance of x as var(x) = var(τ) + var(ε), where var(τ) represents the sampling variation and var(ε) represents the measurement error variation. Either of these may dominate in a given study. If the measurements are unreliable, we cannot improve the situation by increasing the sample size. Instead, we should have more accurate measurements. In many applications it would be preferable to reduce the effects of the measurement errors in the predictor selection, and hence make the models more stable. However, the measurement errors are often neglected in the statistical models, including perhaps the most widely applied one, the linear regression model. A classic treatment of measurement errors in regression models is provided by the errors-in-variables regression models, also called measurement error models (Cheng and Van Ness, 1999; Fuller, 1987). The fundamental assumption of those models is that each observed variable has its own true value, which is disturbed by a random measurement error. This assumption may lead to problems in the model identification, since the number of parameters to be estimated is easily greater than the number of equations. Hence, additional assumptions are needed. The usual procedure is to assume that the measurement error variances, or the reliabilities of the observed variables, are known (see, e.g., Kim and Saleh, 2005; Cheng and Van Ness, 1999; Gleser, 1992; Fuller, 1987). This may be a reasonable assumption in the physical sciences and engineering (see, e.g., Gleser, 1992, pp. 698–699).
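The decomposition var(x) = var(τ) + var(ε) implies the well-known attenuation of regression coefficients: the OLS slope of y on the error-contaminated x shrinks roughly in proportion to the reliability var(τ)/var(x). The following numpy sketch illustrates this with invented numbers (true slope 1, varying error standard deviation); it is an illustration of the general phenomenon, not a reproduction of the paper's simulations:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent true score tau and an outcome y with true slope 1.0 on tau.
tau = rng.normal(0.0, 1.0, n)
y = tau + rng.normal(0.0, 0.5, n)

for err_sd in (0.0, 0.5, 1.0):
    x = tau + rng.normal(0.0, err_sd, n)        # observed predictor: x = tau + eps
    slope = np.cov(x, y)[0, 1] / np.var(x)      # OLS slope of y on x
    reliability = np.var(tau) / np.var(x)       # var(tau) / (var(tau) + var(eps))
    # The slope is attenuated toward reliability * true slope.
    print(f"err_sd={err_sd:.1f}  reliability={reliability:.2f}  slope={slope:.2f}")
```

Increasing n does not help here: with err_sd = 1.0 the reliability stays near 0.5 and the slope stays biased toward 0.5, which is exactly the point that unreliable measurements cannot be fixed by a larger sample.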
However, in areas such as the social sciences or the behavioral sciences, it is usually unrealistic to assume that the reliabilities would be so well established that they could be treated as known. Taking independent replicate experiments to establish the magnitude of the measurement error (see, e.g., Gleser, 1992, pp. 698–699; Fuller, 1987, p. 106) does not provide a satisfactory solution in the above-mentioned fields, since the replicate measurements are seldom independent. In the most general form, the errors-in-variables regression models combine the regression model with the factor analysis model (Fuller, 1987, Section 4.3). Another method combining these two models has been called factor analysis regression (Scott, 1966; Lawley and Maxwell, 1973; Isogawa and Okamoto, 1980). It allows any one of the variables in the factor model to be the dependent variable and uses the regression method to solve a set of simultaneous equations. A more general approach for combining factor models and regression models is provided by structural equation modeling (Jöreskog, 1970; Bollen, 1989), which allows specifying and testing complicated models and relations that include measurement errors. The focus of these models is mainly on the structural relations, the connections between the latent variables. Our approach for regression modeling with measurement errors is based on the measurement framework (Tarkkonen and Vehkalahti, 2005; Vehkalahti, 2000; Tarkkonen, 1987), where the fundamental assumption, and thus the main difference compared with the errors-in-variables regression models, is that the observed variables are measuring a latent structure, whose dimension is considerably smaller than the number of the variables. Instead of focusing on the single true values of each observed variable, the problem of measurement is solved in a reduced true score space.
This approach allows us to estimate the measurement error variances without a need to make assumptions that might be unrealistic. In certain respects, our approach comes close to structural equation modeling, since it also employs the factor model, but instead of the connections between the latent variables we stress the connections established by the measurement scales, that is, the linear combinations of the observed variables. In this paper we take advantage of the measurement framework to study how the measurement errors affect the predictor selection of the linear regression model. We conduct Monte Carlo simulations based on a certain measurement structure, using four different sets of predictors, which are measurement scales created within the measurement framework. Section 2 reviews the basic concepts of the measurement framework and establishes the connection with the linear regression model. Section 3 describes the settings of the simulation studies. Section 4 presents the results and Section 5 concludes.
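The "reduced true score space" idea can be made concrete with a small data-generating sketch: a few latent true scores drive many observed variables through a factor model, X = fΛ' + ε. The loading matrix, latent dimension, and error variance below are invented for illustration and are not the paper's actual simulation design:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 2000, 6, 2    # observations, observed variables, latent dimension (k < p)

# Hypothetical loading matrix: each observed variable measures one of two true scores.
Lam = np.array([[0.8, 0.0],
                [0.7, 0.0],
                [0.6, 0.0],
                [0.0, 0.8],
                [0.0, 0.7],
                [0.0, 0.6]])

f = rng.normal(size=(n, k))                       # latent true scores
eps_sd = 0.5                                      # measurement error sd (the varied quantity)
X = f @ Lam.T + rng.normal(0.0, eps_sd, (n, p))   # observed variables

# Model-implied covariance: Sigma = Lam Lam' + eps_sd^2 * I.
# The sample covariance of the generated data should approximate it for large n.
Sigma = Lam @ Lam.T + eps_sd**2 * np.eye(p)
print(np.abs(np.cov(X.T) - Sigma).max())          # small discrepancy for large n
```

In such a structure the measurement error variance eps_sd**2 is identifiable from the off-diagonal covariance pattern, because the six observed variables share only two latent dimensions; this is the kind of setting in which the measurement framework estimates error variances without assuming them known.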
Conclusion
Our results based on simulation studies suggest that if the linear regression model is applied within the measurement framework approach, then the factor scores should be the predictors of choice. Several findings justify this. Firstly, the factor scores take advantage of the information from the measurement model, which separates the true variance from the measurement error variance. Hence the factor scores form the scale with the highest reliability. Secondly, the prediction scale of the factor scores gives the highest predictive validity corrected for attenuation. The attenuation correction does not work for the other scales, because they cannot separate the different sources of variation. Lastly, using the factor scores leads to stable regression coefficients; that is, on average the estimated coefficients stay on the same level, independent of the magnitude of the measurement error variance and the sample size. In addition, the coefficients of the factor score predictors are nearly always significant. Reliability, validity, and stability are quite important properties for predictors. If they are unacceptable, the coefficients and their interpretations will easily be affected by fluctuations of random measurement errors. Indeed, all the predictor sets in this study except the factor scores seem to lead to capitalizing on chance. Whether we use all the variables, a stepwise selection, or factor sums (i.e., variables weighted with 0 or 1 according to a factor structure), the results of the regression model become more or less unstable. To put it briefly, it is desirable to use the factor scores as predictors in the linear regression model, since in general this leads to more reliable, more valid, and more stable results.
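One way to see why 0/1-weighted factor sums lose information relative to factor scores is a one-factor sketch with unequal loadings. The loadings and error variance below are hypothetical, and the factor score uses regression-method (Thurstone) weights Σ⁻¹λ, one common choice rather than necessarily the paper's exact estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5000, 4
loadings = np.array([0.9, 0.6, 0.3, 0.1])     # hypothetical, deliberately unequal
err_sd = 0.6

tau = rng.normal(size=n)                      # single latent true score
X = np.outer(tau, loadings) + rng.normal(0.0, err_sd, (n, p))

# Factor sum: unit (0/1) weights on every item, ignoring loading sizes.
factor_sum = X.sum(axis=1)

# Regression-method factor score: weights proportional to Sigma^{-1} * lambda,
# where Sigma is the model-implied covariance of X.
Sigma = np.outer(loadings, loadings) + err_sd**2 * np.eye(p)
w = np.linalg.solve(Sigma, loadings)
factor_score = X @ w

for name, s in (("factor sum", factor_sum), ("factor score", factor_score)):
    print(name, round(np.corrcoef(s, tau)[0, 1], 3))   # correlation with true score
```

With unequal loadings the factor score correlates more strongly with the latent true score than the unit-weight sum does, because the weakly loading (noisy) items are down-weighted instead of counted equally; this is one way to read the finding that factor scores yield the most reliable scale.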