English ISI article No. 24718

Title
Using principal components for estimating logistic regression with high-dimensional multicollinear data
Article code: 24718
Year of publication: 2006
Length of English article: 20 pages (PDF)
Source

Publisher: Elsevier - ScienceDirect

Journal: Computational Statistics & Data Analysis, Volume 50, Issue 8, 10 April 2006, Pages 1905–1924

Keywords
Logistic regression, Multicollinearity, Principal components
Article preview

Abstract

The logistic regression model is used to predict a binary response variable in terms of a set of explanatory variables. When there is multicollinearity (high dependence) among the predictors, the estimation of the model parameters is inaccurate and their interpretation in terms of odds ratios may be erroneous. Another important problem is the large number of explanatory variables usually needed to explain the response. In order to improve the estimation of the logistic model parameters under multicollinearity and to reduce the dimension of the problem with continuous covariates, it is proposed to use a reduced set of optimum principal components of the original predictors as covariates of the logistic model. Finally, the performance of the proposed principal component logistic regression model is analyzed through a simulation study in which different methods for selecting the optimum principal components are compared.

Introduction

In many fields of study, such as medicine and epidemiology, it is very important to predict a binary response variable, or equivalently the probability of occurrence of an event (success), in terms of the values of a set of explanatory variables related to it. An example is predicting the probability of suffering a heart attack in terms of the levels of a set of risk factors such as cholesterol and blood pressure. The logistic regression model serves this purpose admirably and is the most widely used in these cases, as can be seen, for example, in Prentice and Pyke (1979). As many authors have stated (Hosmer and Lemeshow (1989) and Ryan (1997), among others), the logistic model becomes unstable when there is strong dependence among the predictors, so that no single variable seems important when all the others are in the model (multicollinearity). In this case the estimation of the model parameters given by most statistical packages becomes too inaccurate because of the need to invert near-singular, ill-conditioned information matrices. As a consequence, the interpretation of the relationship between the response and each explanatory variable in terms of odds ratios may be erroneous. In spite of this, the usual goodness-of-fit measures show that in these cases the estimated probabilities of success are good enough. In the general context of generalized linear models, Marx and Smith (1990) and Marx (1992) solve this problem by introducing a class of estimators based on the spectral decomposition of the information matrix defined by a scaling parameter. As in many other regression methods, in logistic regression it is usual to have a very large number of predictor variables, so that a dimension-reduction method is needed.
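
To make the ill-conditioning concrete, the following minimal Python sketch builds the information matrix X'WX of a two-predictor logit model without intercept and compares its condition number for independent and for nearly collinear predictors. The simulated data, sample size, noise level, coefficients and function names are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def information_matrix(x1, x2, beta1, beta2):
    """Entries (a, b, d) of the symmetric matrix X'WX = [[a, b], [b, d]]
    for a logit model with two predictors and no intercept."""
    a = b = d = 0.0
    for u, v in zip(x1, x2):
        p = 1.0 / (1.0 + math.exp(-(beta1 * u + beta2 * v)))
        w = p * (1.0 - p)  # logistic variance weight
        a += w * u * u
        b += w * u * v
        d += w * v * v
    return a, b, d


def condition_number(a, b, d):
    """Largest eigenvalue divided by smallest eigenvalue of [[a, b], [b, d]]."""
    lam_max = (a + d) / 2.0 + math.hypot((a - d) / 2.0, b)
    lam_min = (a * d - b * b) / lam_max  # product of eigenvalues = determinant
    return lam_max / lam_min


random.seed(0)
n = 200  # assumed sample size
x1 = [random.gauss(0, 1) for _ in range(n)]
x2_independent = [random.gauss(0, 1) for _ in range(n)]
x2_collinear = [u + 0.01 * random.gauss(0, 1) for u in x1]  # almost a copy of x1

cond_independent = condition_number(*information_matrix(x1, x2_independent, 1.0, 1.0))
cond_collinear = condition_number(*information_matrix(x1, x2_collinear, 1.0, 1.0))

print("condition number, independent predictors:", cond_independent)
print("condition number, collinear predictors:  ", cond_collinear)
```

A large condition number means that inverting the matrix amplifies small perturbations, which is why parameter estimates become unstable under multicollinearity even when fitted probabilities remain reasonable.
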
Principal component analysis (PCA) is a multivariate technique introduced by Hotelling that explains the variability of a set of variables in terms of a reduced set of uncorrelated linear combinations of those variables with maximum variance, known as principal components (pc's). The purpose of this paper is to reduce the dimension of a logistic regression model with continuous covariates and to provide an accurate estimation of the parameters of the model while avoiding multicollinearity. In order to solve these problems we propose to use as covariates of the logistic model a reduced number of pc's of the predictor variables. The paper is divided into four sections. Section 1 is an introduction. Section 2 gives an overview of logistic regression. Section 3 introduces the principal component logistic regression (PCLR) model as an extension of the principal component regression (PCR) model introduced by Massy (1965) in the linear case. It also proposes two different methods to solve the problem of choosing the optimum pc's to be included in the logit model. The first includes pc's in the natural order given by their explained variances; the second enters pc's in the model by a stepwise method based on conditional likelihood-ratio tests that take into account their ability to explain the response variable. The optimum number of pc's needed in each method (stopping rule) is also addressed in Section 3, where we propose and discuss several criteria based on minimizing the error with respect to the estimated parameters. Finally, the accuracy of the estimates provided by the proposed PCLR models and the performance of the different methods for choosing the optimum models are tested in a simulation study in Section 4. The results are also compared with those provided by the partial least-squares logit regression (PLS-LR) algorithm proposed by Bastien et al. (2005) for estimating the logistic regression model.
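
The idea of using pc's as covariates of the logit model can be illustrated with a minimal Python sketch. It is restricted to two nearly collinear predictors so that the first pc has a closed form, retains only that pc (the natural-order choice with one component), fits the logit model on it by Newton-Raphson, and maps the coefficient back to the original variables. The simulated data, the use of the covariance matrix of centered predictors, and all function and variable names are assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def first_pc_direction(x1, x2):
    """Unit eigenvector of the largest eigenvalue of the 2x2 sample
    covariance matrix of two centered variables."""
    n = len(x1)
    s11 = sum(u * u for u in x1) / (n - 1)
    s22 = sum(v * v for v in x2) / (n - 1)
    s12 = sum(u * v for u, v in zip(x1, x2)) / (n - 1)
    theta = 0.5 * math.atan2(2.0 * s12, s11 - s22)  # angle of the major axis
    return math.cos(theta), math.sin(theta)


def fit_logit_one_covariate(z, y, iterations=25):
    """Newton-Raphson for the model logit P(y = 1) = alpha + gamma * z."""
    alpha = gamma = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, yi in zip(z, y):
            p = sigmoid(alpha + gamma * zi)
            w = p * (1.0 - p)
            g0 += yi - p            # score for alpha
            g1 += (yi - p) * zi     # score for gamma
            h00 += w                # information matrix entries
            h01 += w * zi
            h11 += w * zi * zi
        det = h00 * h11 - h01 * h01
        alpha += (h11 * g0 - h01 * g1) / det
        gamma += (h00 * g1 - h01 * g0) / det
    return alpha, gamma


# Simulated data (assumed): x2 is almost a copy of x1, true logit is x1 + x2.
random.seed(1)
n = 300
raw1 = [random.gauss(0, 1) for _ in range(n)]
raw2 = [u + 0.05 * random.gauss(0, 1) for u in raw1]
y = [1 if random.random() < sigmoid(u + v) else 0 for u, v in zip(raw1, raw2)]

# Center the predictors and compute the scores of the first pc.
m1, m2 = sum(raw1) / n, sum(raw2) / n
x1 = [u - m1 for u in raw1]
x2 = [v - m2 for v in raw2]
v1, v2 = first_pc_direction(x1, x2)
z = [v1 * a + v2 * b for a, b in zip(x1, x2)]

# Fit the logit model on the pc, then express it in the original variables.
alpha, gamma = fit_logit_one_covariate(z, y)
beta1, beta2 = gamma * v1, gamma * v2
beta0 = alpha - beta1 * m1 - beta2 * m2

print("pc loadings:", v1, v2)
print("reconstructed coefficients:", beta0, beta1, beta2)
```

Because the two predictors carry almost the same information, the single retained pc weights them almost equally, and the reconstructed coefficients share the effect between them instead of splitting it erratically.
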

Conclusion

This paper focuses on solving the problem of high-dimensional multicollinear data in the logit model, which explains a binary response variable from a set of continuous predictor variables. In order to solve this problem and to obtain an accurate estimation of the parameters in this case, a pc-based solution has been proposed. Based on the simulation study developed in this work, where different sample sizes, numbers of predictors and distribution schemes have been considered, it can be concluded that the proposed PCLR models provide an accurate estimation of the parameters of a logit model in the case of multicollinearity by using as covariates a reduced set of the pc's of the original variables. In order to select the optimum PCLR model, two different methods for including pc's in the model have been considered and compared. On the one hand, Method I includes pc's in the model according to their explained variances. On the other hand, Method II considers a stepwise procedure for selecting pc's based on conditional likelihood-ratio tests. Different accuracy measures with respect to the estimated parameters have also been introduced for selecting the optimum number of pc's. Method II, which takes into account the relationship between the response and the predictor pc's, has been chosen as the best because it provides better parameter estimates with a smaller number of pc's (greater reduction of dimension). Finally, with respect to the comparison with PLS-LR, the PCLR model provides better estimation of the logit model parameters (lower MSEB) with similar goodness-of-fit measures (MSEP and CCR) and needs fewer components, so that the interpretation of the model parameters is more accurate.
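
The accuracy measures named above are not defined in this excerpt. The short Python sketch below assumes the usual readings of the acronyms: MSEB as the mean squared error of the estimated coefficients against the true ones, MSEP as the mean squared error of the estimated probabilities against the true ones, and CCR as the correct classification rate at a cutoff of 0.5. These definitions, the cutoff and the function names are assumptions and may differ in detail from those used in the paper.

```python
def mseb(beta_hat, beta_true):
    """Mean squared error of estimated coefficients (assumed definition)."""
    return sum((b - t) ** 2 for b, t in zip(beta_hat, beta_true)) / len(beta_true)


def msep(p_hat, p_true):
    """Mean squared error of estimated probabilities (assumed definition)."""
    return sum((p - t) ** 2 for p, t in zip(p_hat, p_true)) / len(p_true)


def ccr(p_hat, y, cutoff=0.5):
    """Share of observations whose predicted class matches the observed one."""
    hits = sum((p >= cutoff) == bool(obs) for p, obs in zip(p_hat, y))
    return hits / len(y)


# Toy values chosen by hand purely to show the calls.
print(mseb([1.0, 2.0], [1.0, 4.0]))
print(msep([0.5, 0.5], [0.5, 1.0]))
print(ccr([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0]))
```

Note that MSEB and MSEP require the true parameters and probabilities, so they are only available in a simulation setting such as the one described here.
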