Regularized logistic regression without a penalty term: An application to cancer classification with microarray data
Article code | Publication year | English article length |
---|---|---|
24869 | 2011 | 9-page PDF |
Publisher: Elsevier - Science Direct
Journal: Expert Systems with Applications, Volume 38, Issue 5, May 2011, Pages 5110–5118
English Abstract
Regularized logistic regression is a useful classification method for problems with few samples and a huge number of variables. Fitting such a model requires determining the regularization term, which amounts to searching for the optimal penalty parameter and the norm of the regression coefficient vector. This paper presents a new regularized logistic regression method based on evolving the regression coefficients with estimation of distribution algorithms. The main novelty is that it avoids determining the regularization term altogether: the simulation method chosen for generating new coefficients at each step of the evolutionary process guarantees their shrinkage, acting as an intrinsic regularization. Experimental results compare the behavior of the proposed method with Lasso and ridge logistic regression on three cancer classification problems with microarray data.
English Introduction
Logistic regression (Hosmer & Lemeshow, 2000) is a simple and efficient supervised classification method that provides explicit probabilities of class membership and an easy interpretation of the regression coefficients of the predictor variables. The class variable is binary, while the explanatory variables may be of any type; no strong assumptions are required, such as Gaussianity of the predictor variables given the class or assumptions about the correlation structure. This lends the approach great flexibility, and it has shown very good performance in a variety of fields (Baumgartner et al., 2004; Kiang, 2003).

Many of the most challenging current classification problems involve extremely high dimensionality k (thousands of variables) and small sample sizes N (fewer than one hundred cases). This is the so-called "large k, small N" problem, which hinders proper parameter estimation when trying to build a classification model. Microarray data classification falls into this category.

In logistic regression we identify four problems in the "large k, small N" case. First, a large number of parameters (regression coefficients) have to be estimated from a very small number of samples, so an infinite number of solutions is possible and the problem is underdetermined. Second, multicollinearity is largely present: as the dimensionality of the model increases, so does the chance that a variable can be written as a linear combination of other predictor variables, thereby supplying no new information. Third, over-fitting may occur, i.e. the model may fit the training data well but perform badly on new samples. These problems yield unstable parameter estimates. Fourth, there are also computational problems due to the large number of predictor variables: traditional algorithms for finding the estimates numerically, like the Newton–Raphson method (Thisted, 1988), require prohibitive computations to invert a huge, sometimes singular, matrix at each iteration.

Within the context of logistic regression, the "large k, small N" problem has been tackled on three fronts: dimensionality reduction, feature (or variable) selection, and regularization, or sometimes a combination of them.

As regards dimensionality reduction, principal component analysis is one of the most widespread methods (Aguilera, Escabias, & Valderrama, 2006). This preprocessing of the high-dimensional variables outputs transformed variables, of which only a reduced set is used as the classifier inputs. The main drawback is that principal components tend to involve all the original variables in their expressions. As a result, the information requirements of applying the model are not reduced, and interpretability of the variables is lost. Furthermore, there is no guarantee that class separability coincides with the selected principal components (Weber, Vinterbo, & Ohno-Machado, 2004). Other methods, such as partial least squares (Antoniadis, Lambert-Lacroix, & Leblanc, 2003) or adaptive dimension reduction through regression (Nguyen & Rocke, 2002), have also been used.

Feature selection methods yield parsimonious models, which reduce information costs, are easier to explain and understand, and increase model applicability and robustness. The selected features are those that discriminate well between the different classes, and they may be sought via different heuristic search approaches (Liu & Motoda, 2008).
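For reference, the model under discussion can be written out explicitly. The following is a sketch in standard notation (the paper itself introduces the linear predictor as ηj in its Eq. (1)); the symbols are the usual ones and are not quoted verbatim from the paper.

```latex
% Binary logistic regression: linear predictor and class-membership probability
\[
  \eta_j = \beta_0 + \sum_{i=1}^{k} \beta_i x_{ji},
  \qquad
  P(Y_j = 1 \mid \mathbf{x}_j) = \frac{e^{\eta_j}}{1 + e^{\eta_j}} .
\]
% Unpenalized log-likelihood maximized over \boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_k):
\[
  \ell(\boldsymbol{\beta}) = \sum_{j=1}^{N} \Bigl[\, y_j \eta_j - \log\bigl(1 + e^{\eta_j}\bigr) \Bigr].
\]
```

With k in the thousands and N below one hundred, maximizing ℓ(β) by Newton–Raphson means repeatedly inverting a (k + 1) × (k + 1), often singular, matrix, which is precisely where the estimation and computational problems listed above arise.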
The goodness of a proposed feature subset may be assessed via an initial screening process using a scoring metric. The metric is based on intrinsic characteristics of the data, computed from simple statistics on the empirical distribution, and totally ignores the effect of the selected features on classifier performance. This is the so-called filter approach to feature selection in machine learning, or screening in statistics (West et al., 2001). By contrast, the wrapper approach searches for good subsets using the classifier itself as part of the evaluation function (Kohavi & John, 1997): a performance estimate of the classifier trained with each subset assesses the merit of that subset. Some recent studies combine the filter and wrapper approaches (Uncu & Türksen, 2007). In the context of logistic regression and k ≫ N, Lee, Lee, Park, and Song (2005) propose different filter metrics to select a fixed number of top-ranked features, always fewer than the sample size. Avoiding the curse of dimensionality in a similar way, Weber et al. (2004) perform a preliminary feature selection by choosing the N − 1 variables maximally correlated with the class variable; in a second phase, a logistic regression model is built with the selected features and is further simplified via backward variable selection.

The third front for tackling the "large k, small N" problem is regularization. Regularization methods impose a penalty on the size of the logistic regression coefficients, shrinking them towards zero. Regularized estimators are therefore restricted maximum likelihood estimators (MLEs), since they maximize the likelihood function subject to restrictions on the logistic regression parameters. The small bias that is allowed provides more stable estimates with smaller variance. Regularization methods are also more continuous than the usual discrete processes of retaining or discarding features, and thus do not suffer as much from high variability (Hastie, Tibshirani, & Friedman, 2001). This shrinkage of coefficients was initially introduced in the ordinary linear regression scenario by Hoerl and Kennard (1970), where the restrictions were spherical; this is the so-called ridge or quadratic (penalized) regression. Lee and Silvapulle (1988) and Le Cessie and van Houwelingen (1992) extended the framework to logistic regression. Ridge estimators are expected to be, on average, closer to the real values of the parameters than the ordinary unrestricted MLEs, i.e. to have smaller mean-squared error. See Fan and Li (2006) and Bickel and Li (2006) for recent developments and a unified conceptual framework of regularization theory.

Here we introduce estimation of distribution algorithms (EDAs) as intrinsic regularizers within the logistic regression context. EDAs are optimization heuristics belonging to the class of stochastic population-based search methods (Larrañaga & Lozano, 2002; Lozano et al., 2006; Pelikan, 2005). EDAs work by constructing an explicit probability model from a set of selected solutions, which is then used to generate new promising solutions in the next iteration of the evolutionary process. In our proposal, an EDA obtains the regularized estimates in a direct way, in the sense that the objective function to be optimized is still the likelihood, without any regularization term; it is a specifically chosen simulation process during the evolution that intrinsically accounts for the regularization. EDAs receive the unrestricted likelihood equations as inputs and generate the restricted MLEs as outputs.
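To make this concrete: ridge penalizes the log-likelihood ℓ(β) defined above by subtracting λ Σi βi², and Lasso by subtracting λ Σi |βi|, whereas the EDA proposal keeps plain ℓ(β) as the fitness function. The sketch below is a minimal, illustrative continuous EDA (UMDA-style, with independent Gaussian marginals) maximizing the unpenalized logistic log-likelihood; it is not the authors' algorithm, and all function names, population sizes, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Unpenalized logistic log-likelihood; X already carries an intercept column."""
    eta = X @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))   # sum_j [y_j*eta_j - log(1 + e^eta_j)]

def umda_logistic(X, y, pop_size=200, n_select=50, n_gens=100, init_sd=1.0, seed=0):
    """UMDA-style continuous EDA: sample coefficient vectors from independent
    Gaussian marginals, select the best by likelihood, refit the marginals, repeat."""
    rng = np.random.default_rng(seed)
    k = X.shape[1]
    mu, sd = np.zeros(k), np.full(k, init_sd)
    best_beta, best_ll = mu.copy(), log_likelihood(mu, X, y)
    for _ in range(n_gens):
        pop = rng.normal(mu, sd, size=(pop_size, k))             # simulate new solutions
        lls = np.array([log_likelihood(b, X, y) for b in pop])
        elite = pop[np.argsort(lls)[-n_select:]]                 # truncation selection
        mu, sd = elite.mean(axis=0), elite.std(axis=0) + 1e-12   # re-estimate the marginals
        if lls.max() > best_ll:
            best_ll, best_beta = lls.max(), pop[np.argmax(lls)].copy()
    return best_beta, best_ll

# Toy "large k, small N"-style run (synthetic data; dimensions are illustrative).
rng = np.random.default_rng(1)
N, k = 40, 100
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, k))])        # intercept + k predictors
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X[:, 1]))).astype(float)
beta_hat, ll_hat = umda_logistic(X, y)
print(f"best log-likelihood found: {ll_hat:.3f}")
```

Because each generation re-estimates the Gaussian marginals from the selected solutions only, the sampling variance tends to collapse and the sampled coefficients stay close to the population mean; this loosely mirrors the intrinsic shrinkage the paper attributes to its specifically chosen simulation step, whose actual form is described in Section 3.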
The paper is organized as follows. Section 2 reviews both the classical and regularized versions of the logistic regression model. Section 3 describes EDAs and how we propose to use them to solve the regularized case. Experimental studies on several microarray data sets, a prime example of the "large k, small N" problem, are presented in Section 4. Finally, Section 5 includes some conclusions and future work.
English Conclusion
We have introduced a novel EDA-based approach that finds a regularized logistic classifier. The EDA is not hampered by situations where the number of covariates is large relative to the number of observations. By shrinking the coefficients intrinsically during its evolution process while optimizing the usual likelihood function, our approach behaves like a regularized logistic classifier: the EDA receives the unrestricted likelihood equations as inputs and generates the restricted MLEs as outputs.

Our proposal yields significantly better performance on the relevant AUC measure than ridge and Lasso logistic regression. The classification accuracy achieved outperforms that of Lasso, although it is worse than the accuracy obtained with ridge logistic regression. Our evolutionary strategy takes longer to find the coefficient estimates, ridge and Lasso logistic regression being faster; however, run times are still negligible. Finally, we have shown that our regularization is effective in stabilizing the regression parameter estimates. The intrinsic regularizer presented here therefore emerges as a good candidate in the regularized logistic regression context.

Future directions to be explored include EDA approaches that take into account more complex probabilistic conditional dependencies among the βi parameters, perhaps at the expense of a higher computational cost; traditional numerical methods are unable to provide this kind of information. The inclusion of interaction terms among (possibly co-regulated) genes in ηj of Eq. (1) would also be feasible. Finally, unlike traditional numerical procedures, the EDA approach could be used in a more direct way, as a method able to optimize any objective function, regardless of its complexity or the lack of an explicit formula for it. Thus, an EDA could find parameters that maximize any regularized logistic regression objective (Lasso, bridge…) or even the AUC itself. The difficulty of dealing with the AUC directly as the objective function is pointed out by Ma and Huang (2005), who use an approximation to it instead. Nevertheless, it is the original, intrinsic way of shrinking the regression coefficients embedded in some of the EDA steps that constitutes the main contribution of this paper.