Variable and boundary selection for functional data via multiclass logistic regression modeling
|Article code||Publication year||English article||Persian translation||Word count|
|25003||2014||10-page PDF||Available to order||6140 words|
Publisher : Elsevier - ScienceDirect
Journal : Computational Statistics & Data Analysis, Volume 78, October 2014, Pages 176–185
Penalties with an ℓ1 norm provide solutions in which some coefficients are exactly zero and can be used for selecting variables in regression settings. When applied to the logistic regression model, they can also be used to select variables which affect classification. We focus on the form of ℓ1 penalties in logistic regression models for functional data, in particular, their use in classifying functions into three or more groups while simultaneously selecting variables or classification boundaries. We provide penalties that appropriately select the variables in functional multiclass logistic regression models. Analyses of simulated and real data show that the form of the penalty should be selected in accordance with the purpose of the analysis.
Variable selection is a crucial issue in regression analysis. Several methods have been proposed for the accurate and effective selection of appropriate variables (see, e.g., Burnham and Anderson, 2002). The lasso by Tibshirani (1996) and its extensions or refinements (Fan and Li, 2001; Zou and Hastie, 2005; Zou, 2006) provide a unified approach to problems of estimating and selecting variables, and for this reason they are broadly applied in several fields; an overview is provided in Hastie et al. (2009).

In this paper, we consider the problem of classifying data while simultaneously selecting variables which affect the classification problem, by applying ℓ1-type penalties to logistic regression models. The logistic regression model is one of the most useful tools for classifying data, and it does so by providing posterior probabilities which place the data in the appropriate group (McCullagh and Nelder, 1989). Logistic regression models that use ℓ1 regularization have been investigated as generalized linear models in Park and Hastie (2007). They considered binomial logistic regression models, whereas we consider classifying data into three or more groups using the multinomial or multiclass logistic regression model. Krishnapuram et al. (2005) and Friedman et al. (2010) applied ℓ1-type penalties to the model as natural extensions of the binomial logistic regression models.

On the other hand, each variable has multiple parameters in both the multinomial logistic regression model and the multivariate linear model. There have been several studies of ℓ1-type regularization for the multivariate linear model. Turlach et al. (2005) proposed a new penalty that can be used to estimate multivariate linear models. They imposed an ℓ1 sum of the maximum absolute values (ℓ∞ norm) of the coefficients with respect to multiple responses, and they also generalized it to the ℓ1 sum of ℓq (q ≥ 1) penalties. Following this, Yuan et al. (2007) and Obozinski et al. (2011) denoted the penalty by ℓ1/ℓq and investigated its theoretical properties. It can be viewed as an extension of the group lasso (Yuan and Lin, 2006; Meier et al., 2008). Furthermore, Obozinski et al. (2010) proposed a new algorithm for estimating a multitask logistic regression model by using the ℓ1/ℓq regularization for q = 1, 2.

When the data to be classified have been measured repeatedly over time, they can be represented by a functional form. Ramsay and Silverman (2005) established this type of analysis and called it functional data analysis (FDA). FDA is one of the most useful methods for effectively analyzing discretely observed data, and it has received considerable attention in various fields (Ramsay and Silverman, 2002; Ferraty and Vieu, 2006). The basic idea behind FDA is to express repeated measurement data for each individual as a smooth function and then to draw information from the collection of these functions. FDA includes extensions of traditional methods, such as principal component analysis, discriminant analysis, and regression analysis (James et al., 2000; James, 2002). For regression models, there are various methods, such as functional versions of logistic regression models (Aguilera and Escabias, 2008; Aguilera-Morillo et al., 2013; Escabias et al., 2004; Escabias et al., 2007), generalized linear models (Cardot and Sarda, 2005; Müller and Stadtmüller, 2005; Li et al., 2010; Goldsmith et al., 2011), and generalized additive models (Reiss and Ogden, 2010). Furthermore, the problem of variable selection for functional regression models using ℓ1-type regularization is considered in Ferraty et al. (2010), Aneiros et al. (2011), Matsui and Konishi (2011), Zhao et al. (2012), Gertheiss et al. (2013), and Mingotti et al. (2013). However, these works do not include the multiclass logistic regression model. For this model, we may fail to select functional variables when we use existing types of penalties, since it has multiple coefficients for multiple classification boundaries.

In this paper, we consider the problem of using ℓ1-type regularization to select the variables for classifying functional data by using the multiclass logistic regression model. Data from repeated measurements are represented by basis expansions, and the functional logistic regression model is estimated by the penalized maximum likelihood method with the help of ℓ1-type penalties. By extending the ℓ1/ℓq penalties, we propose a new class of penalties, denoted by ℓ1ℓ2/ℓq, for appropriately estimating and selecting variables or boundaries for the functional multiclass logistic regression model. Since the basis expansion produces multiple parameters for each variable and each classification boundary, we use the group lasso to treat them as grouped parameters. Here we consider the cases q = 1 and q = 2. When q = 1, instead of selecting the variables themselves, we select classification boundaries for each variable; however, when q = 2, we can select the variables that are given as functions by grouping all the coefficients for each variable. The estimated model is evaluated by a selection criterion, since its evaluation is a crucial issue. In order to investigate the effectiveness of the proposed penalty, we conducted Monte Carlo simulations and analyzed actual data.

This paper is organized as follows. Section 2 provides a multiclass logistic regression model for functional data. Section 3 shows a method for estimating and evaluating the model. We apply the proposed method to the analysis of simulated and real data in Sections 4 and 5, respectively. Concluding remarks are given in Section 6.
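The nested structure of the ℓ1ℓ2/ℓq penalty described in the introduction can be sketched numerically. The excerpt does not reproduce the paper's exact formula, so the following is only one plausible reading of the notation: an ℓ2 norm over the basis coefficients of each (variable, boundary) pair, an ℓq norm over the boundaries of each variable, and an ℓ1 sum over variables. The function name `l1l2_lq_penalty` and the array layout `B[j, l, m]` (variable, boundary, basis coefficient) are assumptions introduced for illustration.

```python
import numpy as np

def l1l2_lq_penalty(B, q):
    """Hypothetical l1-l2/lq penalty for a coefficient array B.

    B has shape (p, K, M): p functional variables, K classification
    boundaries, M basis coefficients per (variable, boundary) pair.
    This layout is an assumption, not taken from the paper.
    """
    B = np.asarray(B, dtype=float)
    # l2 norm over basis coefficients -> one value per (variable, boundary)
    inner = np.sqrt((B ** 2).sum(axis=2))
    # lq norm over the K boundaries -> one value per variable
    per_variable = (inner ** q).sum(axis=1) ** (1.0 / q)
    # l1 sum over variables (all terms are non-negative)
    return per_variable.sum()

# One variable, two boundaries, two basis coefficients each.
B = np.array([[[3.0, 4.0], [6.0, 8.0]]])
pen_q1 = l1l2_lq_penalty(B, q=1)  # sum of the two group norms: 5 + 10
pen_q2 = l1l2_lq_penalty(B, q=2)  # norm of all four coefficients together
```

Under this reading, q = 1 penalizes each (variable, boundary) group separately, which matches the statement that boundaries are selected per variable, while q = 2 collapses to a single ℓ2 norm over all coefficients of a variable, which matches selecting whole variables.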
Conclusion
We have proposed a form of ℓ1-type penalties for constructing the functional multinomial logistic regression model. We derived the estimation and evaluation procedures for the model with the ℓ1ℓ2/ℓq penalty for q = 1, 2. The model was fitted by the penalized maximum likelihood method, and the regularization parameter involved in the model was selected by the model selection criterion. Monte Carlo simulations were conducted in order to investigate the effects on the accuracy of prediction and on variable selection. Results showed that, for the same types of ℓ2 norms, the ℓ1ℓ2/ℓ1-type penalty selected better classification boundaries and obtained a smaller test error than did the ℓ1ℓ2/ℓ2-type penalty. On the other hand, the ℓ1ℓ2/ℓ2-type penalty selected better functional variables by shrinking all of the parameters for each variable towards exactly zero. Therefore, if the classification boundaries are more important than the selection of variables themselves or the accuracy of the prediction, the ℓ1ℓ2/ℓ1 penalties are preferred. On the other hand, if the objective is to select variables for a classification problem, we should choose the ℓ1ℓ2/ℓ2 penalties. Furthermore, the norms proposed by Aguilera-Morillo et al. (2013) gave better results than did the ordinary ℓ2 norms or those of Gertheiss et al. (2013). These penalties were applied to the analysis of gene expression data, and we then investigated which types of time synchronization contributed to the classification of genes.

As described in the last paragraph of Section 4, there are occasional cases in which this method does not select classification boundaries appropriately. The solution for this will be a topic of future research. Furthermore, as one reviewer pointed out, there should be a penalty which judges the accuracy of both the variable selection and the prediction. One simple idea is to use an elastic net (Zou and Hastie, 2005) to incorporate the penalties in (6). A discussion of this and other better forms of penalties and estimation procedures is left as an area of future work. Future studies should also include other penalized approaches for using multiclass logistic regression modeling to select variables and boundaries for functional data.
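The contrast drawn in the conclusion, between zeroing individual classification boundaries and zeroing every parameter of a variable, can be illustrated with the standard group soft-thresholding operator, i.e. the proximal operator of a scaled ℓ2 norm. This is a minimal sketch, assuming that grouping alone drives the behavior. The excerpt does not say which optimization algorithm the paper uses, so the operator, the function name `group_soft_threshold`, and the toy numbers are illustrative assumptions only.

```python
import numpy as np

def group_soft_threshold(b, lam):
    """Shrink a coefficient group b toward zero as one block.

    Returns exactly zero when the group's l2 norm is at most lam;
    otherwise rescales the whole group by (1 - lam / norm).
    """
    b = np.asarray(b, dtype=float)
    norm = np.sqrt((b ** 2).sum())
    if norm <= lam:
        return np.zeros_like(b)
    return (1.0 - lam / norm) * b

# Coefficients of ONE variable: 2 boundaries (rows) x 2 basis terms (columns).
B_j = np.array([[3.0, 4.0],    # strong boundary, l2 norm 5
                [0.3, 0.4]])   # weak boundary, l2 norm 0.5
lam = 1.0

# Boundary-wise groups (q = 1 style): each row is thresholded on its own,
# so the weak boundary is removed while the strong one survives.
per_boundary = np.vstack([group_soft_threshold(row, lam) for row in B_j])

# Variable-wise group (q = 2 style): all coefficients share one norm,
# so the variable is kept or dropped as a whole.
whole_variable = group_soft_threshold(B_j, lam)
dropped_variable = group_soft_threshold(B_j, 6.0)  # larger lam removes it entirely
```

In this toy case the boundary-wise grouping zeroes only the weak boundary, whereas the variable-wise grouping keeps every coefficient until the threshold exceeds the norm of the whole block, at which point the variable disappears entirely.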