Logistic regression with weight grouping priors
Article code | Publication year | Length of English article |
---|---|---|
24988 | 2013 | 18 pages (PDF) |
Publisher: Elsevier - ScienceDirect
Journal: Computational Statistics & Data Analysis, Volume 64, August 2013, Pages 281–298
English abstract
A generalization of the commonly used Maximum Likelihood based learning algorithm for the logistic regression model is considered. It is well known that using the Laplace prior ($L_1$ penalty) on model coefficients leads to a variable selection effect, when most of the coefficients vanish. It is argued that variable selection is not always desirable; it is often better to group correlated variables together and assign equal weights to them. Two new kinds of a priori distributions over weights are investigated: Gaussian Extremal Mixture (GEM) and Laplacian Extremal Mixture (LEM) which enforce grouping of model coefficients in a manner analogous to $L_1$ and $L_2$ regularization. An efficient learning algorithm is presented, which simultaneously finds model weights and the hyperparameters of those priors. Examples are shown in the experimental part where the proposed a priori distributions outperform Gauss and Laplace priors as well as other methods which take coefficient grouping into account, such as the elastic net. Theoretical results on parameter shrinkage and sample complexity are also included.
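The link between priors and penalties mentioned in the abstract follows from the standard maximum a posteriori (MAP) argument, written out below. The symbols $w$ (weights), $\ell(w)$ (log-likelihood) and $\lambda$ (prior scale) are our own notation and are not taken from the paper.

```latex
% MAP estimation: maximise log-likelihood plus log-prior
\hat{w} = \arg\max_{w} \bigl[ \ell(w) + \log p(w) \bigr],
\qquad
\ell(w) = \sum_{i=1}^{n} \log p(y_i \mid x_i, w)

% Laplace prior  ->  L_1 penalty
p(w) \propto \prod_{j} \exp\bigl(-\lambda |w_j|\bigr)
\;\Longrightarrow\;
\hat{w} = \arg\min_{w} \bigl[ -\ell(w) + \lambda \|w\|_1 \bigr]

% Gaussian prior  ->  L_2 penalty
p(w) \propto \prod_{j} \exp\bigl(-\tfrac{\lambda}{2} w_j^2\bigr)
\;\Longrightarrow\;
\hat{w} = \arg\min_{w} \bigl[ -\ell(w) + \tfrac{\lambda}{2} \|w\|_2^2 \bigr]
```

The $L_1$ term is not differentiable at zero, which is why its minimiser can set coefficients exactly to zero, whereas the $L_2$ term only shrinks them.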
English introduction
The variable selection problem for linear models is considered one of the most important problems in statistical inference (Hesterberg et al., 2008). Recently, many new variable selection methods have become popular, including stagewise selection (Hastie et al., 2009) and $L_1$-regularization techniques such as Lasso (Williams, 1994; Tibshirani, 1996; Mkhadri and Ouhourane, 2013) and LARS (Efron et al., 1996). However, variable selection is not always the best possible approach. If the predictor variables are correlated, it is often more desirable to group correlated variables together and assign them equal or similar weights. A more detailed justification is given in Section 2, where it is argued that variable averaging may give much better results than variable selection. In order to achieve such averaging, we devise a supervised learning algorithm maximizing the log-likelihood criterion with suitably chosen priors on model weights. Our priors correspond to mixtures of Gaussian or Laplace distributions which force the weights to cluster around the means of the mixture components. The priors work analogously to $L_1$ and $L_2$ regularization: the prior based on the Gaussian distribution forces the weights to lie close to their group averages, and the prior based on the Laplace distribution forces most weights to be exactly equal to their group averages. As the resulting optimization problem is nonconvex, we present an algorithm, similar to the EM approach, consisting of two repeated steps: (1) maximization of the log-likelihood for the current assignment of variables to groups, and (2) re-assignment of variables with identical or similar weights to appropriate groups. Theoretical properties of the proposed method, such as parameter shrinkage and sample complexity, have also been analyzed.
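The two alternating steps described above can be illustrated with a small sketch. This is not the authors' algorithm and does not implement the GEM or LEM priors: it assumes a simplified Gaussian-style penalty `(lam / 2) * sum((w_j - centers[assign_j]) ** 2)` on an averaged log-likelihood, plain gradient ascent for step (1), and a nearest-centre reassignment with mean recentring for step (2). The function names, `lam`, `n_groups` and the synthetic data are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nearest_center(w, centers):
    # Index of the closest centre for every weight
    return np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)

def fit_grouped_logistic(X, y, n_groups=2, lam=10.0, n_outer=10, n_inner=100):
    n, p = X.shape
    # Upper bound on the curvature of the averaged negative log-likelihood
    L = 0.25 * np.linalg.eigvalsh(X.T @ X / n)[-1]
    w = np.zeros(p)
    for _ in range(n_inner):  # warm start without any grouping penalty
        w += (1.0 / L) * (X.T @ (y - sigmoid(X @ w)) / n)
    # Initial centres: evenly spaced quantiles of the warm-start weights
    centers = np.quantile(w, (np.arange(n_groups) + 0.5) / n_groups)
    assign = nearest_center(w, centers)
    step = 1.0 / (L + lam)
    for _ in range(n_outer):
        # Step 1: maximise the penalised log-likelihood, groups held fixed
        for _inner in range(n_inner):
            grad = X.T @ (y - sigmoid(X @ w)) / n - lam * (w - centers[assign])
            w += step * grad
        # Step 2: reassign weights to the nearest centre, then recentre
        assign = nearest_center(w, centers)
        for k in range(n_groups):
            if np.any(assign == k):
                centers[k] = w[assign == k].mean()
    return w, centers, assign

# Synthetic data (assumed): two latent signals, three noisy copies of each
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=(n, 2))
X = np.hstack([z[:, [0]] + 0.5 * rng.normal(size=(n, 3)),
               z[:, [1]] + 0.5 * rng.normal(size=(n, 3))])
y = (rng.random(n) < sigmoid(2.0 * (z[:, 0] - z[:, 1]))).astype(float)

w, centers, assign = fit_grouped_logistic(X, y)
print("weights:", np.round(w, 3))
print("centres:", np.round(centers, 3), "assignment:", assign)
```

Increasing `lam` in this sketch pulls each weight closer to its group centre, which mirrors the role the text gives to the Gaussian-based prior; an exact-equality effect like that of the Laplace-based prior would need a non-smooth penalty and a different optimiser.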
The advantages of coefficient grouping and averaging have, of course, already been recognized by researchers, and several methods which allow for weight grouping in regression models have been proposed. We will now review those approaches and explain the differences from the method proposed in this paper. Zou and Hastie (2005) introduce a method called elastic net, which combines $L_1$ and $L_2$ regularization. The $L_1$ term enforces variable selection, while the $L_2$ term introduces a ‘grouping effect’, thanks to which correlated variables tend to have similar coefficients. The grouping effect is, however, just a by-product of the regularization method used, and it is thus difficult to control its strength; typically it is not possible to enforce equal or approximately equal weights. In contrast, our method allows for direct control over the strength of the grouping effect (which we demonstrate theoretically) and coefficients of correlated variables can be forced to lie arbitrarily close to each other. Moreover, the prior based on the mixture of Laplace distributions allows for enforcing strict equality of most weights to their respective group averages. A technique called group Lasso has been described in Yuan and Lin (2004) and Kim et al. (2006) (and introduced earlier by Bakin, 1999), which extends Lasso by taking into account the group structure of variables. However, the groups need to be specified in advance and incorporated into the regularization term. Our method, on the other hand, groups variables automatically. Moreover, group Lasso does not allow for direct control over the relative sizes of weights within groups, while our approach gives the analyst precise control of the grouping behavior. If attributes are ordered in some natural way (e.g. in time series data), there is an interesting approach called fused lasso, where both large weight values and large differences between consecutive weights are penalized (Friedman et al., 2007; Tibshirani et al., 2005). The approach has been generalized to image data by requiring similar weights for variables corresponding to adjacent pixels. Our motivation is different, as we require whole groups of attributes to have similar weights, not just consecutive or adjacent ones.
The remainder of the paper is organized as follows: we present the motivation in Section 2, give a detailed description of the proposed method in Section 3, describe the optimization algorithm in Section 4, and analyze the approach theoretically and experimentally in Sections 5 and 6. Section 7 concludes.
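For orientation, the penalties of the three related methods reviewed in the introduction are commonly written as follows. These are the usual textbook forms in our own notation ($w$ for the weight vector, $w_{(g)}$ for the sub-vector of a predefined group $g$ of size $p_g$, and $\lambda$, $\lambda_1$, $\lambda_2$ for tuning parameters); the paper itself may state them differently.

```latex
% Elastic net: L_1 term for selection, L_2 term for the grouping effect
P_{\text{enet}}(w) = \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2

% Group lasso: groups g = 1, ..., G are fixed in advance
P_{\text{group}}(w) = \lambda \sum_{g=1}^{G} \sqrt{p_g}\, \|w_{(g)}\|_2

% Fused lasso: attributes are ordered; consecutive differences are penalised
P_{\text{fused}}(w) = \lambda_1 \sum_{j=1}^{p} |w_j|
                    + \lambda_2 \sum_{j=2}^{p} |w_j - w_{j-1}|
```

Written this way, the contrast drawn in the text is visible: none of these penalties contains learned group centres, and only the group lasso refers to groups at all, which it requires as input.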
English conclusion
In this paper we study learning algorithms for linear/logistic regression which perform automatic grouping of attributes. The main conclusion of the paper is that for datasets with correlated attributes (even with a large number of such attributes) attribute selection is often a less useful approach than attribute grouping (and averaging), which often leads to better learning results. Note that the attribute selection problem is a special case of the attribute grouping problem, since one of the group centers may be very close to zero (see e.g. Fig. 3). We present two a priori mixture distributions (GEM, LEM) over weights that induce the desired grouping of correlated attributes. This fact is demonstrated experimentally and by means of shrinkage theorems. An effective learning algorithm which finds the weights, the group centers, and the assignment of weights to centers is presented. Experiments have verified the high accuracy of the proposed approach, which usually exceeds that of other linear modeling methods. We note that for data with independent attributes our approach might give worse results than other methods. However, one should realize that truly independent attributes occur very rarely in practice, typically only in planned experiments. In real settings, correlations practically always exist, especially in large data sets. On the other hand, if a suitable regularization parameter is chosen via cross-validation, our approach (LEM or GEM) will be able to choose models (through small values of the regularization parameter) which do not induce unnecessary grouping. We have also demonstrated low sample complexity for a simplified variant of the LEM prior, which uses a fixed assignment of attributes to groups. The study of sample complexity without this simplification requires a more careful analysis and will be a subject of future research.
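The claim that averaging correlated attributes can beat selecting one of them has a simple numerical illustration. The sketch below is our own toy setup, not an experiment from the paper and not necessarily the argument of its Section 2: it assumes `k` attributes that are equally noisy, independent measurements of one latent signal, and compares how well a single selected attribute and the average of all attributes recover that signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20000, 5

# Latent signal and k noisy copies of it (unit noise variance, assumed)
signal = rng.normal(size=n)
X = signal[:, None] + rng.normal(size=(n, k))

selected = X[:, 0]          # "variable selection": keep one attribute
averaged = X.mean(axis=1)   # "variable averaging": equal weights 1/k

# Mean squared error of each summary with respect to the latent signal
mse_sel = np.mean((selected - signal) ** 2)
mse_avg = np.mean((averaged - signal) ** 2)

print(f"MSE, one selected attribute: {mse_sel:.3f}")
print(f"MSE, average of {k} attributes: {mse_avg:.3f}")
```

Under these assumptions the noise variance of the average is `1 / k` of that of a single attribute, so the averaged summary tracks the signal more closely. With independent attributes there is no shared signal to average, which is consistent with the caveat in the conclusion.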