Using the random subset method for prediction and assessing variable importance in linear regression
Article code | Publication year | Number of pages (English PDF) |
---|---|---|
24684 | 2014 | 18 pages |
Publisher : Elsevier - Science Direct
Journal : Computational Statistics & Data Analysis, Volume 71, March 2014, Pages 725–742
Abstract
A random subset method (RSM) with a new weighting scheme is proposed and investigated for linear regression with a large number of features. Weights of variables are defined as averages of the squared values of the pertaining t-statistics over fitted models with randomly chosen features. It is argued that such weighting is advisable as it incorporates two factors: a measure of the importance of the variable within the considered model and a measure of the goodness of fit of the model itself. The asymptotic weights assigned by such a scheme are determined, as well as assumptions under which the method leads to a consistent choice of the significant variables in the model. Numerical experiments indicate that the proposed method behaves promisingly when its prediction errors are compared with those of penalty-based methods such as the lasso, and that it has a much smaller false discovery rate than the other methods considered.
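The claim that the squared t-statistic combines a within-model importance measure with a goodness-of-fit measure of the model itself can be illustrated by a standard OLS identity. This is only a sketch of the reasoning; the paper's own decomposition (Eq. (3), referenced in the introduction) may be stated in a different but related form.

```latex
% For an OLS fit of a model m on n observations, where |m| counts all fitted
% parameters (including an intercept if one is present), the t-statistic t_i of
% a variable i in m satisfies the standard identity (t_i^2 equals the
% F-statistic for dropping i from m):
t_i^2 \;=\; \frac{\mathrm{RSS}_{m\setminus\{i\}} - \mathrm{RSS}_m}{\mathrm{RSS}_m/(n-|m|)}
      \;=\; (n-|m|)\,\frac{R_m^2 - R_{m\setminus\{i\}}^2}{1 - R_m^2}.
```

The numerator R²_m − R²_{m\{i}} measures the contribution of variable i within the model m, while the factor 1/(1 − R²_m) grows with the overall fit of m, which is consistent with the two-factor interpretation above.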
Introduction
The prediction problem with a high-dimensional feature space is one of the most challenging tasks of contemporary applied statistics. There is a growing number of domains nowadays that produce data with a large number of features, while the number of observations is limited. Examples include microarray datasets that measure gene activity, Quantitative Trait Loci (QTL) data, drug design datasets, high-resolution images and high-frequency financial data, among others. For examples and discussion, see e.g. Donoho (2000). An important and intensively studied line of research focuses on regularization, or penalty-based, methods (cf. e.g. Tibshirani, 1996; Zou and Hastie, 2005). Another important approach is dimensionality reduction based on the so-called sure independence screening proposed by Fan and Lv (2008). Recently, Bühlmann et al. (2010) have introduced a novel, computationally feasible method relying on a certain hierarchical testing algorithm. There are also approaches using information criteria modified for the high-dimensional setup; see e.g. Frommlet et al. (2012).

In this paper, we propose a different approach based on the random subset method (RSM). In the RSM a random subset m of features, having cardinality |m| smaller than the number M of potentially useful regressors, is chosen, and the problem is solved within the reduced feature space of the selected predictors. Features under consideration are assigned weights based on their performance in the constructed solution. The selection of a random subset of features and the model fitting are executed B times, and a cumulative weight of a feature is calculated based on its relevance in all models where it is used. The cumulative weights (or scores) thus correspond to the relative importance of the variables in the considered problem. The variables are then ordered according to the assigned weights. The ordering is essential in the construction of a final model, which can be based e.g. on a predetermined number of the most significant predictors, or obtained by a selection method applied to the hierarchical list of models given by the ordering. By choosing |m| much smaller than M the problem of overfitting is circumvented. Note that in the extreme case |m| = 1 the scores correspond to the individual performance of the variables. The procedure was proposed by Ho (1998) for classifying objects and independently by Breiman (2001) for the case when the considered prediction method was either a classification or a regression tree. Breiman's approach leads to the construction of a random forest. There, the score of a feature corresponds to the difference between the prediction errors averaged over trees which used this feature and its analogue for which the values of the variable are randomly permuted. For important developments, see also Lai et al. (2006) and Draminski et al. (2008).

The RSM belongs to the category of wrappers in the sense that feature selection is 'wrapped around' the building of a prediction method, i.e. it is an inherent part of its construction. Here, all variables are first ranked based on their averaged performance in small fitted models, and then variable selection is performed for the ensuing hierarchical family of models with the use of cross-validation or an independent test sample.
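A minimal, self-contained sketch of the weighting loop just described may help fix ideas. It uses numpy only; the function and variable names (rsm_weights and so on) are mine, and the paper's exact conventions for intercepts, degrees of freedom and weight normalisation may differ.

```python
import numpy as np

def rsm_weights(X, y, m_size, B, rng=None):
    """RSM-style weights: average squared t-statistics of each variable over
    B small OLS models fitted on randomly chosen feature subsets."""
    rng = np.random.default_rng(rng)
    n, M = X.shape
    weight_sum = np.zeros(M)   # accumulated squared t-statistics per variable
    counts = np.zeros(M)       # number of drawn subsets containing each variable
    for _ in range(B):
        m = rng.choice(M, size=m_size, replace=False)   # random subset of features
        Xm = X[:, m]
        beta, _, _, _ = np.linalg.lstsq(Xm, y, rcond=None)
        resid = y - Xm @ beta
        sigma2 = resid @ resid / (n - m_size)            # residual variance estimate
        cov = sigma2 * np.linalg.inv(Xm.T @ Xm)          # covariance of OLS estimates
        t2 = beta**2 / np.diag(cov)                      # squared t-statistics
        weight_sum[m] += t2
        counts[m] += 1
    return weight_sum / np.maximum(counts, 1)            # average over models using each variable

# Tiny demo on synthetic data generated from a sparse linear model.
rng = np.random.default_rng(0)
n, M = 100, 300                        # more candidate variables than observations
X = rng.standard_normal((n, M))
beta = np.zeros(M); beta[:5] = 2.0     # only the first five variables are relevant
y = X @ beta + rng.standard_normal(n)
w = rsm_weights(X, y, m_size=10, B=500, rng=1)
print(np.argsort(w)[::-1][:10])        # the relevant variables should typically rank near the top
```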
A different group of methods, called filters, comprises those for which the feature selection is not related to the construction of a prediction tool. Such methods can perform ranking and variable selection simultaneously; for a representative example see e.g. Stoppiglia et al. (2003). It should be stressed, however, that since the ranking of variables in the RSM is based on fitting small linear models, the method does not impose any conditions on the number of candidate variables M. In classification and regression it has proved to be an effective way to avoid the pitfall of the curse of dimensionality in situations when the number of features M is comparable to, or even significantly larger than, the sample size n.

A related problem, which is also addressed by the RSM, is the assignment of weights to the variables in such a way that their magnitudes correspond to the variables' usefulness in prediction. Note that this problem is different from the explanation problem, in which we try to determine the significant variables in the 'true' model, that is, a model which fits the data well. A variable may be important for prediction although it does not belong to the set of significant features, even in the ideal case when the data conform to a certain model such as the linear model (1). The problem of assigning scores to features which reflect their importance in prediction when the number of features is small compared to the sample size is also an important line of research; cf. Grömping (2007) for a comprehensive review. Here, an important development is the method proposed by Lindemann, Merenda and Gold (lmg); see Lindemann et al. (1980) and also Chevan and Sutherland (1991). In this approach, the score of a variable x equals the average over all permutations r of ΔR²_{x,r}, where ΔR²_{x,r} is the increase of the coefficient of determination R² due to adding x to the list of active variables ordered by r (see Section 4 for a formal definition). The method was further developed by Feldman (1999), who considered data-dependent weights with magnitudes corresponding to the goodness of fit of the ordering. Note, however, that for large M both approaches become extremely computationally intensive, and they break down in the case of linear model fitting when M is larger than n. We also mention the weight assignment method based on Multivariate Adaptive Regression Splines (MARS) discussed in Section 4.2.
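Returning to the lmg measure described above: for a small number of regressors it can be computed directly by averaging the R² increments over all orderings. The following is only an illustrative sketch of that idea (exhaustive over permutations, hence feasible only for small M; function names are mine, not the authors' or any particular package's implementation).

```python
import itertools
import numpy as np

def r2(X, y, cols):
    """Coefficient of determination of an OLS fit (with intercept) on the given columns."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, _, _, _ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - resid @ resid / tss

def lmg_scores(X, y):
    """lmg-style relative importance: for each variable x, average over all
    orderings r the increase in R^2 when x is added after its predecessors in r."""
    M = X.shape[1]
    scores = np.zeros(M)
    perms = list(itertools.permutations(range(M)))
    for r in perms:
        active, prev = [], 0.0
        for x in r:
            cur = r2(X, y, active + [x])
            scores[x] += cur - prev        # Delta R^2 due to adding x in this ordering
            active.append(x)
            prev = cur
    return scores / len(perms)             # the scores sum to the full-model R^2

# Small example: four correlated regressors, two of them truly relevant.
rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 4))
X[:, 1] += 0.8 * X[:, 0]                   # induce dependence between regressors
y = 1.5 * X[:, 0] + 1.0 * X[:, 2] + rng.standard_normal(n)
print(lmg_scores(X, y))
```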
The aim of the paper is twofold. First, for linear regression we introduce a new scheme of assigning scores to variables. In our approach the variables in a randomly chosen subset are assigned weights equal to the squared values of the respective t-statistics in the pertaining fitted model. We argue in Section 2 that this is an intuitively sound choice of weights, as Eq. (3) indicates that the square of the t-statistic is a product of two factors, one of which corresponds to the importance of the variable within the model and the other to the importance of the model itself. Second, we investigate models based on the features ordered according to the proposed weighting and study their prediction strength by means of simulations. We establish some formal properties of the proposed scheme: namely, we determine the form of the asymptotic ordering of the variables when the subset is fixed (Theorem 1), and we establish the asymptotic form of the weights assigned by the RSM (Theorems 2 and 3). In the case of a fixed subset m and random regressors, the ordering is asymptotically equivalent to that given by the multiple correlation coefficients of y and the variables in m with a single consecutive variable dropped. This extends a result of Zheng and Loh (1995), who have shown that when m contains all relevant variables the obtained ordering is such that the relevant variables precede all spurious ones.

The prediction accuracy of the RSM-based approach is compared with that of the lasso by means of numerical experiments, and its performance appears to be promising, especially when there are many potential strongly dependent regressors. It turns out that in the considered examples the false discovery rate for the RSM is much smaller than for the other methods, whereas the positive selection rate for all of them is comparable and close to 1. We also compared the pertaining method of weight assignment with Breiman's measures and a weight assignment based on MARS.

The paper is structured as follows. Properties of the t-based ordering for a fixed subset of regressors are studied in Section 2, along with examples in which explicit conditions are given for it to be the correct one. Section 3 introduces the random subspace method and states the results for the considered weighting scheme, and Section 4 summarizes the outcomes of the numerical experiments. Proofs of the results are relegated to the Appendix.

We now define the formal setup of the paper. Assume that the observed data have the form (Y, X), where Y = Y_n is an n × 1 vector of n responses whose variability we would like to explain, and X = X_n is an n × M design matrix consisting of vectors of M potential regressors collected from n objects. Responses are related to regressors by means of the linear model

Y = Xβ + ε,    (1)

where ε = (ε₁, …, ε_n)′ is an unobservable vector of errors, assumed to have an N(0, σ²I) distribution, and β = (β₁, …, β_M)′ is an unknown vector of parameters. We consider two scenarios: the case of a deterministic X and that of a random X. In the latter case the rows of X_n constitute n independent realizations of an M-dimensional random variable x, and the vector Y consists of n realizations of the random variable y = x′β + ε. The distribution of x = (x₁, …, x_M)′ may be arbitrary; in particular the distribution of its first coordinate may be a point mass at 1, corresponding to the linear model with an intercept included. The number of potential predictors M = M_n may depend on n. Our results concern either the case when M remains fixed (Theorem 2 and Corollary 3) or when it changes with the sample size (Theorems 1 and 3 and Corollary 4). In particular, M may be much larger than n; compare e.g. condition (ii) of Theorem 3, where it is only assumed that log M_n = o(n). Suppose that some covariates are unrelated to the prediction of Y, so that the corresponding coefficients β_i are zero. A model containing all relevant variables, i.e. those pertaining to nonzero β_i, will be called a true model. The minimal true model, i.e. the one containing only the relevant variables, will be denoted by t, and |t| will be the number of nonzero coefficients. It is assumed that t ⊂ {1, 2, …, M} and that t does not change with n.
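As noted earlier in the introduction, once the variables are ordered a final model can be chosen from the resulting hierarchical (nested) family, e.g. by minimising the prediction error on an independent validation sample. A minimal self-contained sketch of that selection step follows; here the ordering is produced by per-variable regression scores (the |m| = 1 extreme case), purely as a simple stand-in for the RSM ordering, and all names are mine.

```python
import numpy as np

def fit_ols(X, y):
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def select_from_ordering(X_tr, y_tr, X_val, y_val, order, max_size):
    """Walk the nested family {order[:1], order[:2], ...} and keep the prefix
    with the smallest validation mean squared error."""
    best_k, best_err = 1, np.inf
    for k in range(1, max_size + 1):
        cols = order[:k]
        beta = fit_ols(X_tr[:, cols], y_tr)
        err = np.mean((y_val - X_val[:, cols] @ beta) ** 2)
        if err < best_err:
            best_k, best_err = k, err
    return order[:best_k], best_err

# Synthetic data following the linear model Y = X beta + eps with sparse beta.
rng = np.random.default_rng(0)
n, M = 200, 500
X = rng.standard_normal((n, M))
beta = np.zeros(M); beta[[3, 17, 42]] = [2.0, -1.5, 1.0]   # the 'true' model t
y = X @ beta + rng.standard_normal(n)

X_tr, X_val = X[:150], X[150:]          # training / validation split
y_tr, y_val = y[:150], y[150:]

# Stand-in ordering: per-variable regression scores (x_j'y)^2 / (x_j'x_j),
# i.e. the individual performance of each variable (the |m| = 1 case).
marg = (X_tr.T @ y_tr) ** 2 / np.sum(X_tr ** 2, axis=0)
order = np.argsort(marg)[::-1]

selected, err = select_from_ordering(X_tr, y_tr, X_val, y_val, order, max_size=20)
print(sorted(selected.tolist()), err)
```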
Conclusion
We proposed the random subset method with a new weighting scheme, which leads to a novel linear model selection method. It is investigated theoretically and by means of numerical experiments for linear models with a large number of features. We also examined its performance on several real datasets. We have shown in Theorems 2 and 3 that the weight attributed to a variable is asymptotically equivalent to the relative increment of the mean squared error of prediction when the variable is omitted from a model m of fixed size, averaged over all models m containing it. Numerical experiments for synthetic data sets generated from linear models indicate that the RSM usually works at least on par with the lasso and is frequently superior to it, especially when the dependence between predictors is strong or the number of true model variables is small relative to the number of potential regressors. Moreover, interestingly, the RSM has a smaller false discovery rate than the lasso. Similar observations hold for the real datasets. Here we also compared both methods to random forests. For some real datasets, such as Mtp2, its extension Mtp3 and some of the benchmarks considered, random forests yield smaller prediction errors than the RSM, but at the price of including many more variables in the model. Although the RSM is much more computationally intensive than the lasso, it is not prohibitively so. Its weighted variant WRSM, described in Section 4, is less computationally intensive without losing the positive features of the original method. Also, numerical experiments for synthetic datasets show that the weights of variables pertaining to the RSM give a clear indication of the variables contributing to the model from which the data are generated.
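For completeness, the two selection quality measures referred to above can be computed as follows. This is a sketch using one common convention (false discovery rate as the fraction of selected variables outside the true model, positive selection rate as the fraction of true-model variables that are selected); the authors' exact handling of edge cases may differ.

```python
def fdr_psr(selected, true_model):
    """False discovery rate and positive selection rate of a selected variable set."""
    selected, true_model = set(selected), set(true_model)
    false_pos = len(selected - true_model)
    true_pos = len(selected & true_model)
    fdr = false_pos / max(len(selected), 1)    # fraction of spurious variables among selected
    psr = true_pos / max(len(true_model), 1)   # fraction of true-model variables recovered
    return fdr, psr

# Example: true model {3, 17, 42}, selected set {3, 17, 42, 99} -> (0.25, 1.0)
print(fdr_psr({3, 17, 42, 99}, {3, 17, 42}))
```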