A multivariate linear regression analysis using finite mixtures of t distributions
|Article code||Publication year||English article||Word count|
|24680||2014||13-page PDF||9738 words|
Publisher: Elsevier - Science Direct
Journal: Computational Statistics & Data Analysis, Volume 71, March 2014, Pages 138–150
Recently, finite mixture models have been used to model the distribution of the error terms in multivariate linear regression analysis. In particular, Gaussian mixture models have been employed. A novel approach that assumes that the error terms follow a finite mixture of $t$ distributions is introduced. This assumption allows for an extension of multivariate linear regression models, making these models more versatile and robust against the presence of outliers in the error term distribution. The issues of model identifiability and maximum likelihood estimation are addressed. In particular, identifiability conditions are provided and an Expectation–Maximisation algorithm for estimating the model parameters is developed. Properties of the estimators of the regression coefficients are evaluated through Monte Carlo experiments and compared to the estimators from the Gaussian mixture models. Results from the analysis of two real datasets are presented.
Linear regression analysis (see, e.g., Srivastava, 2002) is a technique that allows for the study of the dependence of $D$ responses $\mathbf{Y}=(Y_1,\ldots,Y_d,\ldots,Y_D)'$ on $P$ regressors $(X_1,\ldots,X_p,\ldots,X_P)'$, where $D\geq 1$ and $P\geq 1$. Linear regression is based on the following statistical model:

$$\mathbf{Y}_i=\boldsymbol{\beta}_0+\mathbf{B}'\mathbf{x}_i+\boldsymbol{\epsilon}_i, \tag{1}$$

where the symbol $i$ is used to denote a sample unit; $\mathbf{Y}_i=(Y_{i1},\ldots,Y_{id},\ldots,Y_{iD})'$ and $\mathbf{x}_i=(x_{i1},\ldots,x_{ip},\ldots,x_{iP})'$ are the $D$-dimensional vector of the response variables and the $P$-dimensional vector of the fixed regressor values for the $i$th unit, respectively; $\boldsymbol{\beta}_0$ is a $D$-dimensional vector containing the intercepts for the $D$ responses; $\mathbf{B}$ is a matrix of dimension $P\times D$ whose $(p,d)$th element, $\beta_{pd}$, is the regression coefficient of the $p$th regressor on the $d$th response; finally, $\boldsymbol{\epsilon}_i$ denotes the $D$-dimensional random vector of the error terms corresponding to the $i$th unit. In the classical linear regression model, it is additionally assumed that $\boldsymbol{\epsilon}_i$, $i=1,\ldots,I$, are independent and identically distributed random vectors with a Gaussian distribution with a $D$-dimensional zero mean vector and a positive definite covariance matrix $\boldsymbol{\Sigma}$ of dimension $D\times D$:

$$\boldsymbol{\epsilon}_i\sim N_D(\mathbf{0},\boldsymbol{\Sigma}). \tag{2}$$

Many extensions of this classic model have been proposed to broaden the applicability of linear regression analysis to situations where the Gaussian error term assumption may be inadequate, for example, because of outlying values in the responses or datasets involving errors with longer than normal tails. Some such extensions rely on the use of the $t$ distribution (see, e.g., Lange et al., 1989, Sutradhar and Ali, 1986 and Zellner, 1976).
In particular, a linear regression analysis has been developed by replacing (2) with the assumption

$$\boldsymbol{\epsilon}_i\sim t_D(\mathbf{0},\boldsymbol{\Sigma},\nu), \tag{3}$$

where $t_D(\boldsymbol{\mu},\boldsymbol{\Sigma},\nu)$ denotes the $D$-dimensional $t$ distribution with location parameter $\boldsymbol{\mu}\in\mathbb{R}^{D}$, dispersion matrix $\boldsymbol{\Sigma}\in\mathbb{R}_{\Sigma}^{D}$ and degrees of freedom $\nu\in\mathbb{R}^{+}$, where $\mathbb{R}_{\Sigma}^{D}$ is the set of all positive definite matrices in $\mathbb{R}^{D\times D}$. However, in practice, when nothing is known about the true distribution of the error terms, a linear regression analysis based on any of the above models may be performed using an incorrectly specified model. Furthermore, there may be situations where a single parametric family is unable to provide a satisfactory model for local variations in the observed data. To overcome these problems, solutions that use finite mixture models have been recently proposed. Namely, Bartolucci and Scaccia (2005) and Soffritti and Galimberti (2011) have developed methods for linear regression analysis by assuming a finite mixture of Gaussian components for the error terms. More specifically, in the linear regression model obtained using this approach, the assumption (2) is replaced with

$$\boldsymbol{\epsilon}_i\sim\sum_{k=1}^{K}\pi_k N_D(\boldsymbol{\delta}_k,\boldsymbol{\Sigma}_k), \tag{4}$$

where the $\pi_k$'s are positive weights that sum to 1, the $\boldsymbol{\delta}_k$'s are $D$-dimensional mean vectors that satisfy the constraint $\sum_{k=1}^{K}\pi_k\boldsymbol{\delta}_k=\mathbf{0}$ and the $\boldsymbol{\Sigma}_k$'s are positive definite covariance matrices. In this paper, we extend this approach by assuming that the distribution of each component belongs to the class of $t$ distributions. The rationale of such an approach is that quite complex distributions can be modelled through a finite mixture model, and thus, a more flexible modelling of the unknown error distribution of a linear regression model can be obtained.
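As a rough illustration of the error assumptions discussed above, the following Python sketch evaluates the density of a univariate ($D=1$) $t$ distribution and draws error terms from a two-component mixture of $t$ distributions. It assumes that the extended model keeps the form of the Gaussian mixture assumption with each Gaussian component replaced by a $t$ component that has its own degrees of freedom, and that the zero constraint on the weighted locations carries over. The function names and all numerical values are hypothetical and chosen only for illustration.

```python
import math
import random


def t_logpdf(y, mu, sigma2, nu):
    """Log-density of the univariate t distribution with location mu,
    squared scale sigma2 and nu degrees of freedom."""
    d2 = (y - mu) ** 2 / sigma2  # squared Mahalanobis distance for D = 1
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(math.pi * nu * sigma2)
            - 0.5 * (nu + 1) * math.log1p(d2 / nu))


def mixture_pdf(y, weights, locs, scales2, dfs):
    """Density of a finite mixture of univariate t components."""
    return sum(w * math.exp(t_logpdf(y, m, s2, nu))
               for w, m, s2, nu in zip(weights, locs, scales2, dfs))


def sample_t(mu, sigma2, nu, rng):
    """Draw from the t distribution as a normal variate divided by the
    square root of an independent chi-square variate over nu."""
    w = rng.gammavariate(nu / 2.0, 2.0)  # chi-square with nu degrees of freedom
    return mu + math.sqrt(sigma2) * rng.gauss(0.0, 1.0) / math.sqrt(w / nu)


def sample_mixture(n, weights, locs, scales2, dfs, rng):
    """Draw n error terms: pick a component, then draw from its t distribution."""
    draws = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        draws.append(sample_t(locs[k], scales2[k], dfs[k], rng))
    return draws


# Hypothetical parameter values; the locations satisfy sum_k pi_k * delta_k = 0.
weights = [0.7, 0.3]
locs = [-0.9, 2.1]
scales2 = [1.0, 0.5]
dfs = [5.0, 10.0]

rng = random.Random(1)
errors = sample_mixture(20000, weights, locs, scales2, dfs, rng)
print("weighted location:", sum(w * m for w, m in zip(weights, locs)))
print("sample mean of simulated errors:", sum(errors) / len(errors))
print("mixture density at 0:", mixture_pdf(0.0, weights, locs, scales2, dfs))
```

Because every degrees-of-freedom value used here exceeds 1, each component mean equals its location, so the simulated errors should average close to zero.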
In addition, using a finite mixture model makes it possible to capture the effect of omitting relevant nominal regressors from the model. In this case, the source of unobserved heterogeneity introduced in the model will affect the error terms, whose distribution will be a mixture of $K$ components, where $K$ equals the number of categories obtained from the cross classification of the omitted nominal regressors. Thus, an approach based on the finite mixture model should detect the presence of such unobserved heterogeneity in the linear regression model. The model obtained under this new assumption may be particularly suitable whenever the tails of the distribution of the error terms in each component of the mixture model are heavier than those of the Gaussian distribution (Peel and McLachlan, 2000); furthermore, this model protects against the presence of outlying residuals.

The remainder of the paper is organised as follows. Section 2 provides the details of this novel class of models. In Section 2.1, we describe the multivariate linear regression model in which the error term distribution is a finite mixture of $t$ distributions; model identifiability and maximum likelihood (ML) estimation using an Expectation–Maximisation (EM) algorithm are addressed in Section 2.2 (proofs of some results are provided in Appendix A and Appendix B). In Section 3, we present the results of Monte Carlo experiments, which provide numerical evaluations of the main properties of the estimators of the model regression coefficients. In Section 4, we report results obtained by applying the proposed methodology and other existing methods to two real datasets. Properties concerning the $t$ distribution that are used in this paper are summarised in Appendix C.
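The remark about omitted nominal regressors can be checked with a small simulation. The Python sketch below is only an illustration under assumed settings: one response, one continuous regressor, one binary nominal regressor that is left out of the fitted model, and Gaussian noise. All names and numerical values are hypothetical. The residuals of the misspecified fit then behave like a two-component mixture whose weighted component locations sum to zero.

```python
import random


def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for a single regressor."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def mean(values):
    return sum(values) / len(values)


rng = random.Random(42)
n, beta0, beta1, gamma = 4000, 1.0, 2.0, 3.0  # hypothetical true values

xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
groups = [1 if rng.random() < 0.4 else 0 for _ in range(n)]  # omitted regressor
ys = [beta0 + beta1 * x + gamma * g + rng.gauss(0.0, 0.5)
      for x, g in zip(xs, groups)]

# Fit the model WITHOUT the nominal regressor.
b0_hat, b1_hat = fit_simple_ols(xs, ys)
residuals = [y - b0_hat - b1_hat * x for x, y in zip(xs, ys)]

# Residual locations within each (unobserved) category.
share = mean(groups)
loc0 = mean([r for r, g in zip(residuals, groups) if g == 0])
loc1 = mean([r for r, g in zip(residuals, groups) if g == 1])

print("component weights:", 1 - share, share)
print("component locations:", loc0, loc1)
print("weighted sum of locations:", (1 - share) * loc0 + share * loc1)
```

Here the number of mixture components equals the number of categories of the omitted regressor (two), and the gap between the two residual locations is close to the effect `gamma` of that regressor.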
English conclusion
In this paper, a novel approach to multivariate linear regression analysis has been developed based on the use of finite mixtures of $t$ components. This approach includes some previously proposed solutions, namely, the classical linear models in which the error terms are assumed to follow a Gaussian or a $t$ distribution and the models using a finite mixture of Gaussian components. In a sense, each of these models is broadened by the proposed approach because our approach not only detects and captures the effect of the relevant nominal regressors omitted from the model but additionally provides better estimates of the regression coefficients when the distribution of the error terms is characterised by the presence of outlying observations and/or heavy tails. Furthermore, the experimental results obtained from the analysis of two real datasets provide support for the usefulness and effectiveness of our proposal. The proposed approach can be made more flexible and versatile by defining a broader class of multivariate linear regression models, obtained by parameterising the matrices $\boldsymbol{\Sigma}_k$ in terms of their eigenvalues and eigenvectors and by constraining some of those parameters to be the same $\forall k$ (see Banfield and Raftery, 1993 and Celeux and Govaert, 1995, for the use of the eigenvalue decomposition of component-covariance matrices in Gaussian model-based cluster analysis). Similarly, parsimonious models can be obtained by exploiting factor-analytic dimensionality reduction methods, such as the ones described in McLachlan et al. (2007) and Andrews and McNicholas (2011). The linear regression models obtained this way could provide a good fit for some datasets by using a lower number of parameters. Further developments of the methods proposed in this paper could be obtained by studying the asymptotic properties of the ML estimators and by defining statistical tests for evaluating the significance of the regression coefficients.
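For reference, the eigenvalue parameterisation of the component dispersion matrices mentioned above is commonly written as follows in the Gaussian model-based clustering literature. The symbols $\lambda_k$, $\boldsymbol{\Gamma}_k$ and $\mathbf{A}_k$ are notation assumed here for illustration and are not taken from the paper:

$$\boldsymbol{\Sigma}_k=\lambda_k\,\boldsymbol{\Gamma}_k\,\mathbf{A}_k\,\boldsymbol{\Gamma}_k', \qquad k=1,\ldots,K,$$

where $\lambda_k=|\boldsymbol{\Sigma}_k|^{1/D}$ is a positive scalar, $\boldsymbol{\Gamma}_k$ is the orthogonal matrix of eigenvectors of $\boldsymbol{\Sigma}_k$, and $\mathbf{A}_k$ is a diagonal matrix with $|\mathbf{A}_k|=1$ whose entries are proportional to the eigenvalues. Constraining some of these terms to be equal across components, for example $\lambda_k=\lambda$ or $\mathbf{A}_k=\mathbf{A}$ for all $k$, reduces the number of free parameters.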
The inclusion of suitable constraints on the model parameters in the EM algorithm could allow researchers to overcome the problems related to the possible presence of an unbounded likelihood function (see, e.g., Fernandez and Steel, 1999, Greselin and Ingrassia, 2010 and Seo and Kim, 2012). Models in which some regressors are not relevant for some responses and an estimation procedure suitable for such models could improve the usefulness of the proposed approach. Another future development is the selection of (possibly different) subsets of regressors for the $D$ responses. We are currently developing methods for multivariate linear regression analysis by modelling the distribution of the error terms using finite mixtures of non-elliptical and possibly skewed components, such as mixtures of normal inverse Gaussian distributions (Karlis and Santourian, 2009) or, more generally, components belonging to the family of generalised hyperbolic distributions (see, e.g., Paolella, 2007). Although the computational aspects of fitting the proposed models were not the main focus of this paper, some issues related to the implementation of the EM algorithm described in Section 2.2 deserve further investigation. In particular, evaluating the effects of different initialisation strategies (see, e.g., Biernacki et al., 2003 and Melnykov and Melnykov, 2012), the use of acceleration methods (see, e.g., Berlinet and Roland, 2012) and strategies for efficiently targeting the M steps (see, e.g., O’Hagan et al., 2012) could be interesting. Finally, to take into account the possible existence of plateaux in the log-likelihood function, stopping rules based on suitable convergence criteria could be introduced (see, e.g., McNicholas et al., 2010 and Seo and Kim, 2012).
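To make the last point concrete, one family of stopping rules often used with EM algorithms is based on Aitken's acceleration, which extrapolates the limit of the log-likelihood sequence instead of looking only at the latest increase. The Python sketch below is a minimal illustration of that idea. Whether the cited works use exactly this form is an assumption, and the function name, tolerance and example sequences are hypothetical.

```python
def aitken_converged(loglik, tol=1e-6):
    """Aitken-based stopping check using the last three log-likelihood values.

    Returns (converged, l_inf), where l_inf is the extrapolated limit of the
    sequence, or None when it cannot be computed.
    """
    if len(loglik) < 3:
        return False, None
    l_prev, l_curr, l_next = loglik[-3], loglik[-2], loglik[-1]
    denom = l_curr - l_prev
    if denom == 0.0:
        # No change in the previous step: converged only if still unchanged.
        return l_next == l_curr, l_next
    a = (l_next - l_curr) / denom  # estimated linear rate of convergence
    if a >= 1.0:
        return False, None  # increments are not shrinking yet
    l_inf = l_curr + (l_next - l_curr) / (1.0 - a)
    return abs(l_inf - l_next) < tol, l_inf


# Fast geometric convergence towards -100: the rule eventually stops.
fast = [-100.0 - 5.0 * 0.5 ** k for k in range(31)]
print(aitken_converged(fast))

# Plateau: each increase is below 1e-6, yet the extrapolated limit is still
# about 1e-3 away, so the rule does not stop.
plateau = [-100.0 - 0.001 * 0.999 ** k for k in range(3)]
print(aitken_converged(plateau))
```

A rule that stops as soon as the increase in the log-likelihood falls below the tolerance would halt on the plateau sequence, whereas the extrapolated criterion keeps iterating.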