Robust clusterwise linear regression through trimming
Article code | Publication year | Length of the English article |
---|---|---|
24314 | 2010 | 13 pages (PDF) |
Publisher: Elsevier - ScienceDirect
Journal: Computational Statistics & Data Analysis, Volume 54, Issue 12, 1 December 2010, Pages 3057–3069
English abstract
The presence of clusters in a data set is sometimes due to the existence of certain relations among the measured variables which vary depending on some hidden factors. In these cases, observations can be grouped in a natural way around linear and nonlinear structures and, thus, the problem of robust clustering around linear affine subspaces has recently been tackled through the minimization of a trimmed sum of orthogonal residuals. This “orthogonal approach” implies that no privileged variable plays the role of response variable or output. However, there are problems where one variable clearly has to be explained in terms of the other ones, and the use of vertical residuals from classical linear regression seems more advisable. The so-called TCLUST methodology is extended to perform robust clusterwise linear regression, and a feasible algorithm for its practical implementation is proposed. The algorithm includes a “second trimming” step aimed at diminishing the effect of leverage points.
English introduction
Most non-hierarchical clustering methods are based on the idea of forming clusters around “objects”. These objects are geometrical structures that represent the typical behavior of the data belonging to each group. The search for these objects is often done by minimizing a criterion based on the distances of data points to their closest objects. When data come from a population model, the objects become population features that are worth understanding and estimating. The first kind of objects considered in the Cluster Analysis literature were “centers”, which gave rise to the well-known $k$-means method (McQueen, 1967 and Hartigan and Wong, 1979) as well as to its robust version, the trimmed $k$-means method (Cuesta-Albertos et al., 1997).

However, the presence of clusters in a data set is sometimes due to the existence of certain relations among the measured variables which may differ depending on some unknown or hidden factors. Thus, observations can be grouped in a natural way around more complex objects which can take the form of linear or nonlinear structures. Clustering around linear affine subspaces and more general manifolds is receiving a lot of attention in the literature. This is partly motivated by interesting fields of application like computer vision, pattern recognition, tomography, fault detection, etc., where this type of cluster usually appears. A common feature of all these application fields is that noisy data frequently appear. Therefore, protection against outliers is a desirable property for any designed procedure. García-Escudero et al. (2009) contains a large collection of references that reflect the state of the art on clustering around linear structures. That paper also presented a new approach aimed at performing robust clustering around linear subspaces.
The approach is based on the minimization of a trimmed sum of orthogonal residuals and it may be seen as a robust version of the Linear Grouping Algorithm introduced by Van Aelst et al. (2006). It may also be viewed as a robust extension of classical Principal Components Analysis (PCA) to the Cluster Analysis setup (see Serneels and Verdonck (2008) for a recent review of robust PCA methods). The “orthogonal” choice for the residuals implies that there is no privileged variable to be used as a response or output variable. However, there are situations where the interest clearly rests on explaining one variable in terms of the other ones, and the use of classical linear regression residuals seems more advisable. This would allow for addressing classical tasks in Regression Analysis such as understanding the role of each explanatory variable, forecasting the response variable, validating the model through the analysis of residuals, and so on. We call this different approach “clusterwise linear regression”.

Several approaches related to clusterwise linear regression can be found in the literature. Hosmer (1974) and Lenstra et al. (1982) are examples of early references devoted to the “two regression lines” case. In the econometric and chemometric settings we can find the “switching regression model”, which was introduced from a maximum likelihood point of view in Goldfeld and Quandt (1976) and has later also been addressed through a Bayesian approach (e.g., Hurn et al. (2003)). In the Machine Learning literature, we can also find the “multiple model estimation” approach (see Cherkassky and Ma (2005), and references therein). In these two approaches, the emphasis is mainly put on aspects related to the fit of the model rather than on clustering aspects. Another interesting reference is Hennig (2003), where a “fixed point” approach was analyzed. Finally, Neykov et al. (2007) recently introduced a mixture fitting approach based on trimmed likelihood that will be discussed later.

In this paper we extend the TCLUST methodology in García-Escudero et al. (2008) to the context of robust clusterwise linear regression. The TCLUST methodology was introduced there to perform robust clustering assuming multivariate normal clusters with different sizes and different covariance matrices. The approach relies on modeling the data through an adaptation of the “spurious-outliers model” (Gallegos and Ritter, 2005). In the extension proposed here, the flexibility of the TCLUST methodology allows for different scatters for the regression errors together with different group weights. A constraint on the error term scatters is also needed in order to avoid singularities in the objective function defining the problem. Apart from dealing with a clusterwise regression problem, the main difference from the previously developed TCLUST methodology lies in the fact that $k$ scatter parameters are now controlled instead of the eigenvalues of $k$ covariance matrices. The possibility of a further “second” trimming is now also considered.

The structure of the paper is as follows. Section 2 presents the proposed methodology and explains the role of all its ingredients. An algorithm inspired by this methodology is outlined in that section, and a detailed description of the algorithm can be found in the Appendix. The importance of imposing a scatter similarity constraint is discussed in Section 3. An additional “second” trimming step is presented in Section 4. This step is designed to improve the protection against outliers in $x$, which is an interesting feature in Regression Analysis. That further trimming makes it possible to diminish the effect of some leverage points that (although they do not break down the procedure) could entail important biases in the determination of the underlying linear structures.
This second trimming is also appropriate for avoiding classification errors that sometimes occur due to the artificial elongation of the influence zones of linear clusters. A simulation study is carried out in Section 5. Some applications based on real data sets are developed in Section 6. These examples illustrate the usefulness of the presented methodology for allometric studies in Biology. Finally, some conclusions and further directions are presented in Section 7.
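The ingredients named in the introduction (vertical residuals, trimming, group weights and a bound on the scatter ratio) can be illustrated with a minimal Python sketch. It is not the algorithm from the paper's Appendix, which is not reproduced on this page. The function name `concentration_step`, the normal error model, the bound on the ratio of error variances and the simple variance clamp are all assumptions made for illustration.

```python
import numpy as np

def concentration_step(X, y, betas, sigmas, weights, alpha, c=10.0):
    """One trimmed assign-and-refit step (illustrative sketch, hypothetical helper).

    X: (n, p) design matrix including an intercept column.
    betas: (k, p), sigmas: (k,), weights: (k,) current parameter values.
    alpha: trimming proportion; c: assumed bound on max/min error variance.
    """
    n, p = X.shape
    resid = y[:, None] - X @ betas.T                  # (n, k) vertical residuals
    # log(pi_j) + log N(resid; 0, sigma_j^2) for each observation/cluster pair
    logd = (np.log(weights) - 0.5 * np.log(2.0 * np.pi * sigmas**2)
            - 0.5 * (resid / sigmas) ** 2)
    nearest = logd.argmax(axis=1)
    best = logd.max(axis=1)
    n_keep = int(np.ceil(n * (1.0 - alpha)))
    keep = np.argsort(best)[::-1][:n_keep]            # best-fitting observations
    labels = np.full(n, -1)                           # -1 marks trimmed points
    labels[keep] = nearest[keep]

    new_betas = betas.astype(float).copy()
    new_var = sigmas.astype(float) ** 2
    new_w = np.empty(len(betas))
    for j in range(len(betas)):
        idx = np.flatnonzero(labels == j)
        if idx.size > p:                              # refit only if identifiable
            new_betas[j], *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
            new_var[j] = np.mean((y[idx] - X[idx] @ new_betas[j]) ** 2)
        new_w[j] = max(idx.size, 1) / n_keep          # guard against log(0)
    # Crude stand-in for a constrained update: lift small variances so that
    # max(var) / min(var) <= c. The paper's own update may differ.
    new_var = np.maximum(new_var, new_var.max() / c)
    return new_betas, np.sqrt(new_var), new_w, labels

# Toy data: two lines plus 20 gross outliers (all values are made up).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(np.arange(200) < 100, 1 + 2 * x, 20 - x) + rng.normal(0, 0.5, 200)
x = np.concatenate([x, rng.uniform(0, 10, 20)])
y = np.concatenate([y, rng.uniform(-50, -30, 20)])
X = np.column_stack([np.ones_like(x), x])

betas = np.array([[0.0, 2.2], [19.0, -0.8]])          # rough starting lines
sigmas, weights = np.array([1.0, 1.0]), np.array([0.5, 0.5])
for _ in range(10):
    betas, sigmas, weights, labels = concentration_step(
        X, y, betas, sigmas, weights, alpha=0.1)
print(betas, sigmas, int((labels == -1).sum()))
```

The sketch starts from hand-picked lines; a practical procedure would use many random starts and keep the best solution, and it would also need the second trimming step that this sketch omits.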
English conclusion
In this paper, we have presented a clusterwise linear regression methodology able to handle different group scatters and weights together with the presence of a certain amount of outlying observations. An algorithm arising from adapting the TCLUST algorithm to this framework is also given. A restriction constraining the ratio of the group scatters is imposed to make the theoretical problem well-defined and to avoid spurious solutions. The proposed algorithm considers the possibility of trimming data in two different ways. The so-called “second trimming” is useful for diminishing the effect of harmful leverage points.

There still exist open problems, such as how to choose the tuning parameters $k$, $\alpha_1$, $\alpha_2$ and $c$. Sometimes the user can make initial guesses of sensible values for them. However, appropriate values for these parameters are most of the time completely unknown, and a way of determining reasonable values for them is still needed. If the researcher does not want to impose very strong constraints, the constant $c$ is just a technical parameter that mainly serves to avoid degeneracies of the algorithm. Thus, it does not play a key role once pathological solutions are discarded. Our experience shows that values of $c$ ranging from 5 to 10 seem to be reasonable in most cases. In any case, we can always check whether the group scatter constraint was forced in the final solution by examining the output of the algorithm.

With respect to the trimming proportions, we recommend choosing $\alpha_1$ and $\alpha_2$ in a “preventive” fashion, i.e., it is better to choose values for $\alpha_1$ and $\alpha_2$ a bit larger than needed, thereby surely removing all the outlying observations together with some non-outlying ones. Afterwards, we can apply standard Regression Analysis tools to recover the observations that should not have been trimmed off.
These tools may also be used to break some artificial assignments made when the influence zones of the linear clusters are extended. For instance, “leverages” and “standardized residuals” are well-known diagnostic tools (see, e.g., Myers (1990)) that may be applied here. The leverage $h_{ii}$ for a given observation $i$ is defined through

$$h_{ii} = \frac{1}{n} + \frac{MD_i^2}{n-1} \quad \text{with} \quad MD_i = \sqrt{(x_i - m)' S^{-1} (x_i - m)}, \tag{5}$$

with $m$ and $S$ being location and scatter matrix estimators based only on the values taken by the explanatory variables (the $MD_i$'s are the well-known Mahalanobis distances). The standardized residual for observation $i$ is defined as

$$r_i = \frac{y_i - \hat{y}_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}, \tag{6}$$

where $\hat{y}_i$ is the (regression) fitted value for $y_i$ and $\hat{\sigma}$ is an estimator of the scatter of the regression error terms.

Assume that $\hat{\theta} = (\hat{\pi}_1, \ldots, \hat{\pi}_k, \hat{\beta}_1, \ldots, \hat{\beta}_k, \hat{\sigma}_1, \ldots, \hat{\sigma}_k)$ are the optimal values returned by the proposed algorithm and that $\hat{H} = \{\hat{H}_1, \ldots, \hat{H}_k\}$ are the indices of the observations surviving the two trimmings, with $n_j = \#\hat{H}_j$. We can use (5) with $m$ and $S$ replaced by location and scatter matrix estimators based only on the observations whose indices belong to $\hat{H}_j$ (recall that these observations are supposed to constitute the “core” of the linear structure). We thus obtain “pseudo-leverages” $h_{ii}^j$ for $i = 1, \ldots, n$ and $j = 1, \ldots, k$. The $h_{ii}^j$'s need to be truncated to a value close to 1 whenever $h_{ii}^j > 1$.
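Equations (5) and (6) can be checked numerically with a short Python sketch. It assumes that $m$ and $S$ are the sample mean and the sample covariance (divisor $n-1$) of the explanatory variables, in which case (5) reproduces the diagonal of the usual hat matrix of a regression with intercept. It also assumes the residual standard error for $\hat{\sigma}$. The function names are hypothetical.

```python
import numpy as np

def leverages_from_mahalanobis(Z):
    """Leverages via equation (5).

    Z: (n, d) explanatory variables WITHOUT the intercept column.
    Assumes m = sample mean and S = sample covariance (divisor n - 1).
    """
    n = Z.shape[0]
    m = Z.mean(axis=0)
    S = np.atleast_2d(np.cov(Z, rowvar=False))
    diff = Z - m
    # Squared Mahalanobis distances MD_i^2 = (z_i - m)' S^{-1} (z_i - m)
    md2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S), diff)
    return 1.0 / n + md2 / (n - 1)

def standardized_residuals(Z, y):
    """Standardized residuals via equation (6) for an OLS fit with intercept."""
    n = Z.shape[0]
    X = np.column_stack([np.ones(n), Z])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    h = leverages_from_mahalanobis(Z)
    # Assumed scatter estimator: residual standard error with n - p divisor
    sigma_hat = np.sqrt(resid @ resid / (n - X.shape[1]))
    return resid / (sigma_hat * np.sqrt(1.0 - h)), h

# Made-up data: two explanatory variables and a linear response with noise.
rng = np.random.default_rng(1)
Z = rng.normal(size=(50, 2))
y = 1.0 + Z @ np.array([2.0, -1.0]) + rng.normal(scale=0.3, size=50)
r, h = standardized_residuals(Z, y)

# Cross-check against the diagonal of the hat matrix X (X'X)^{-1} X'
X = np.column_stack([np.ones(50), Z])
hat_diag = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
print(np.allclose(h, hat_diag))
```

With robust choices of $m$ and $S$ the identity with the hat matrix no longer holds exactly, which is one reason the cluster-specific versions are called “pseudo-leverages”.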
Analogously, we can define “pseudo-standardized residuals” $r_i^j$ through

$$r_i^j = \frac{y_i - \hat{y}_i^j}{\hat{\sigma}_j \sqrt{1 - h_{ii}^j}} \quad \text{for } i = 1, \ldots, n \text{ and } j = 1, \ldots, k,$$

where the individual group scatter estimators $\hat{\sigma}_j$ are used and $\hat{y}_i^j$ denotes the fitted value for $y_i$ obtained from the linear regression associated with cluster $j$, i.e. $\hat{y}_i^j = x_i' \hat{\beta}_j$. To perform the final cluster assignment, we assign the observation $X_i$ to cluster $j$ whenever $h_{ii}^j < 3 \cdot p / n_j$ and $|r_i^j| < 3$ (these are standard cut-off values in Regression Analysis). If an observation satisfies these conditions for two or more $j$'s, a tie-breaking rule based on the sizes of $h_{ii}^j$ and/or $|r_i^j|$ should be applied. Fig. 7(a) shows the values of the $h_{ii}^j$'s and $|r_i^j|$'s based on the clustering results shown in Fig. 2(b). The horizontal dashed lines are the applied cut-off values.

Fig. 7. (a) Residuals and leverages; (b) final assignment. “Pseudo”-leverages and “pseudo”-standardized residuals (absolute values) together with the cut-off values (horizontal dotted lines) are shown in (a). The final cluster assignments starting from the clustering results in Fig. 2(b) are shown in (b).

This simple idea provides a final refinement of the initial solution which entails better assignments and incorporates some wrongly discarded observations. Compare the final clustering results in Fig. 7(b) with those in Fig. 2(b). This refinement stage makes the choice of the parameters $\alpha_1$ and $\alpha_2$ less crucial, because trimmed observations can be recovered, and it also provides some validation tools for evaluating the adequacy of the choice of $k$ and $c$.
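The final assignment rule described above can be sketched in Python as follows. Several details are assumptions rather than statements from the paper: the sample mean and covariance of each cluster core are used for $m$ and $S$, $n$ in (5) is replaced by $n_j$, $p$ is taken as the number of regression coefficients including the intercept, leverages are truncated at 0.99, and ties are broken by the smallest $|r_i^j|$. The function name `refine_assignment` is hypothetical.

```python
import numpy as np

def refine_assignment(Z, y, betas, sigmas, core_labels, trunc=0.99):
    """Reassign all observations using pseudo-leverages and pseudo-residuals.

    Z: (n, d) explanatory variables without intercept.
    betas: (k, d + 1) fitted coefficients, intercept first; sigmas: (k,).
    core_labels: cluster index of each untrimmed observation, -1 if trimmed.
    Returns new labels, with -1 for observations accepted by no cluster.
    """
    n, d = Z.shape
    k = betas.shape[0]
    p = d + 1                                   # assumed meaning of p
    X = np.column_stack([np.ones(n), Z])
    abs_r = np.full((n, k), np.inf)
    ok = np.zeros((n, k), dtype=bool)
    for j in range(k):
        core = Z[core_labels == j]              # "core" of linear structure j
        n_j = core.shape[0]
        m = core.mean(axis=0)
        S = np.atleast_2d(np.cov(core, rowvar=False))
        diff = Z - m
        md2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S), diff)
        h = 1.0 / n_j + md2 / (n_j - 1)         # equation (5) with n_j for n
        # Truncate to a value close to 1; also catches h == 1 so that the
        # square root below never divides by zero.
        h = np.where(h >= 1.0, trunc, h)
        r = (y - X @ betas[j]) / (sigmas[j] * np.sqrt(1.0 - h))
        abs_r[:, j] = np.abs(r)
        ok[:, j] = (h < 3.0 * p / n_j) & (abs_r[:, j] < 3.0)
    # Assumed tie-break: among accepting clusters, take the smallest |r|
    score = np.where(ok, abs_r, np.inf)
    labels = score.argmin(axis=1)
    labels[~ok.any(axis=1)] = -1
    return labels
```

Called with the labels left after the two trimmings, the function returns a labelling in which well-fitting trimmed observations are given back to a cluster while gross outliers keep the label -1.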
When $\alpha_1$ and $\alpha_2$ are fixed, we can perhaps resort to BIC calculations to derive appropriate choices for $k$ by handling penalized “trimmed” likelihoods, as was done in Neykov et al. (2007). In any case, this possibility needs to be carefully explored.