A diagnostic method for simultaneous feature selection and outlier detection in linear regression
|Article code||Year of publication||English article||Persian translation||Word count|
|24310||2010||13-page PDF||Available to order||7439 words|
Publisher : Elsevier - Science Direct
Journal : Computational Statistics & Data Analysis, Volume 54, Issue 12, 1 December 2010, Pages 3181–3193
A diagnostic method along the lines of forward search is proposed to simultaneously study the effect of individual observations and features on the inferences made in linear regression. The method operates by appending dummy variables to the data matrix and performing backward selection on the augmented matrix. It outputs sequences of feature–outlier combinations which can be evaluated by plots similar to those of forward search and includes the capacity to incorporate prior knowledge, in order to mitigate issues such as collinearity. It also allows for alternative ways to understand the selection of the final model. The method is evaluated on five data sets and yields promising results.
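The core idea in the abstract, appending dummy variables to the data matrix and running backward selection on the augmented matrix, can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `backward_selection_order`, the user-supplied `candidate_rows` (observations suspected of being outliers), and the elimination rule (drop the column with the smallest absolute t-statistic) are all assumptions made for illustration, and the paper's procedure for choosing an initial subset is not reproduced here.

```python
import numpy as np


def backward_selection_order(X, y, candidate_rows):
    """Return column labels in the order backward elimination drops them.

    Assumption: one indicator (dummy) column is appended per row in
    candidate_rows, and the weakest column by |t| is removed at each step.
    Requires n > 1 + p + len(candidate_rows) so residual degrees of
    freedom stay positive.
    """
    n, p = X.shape
    k = len(candidate_rows)

    # Dummy column j is 1 at observation candidate_rows[j] and 0 elsewhere.
    dummies = np.zeros((n, k))
    dummies[candidate_rows, np.arange(k)] = 1.0

    # Augmented matrix: intercept, original features, observation dummies.
    A = np.hstack([np.ones((n, 1)), X, dummies])
    labels = (["const"]
              + [f"x{j}" for j in range(p)]
              + [f"obs{i}" for i in candidate_rows])

    active = list(range(1, A.shape[1]))  # the intercept (column 0) is always kept
    dropped = []
    while active:
        cols = [0] + active
        M = A[:, cols]
        beta, *_ = np.linalg.lstsq(M, y, rcond=None)
        resid = y - M @ beta
        sigma2 = resid @ resid / (n - len(cols))
        se = np.sqrt(sigma2 * np.diag(np.linalg.pinv(M.T @ M)))
        t_abs = np.abs(beta) / se
        # Skip the intercept when looking for the weakest column.
        weakest = int(np.argmin(t_abs[1:]))
        dropped.append(labels[active.pop(weakest)])
    return dropped


# Synthetic data (hypothetical): x2 is pure noise, rows 5 and 17 are shifted.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=40)
y[5] += 10.0
y[17] -= 10.0
print(backward_selection_order(X, y, [5, 17, 20]))
```

Columns dropped late in the sequence are the features and observations that the fit depends on most. In this synthetic setting, the noise feature and the dummy for the unshifted row would be expected to drop first. Each intermediate step corresponds to one feature-outlier combination of the kind the abstract says can be evaluated with plots.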
Outliers substantially complicate the already difficult task of model selection in linear regression. Both which features to select and how many to select can be grossly influenced by outliers, and, to make matters worse, the features selected in a model influence which observations are considered outliers. Robust model selection, however, brings complications even beyond its statistical framework. For instance, one complication of outlier detection is that a point which statistical methods deem an outlier could in fact be the most important observation in the data set, depending on the application and the cause of the outlier. Forward search is one remedy for this in that it identifies outlying points of various magnitudes and creates plots that highlight the effect of each observation on various inferences made in linear regression. It thereby provides several possible good models and gives the analyst a way to visualize them. The goal of this paper is to extend these ideas to the case of simultaneous feature selection and outlier detection, which has arisen more recently in the literature. We organize the rest of the paper as follows: the remainder of this section provides a brief literature review, discussing other ways the problem of simultaneous feature selection and outlier detection has been tackled and how diagnostic methods have evolved. Section 2 provides a review of forward search. Section 3 discusses the method that we propose in this paper, which we call backward selection search. Section 4 presents the output of our method on five well-known data sets. Finally, conclusions are given in Section 5.
Conclusion (English)
We have proposed a diagnostic method for simultaneous feature selection and outlier detection, which yields promising results on the data sets we have used. Our method, like forward search, allows one to assess the effect of outliers on linear regression inferences. However, it also allows us to assess feature diagnostics and to visualize a sequence of models of interest. We can use any of the methods discussed in this paper to decide when to stop the sequencing. In addition, we can easily incorporate prior knowledge and hence control for issues arising from rank-deficient matrices. Our method still has the weakness that it is difficult to find a computationally efficient way to choose a good initial subset. This will be the direction of future work.