Evaluation of matching noise for imputation techniques based on nonparametric local linear regression estimators
|Article code||Publication year||English article||Persian translation||Word count|
|24278||2008||12-page PDF||Available to order||Not calculated|
Publisher : Elsevier - Science Direct
Journal : Computational Statistics & Data Analysis, Volume 53, Issue 2, 15 December 2008, Pages 354–365
A new matching procedure is introduced, based on imputing missing data by means of a local linear estimator of the underlying population regression function, which is not assumed to be linear. The procedure is compared with more traditional approaches, namely hot deck methods and methods based on kNN estimators. Performance is measured by the matching noise, that is, the discrepancy between the distribution generating the genuine data and the distribution generating the imputed values.
In several contexts, such as official statistics (D’Orazio et al., 2002, 2006), marketing (Rässler, 2002) and genetics (as for the data sets in repositories like genenetwork.org), data files coming from different sources are frequently available at a moderate cost. Each data file contains the values of only some of the variables of interest. This is a serious limitation when one is interested in the joint analysis of variables that are not jointly observed. The statistical matching problem consists in constructing a complete synthetic data file in which all the variables of interest are present. In a sense, this is a purely “descriptive” objective: representing the multivariate joint distribution, with the aim of creating a data set available to end-users. The synthetic data set is constructed by using imputation techniques. As a consequence, the joint distribution of the variables of interest in the synthetic data file does not generally coincide with the genuine distribution. This discrepancy is the matching noise. From an end-user perspective, the smaller the matching noise, the better the reconstructed data file. Different techniques have been proposed in the literature for tackling the statistical matching problem; among them, an important role is played by hot deck methods and kNN methods. Their properties are studied in Paass (1985) and Marella et al. (2008), where both theoretical and simulation results are obtained. In this paper we go further by introducing new nonparametric matching techniques based on local linear regression, which are compared to existing ones.
The paper is organized as follows. In Section 2 the main technical aspects are briefly introduced. In Section 3 a class of nonparametric imputation procedures is described, including the method based on the local linear estimator. In Section 4 the matching noise (for imputation based on local linear regression estimators) is formally evaluated. Finally, in Section 5 a simulation study is carried out.
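To make the matching problem described above concrete, the following is a minimal sketch of a distance hot deck on simulated data. It is an illustration under stated assumptions, not the paper's implementation: the function names `distance_hot_deck` and `ks_distance`, the sinusoidal data-generating model, and the use of a Kolmogorov-Smirnov type distance as a stand-in measure of matching noise are all hypothetical choices made here.

```python
import numpy as np

def distance_hot_deck(x_donor, z_donor, x_recipient):
    """For each recipient, copy z from the donor whose x is closest."""
    # Pairwise absolute distances, shape (n_recipient, n_donor).
    dist = np.abs(x_recipient[:, None] - x_donor[None, :])
    nearest = dist.argmin(axis=1)
    return z_donor[nearest]

def ks_distance(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / a.size
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / b.size
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(0)

def draw(n):
    # Assumed nonlinear relationship between X and Z (illustration only).
    x = rng.uniform(0.0, 1.0, n)
    z = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, z

# Donor file: X and Z observed. Recipient file: only X observed.
x_donor, z_donor = draw(500)
x_rec, z_true = draw(500)  # z_true is known only because the data are simulated

z_imputed = distance_hot_deck(x_donor, z_donor, x_rec)

# Stand-in for the matching noise: discrepancy between the distribution
# of imputed values and that of the genuine (here simulated) values.
noise = ks_distance(z_imputed, z_true)
print(noise)
```

In a real application the genuine values of the missing variable are unavailable, so a discrepancy of this kind can only be assessed in a simulation study or through theoretical results.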
Conclusion
In this paper, a method of imputation based on the local linear estimation of the regression function of the variables of interest has been introduced and compared (in terms of matching noise) to other popular imputation techniques (hot deck methods and methods based on kNN estimators). On theoretical grounds, imputation based on local linear regression is asymptotically matching noise free. Comparisons made by simulation show that the higher the complexity of the functional relationship between the predictor X and the response variable Z, the better the performance of the imputation method based on the local linear regression estimator. The performance of imputation based on the local linear regression estimator is close to that of mean kNN plus random residual for the reconstruction of the marginal distribution of Z, and to that of the distance hot deck when the interest is in the conditional distribution of Z∣X. As a result, this method offers an advantageous compromise for a good preservation of both the marginal and conditional distributions. As far as bandwidth selection is concerned, LRot and LGcv generally give good results. The LGcv method performs better when the population regression function is complex and far from linear. This parallels analogous results obtained by Marron and Wand (1992) for nonparametric density estimation, where cross validation gives good results when the density function to be estimated is particularly rough.
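The role of the bandwidth can be illustrated with a short sketch of a local linear estimator whose bandwidth is chosen by leave-one-out cross validation. This is a generic textbook construction, not the paper's LRot or LGcv selectors: the Gaussian kernel, the candidate grid, the simulated data and the function names `local_linear`, `loo_cv_score` and `select_bandwidth` are assumptions introduced only for illustration.

```python
import numpy as np

def local_linear(x0, x, z, h):
    """Local linear estimate of E[Z | X = x0] with a Gaussian kernel (assumed)."""
    u = x - x0
    w = np.exp(-0.5 * (u / h) ** 2)
    s0, s1, s2 = w.sum(), (w * u).sum(), (w * u * u).sum()
    t0, t1 = (w * z).sum(), (w * u * z).sum()
    # Intercept of the kernel-weighted least-squares line fitted around x0.
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 ** 2)

def loo_cv_score(x, z, h):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    errors = []
    for i in range(x.size):
        keep = np.arange(x.size) != i
        errors.append(z[i] - local_linear(x[i], x[keep], z[keep], h))
    return float(np.mean(np.square(errors)))

def select_bandwidth(x, z, candidates):
    """Return the candidate bandwidth with the smallest leave-one-out score."""
    scores = [loo_cv_score(x, z, h) for h in candidates]
    return candidates[int(np.argmin(scores))]

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 200)
# Assumed regression function, chosen to be far from linear.
z = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 200)

candidates = [0.02, 0.05, 0.1, 0.2, 0.5]  # hypothetical grid
h_best = select_bandwidth(x, z, candidates)
print(h_best)
```

When the regression function is far from linear, as in this simulated example, a large bandwidth oversmooths and is penalised by the cross-validation score, which is one way to see why a data-driven selector can help in such settings.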