Globally robust confidence intervals for simple linear regression
Article code | Year of publication | Pages in English article |
---|---|---|
24313 | 2010 | 15 pages (PDF) |
Publisher: Elsevier - Science Direct
Journal : Computational Statistics & Data Analysis, Volume 54, Issue 12, 1 December 2010, Pages 2899–2913
English abstract
It is well known that when the data may contain outliers or other departures from the assumed model, classical inference methods can be seriously affected and yield confidence levels much lower than the nominal ones. This paper proposes robust confidence intervals and tests for the parameters of the simple linear regression model that maintain their coverage and significance level, respectively, over whole contamination neighbourhoods. This approach can be used with any consistent regression estimator for which maximum bias curves are tabulated, and thus it is more widely applicable than previous proposals in the literature. Although the results regarding the coverage level of these confidence intervals are asymptotic in nature, simulation studies suggest that these robust inference procedures work well for small samples, and compare very favourably with earlier proposals in the literature.
English introduction
Consider the simple linear regression model where we observe a bivariate random sample $(Y_1, X_1), \ldots, (Y_n, X_n)$ satisfying

$$Y_i = \beta_0 + \beta_1 (X_i - \mu_X) + \sigma_0 \epsilon_i, \quad i = 1, \ldots, n, \tag{1}$$

where $X_i$ are univariate explanatory variables, $\mu_X = \mathrm{med}(X_i)$, the errors $\epsilon_i$ follow a known distribution $F_0$ and satisfy $\mathrm{med}(\epsilon_i \mid X_i) = 0$, $i = 1, \ldots, n$. In general, one assumes that the data are generated by a distribution $H_\theta$ belonging to a parametric family of distributions $\{H_\theta\}$, with $\theta \in \mathbb{R}^2$. To allow for outliers and other departures from the model, we will assume that the data follow a distribution $H$ in an $\varepsilon$-contamination neighbourhood $\mathcal{H}_\varepsilon(H_\theta)$ of the true underlying parametric model. More specifically,

$$\mathcal{H}_\varepsilon(H_\theta) = \{ H = (1 - \varepsilon) H_\theta + \varepsilon H^{*},\ H^{*} \text{ an arbitrary distribution on } \mathbb{R}^2 \}, \tag{2}$$

where $0 < \varepsilon < 0.5$.

Confidence intervals based on maximum likelihood estimators may be seriously affected by a small proportion of atypical observations (see, e.g. Tukey and McLaughlin (1963), Dixon and Tukey (1968), Huber (1968, 1970), Barnett and Lewis (1994), Fraiman et al. (2001) and Adrover et al. (2004)). We will say that a confidence interval is robust if it is able to maintain a high coverage level and a reasonable length when the data come from any distribution in the contamination neighbourhood (2). Formally, we have the following:

Definition 1. A confidence interval $(L_n, U_n)$ for $\theta \in \mathbb{R}$ is called globally robust of level $(1 - \alpha)$ if it satisfies the following conditions:

(1) (Stable interval.) The minimum asymptotic coverage over the $\varepsilon$-contamination neighbourhood is $1 - \alpha$, i.e.

$$\lim_{n \to \infty} \inf_{H \in \mathcal{H}_\varepsilon(H_\theta)} P_H(L_n < \theta < U_n) \ge 1 - \alpha.$$

(2) (Informative interval.) The maximum asymptotic length of the interval is bounded over the $\varepsilon$-contamination neighbourhood, i.e.

$$\lim_{n \to \infty} \sup_{H \in \mathcal{H}_\varepsilon(H_\theta)} [U_n - L_n] < \infty.$$

It is easy to see that, for the location model, confidence intervals of the form $\bar{X}_n \pm t_{(n-1)}(\alpha/2)\, S_n / \sqrt{n}$ do not satisfy either Part 1 or 2 of Definition 1. The problem with the above confidence intervals is not solely due to the lack of robustness of the estimators $\bar{X}_n$ and $S_n$. It can be shown that even if we replace the sample mean and standard deviation by robust counterparts $\hat{\theta}_n$ and $\hat{\sigma}_n$, the resulting confidence interval only satisfies Part 2 of the above Definition. The failure of intervals of the form $\hat{\theta}_n \pm t_{(n-1)}(\alpha/2)\, \hat{\sigma}_n / \sqrt{n}$ to satisfy Part 1 above is due to the fact that while the length of the interval converges to zero as $n \to \infty$, its center $\hat{\theta}_n$ may converge to a value different from the parameter of interest $\theta$.

This problem can be fixed by taking into account the largest possible difference between $\hat{\theta}(H)$, the limiting value of $\hat{\theta}_n$, and the parameter of interest $\theta$, across distributions $H$ in the contamination neighbourhood $\mathcal{H}_\varepsilon(H_\theta)$. This quantity is related to the maximum asymptotic bias of the estimator $\hat{\theta}_n$ (e.g. see Huber (1964)). For the location model $Y_i = \theta + \sigma_0 \epsilon_i$, the maximum asymptotic bias of $\hat{\theta}_n$ is

$$B(\hat{\theta}) = \sup_{H \in \mathcal{H}_\varepsilon(H_\theta)} \frac{|\hat{\theta}(H) - \theta|}{\sigma_0},$$

and thus, $|\hat{\theta}(H) - \theta| \le B(\hat{\theta})\, \sigma_0$ for all $H \in \mathcal{H}_\varepsilon(H_\theta)$. Let $\hat{\sigma}_n$ be an estimator of $\sigma_0$ with limit $\hat{\sigma}(H)$, which in principle may be different from $\sigma_0$. For each $H \in \mathcal{H}_\varepsilon(H_\theta)$ we have

$$|\hat{\theta}(H) - \theta| \le B(\hat{\theta})\, \sigma_0 = B(\hat{\theta})\, \frac{\sigma_0}{\hat{\sigma}(H)}\, \hat{\sigma}(H) \le B(\hat{\theta})\, B^{-}(\hat{\sigma})\, \hat{\sigma}(H), \tag{3}$$

where $B^{-}(\hat{\sigma}) = \sup_{H \in \mathcal{H}_\varepsilon(H_\theta)} \sigma_0 / \hat{\sigma}(H)$.

Tabulated values of $B^{-}(\hat{\sigma})$ for different scale estimators are available in Adrover and Zamar (2004). Hence, we can estimate the largest difference $|\hat{\theta}(H) - \theta|$ using $B(\hat{\theta})\, B^{-}(\hat{\sigma})\, \hat{\sigma}_n$.
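The bias argument above can be illustrated with a small simulation for the location model. The following is only a sketch under stated assumptions and is not the procedure proposed in the paper: the contaminating distribution is a point mass at an arbitrary value (`outlier`), the errors are standard normal, $\sigma_0$ is treated as known rather than estimated (so the factor $B^{-}(\hat{\sigma})$ is not needed), the normal quantile stands in for the $t_{(n-1)}$ quantile, and the estimator is the sample median, whose maximum asymptotic bias under normal errors is $\Phi^{-1}(1/(2(1-\varepsilon)))$. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, median, stdev


def contaminated_sample(n, eps, rng, theta=0.0, sigma0=1.0, outlier=10.0):
    """Draw n points from (1 - eps) * N(theta, sigma0^2) + eps * (point mass at outlier)."""
    return [outlier if rng.random() < eps else rng.gauss(theta, sigma0)
            for _ in range(n)]


def median_max_bias(eps):
    """Maximum asymptotic bias of the median for standard normal errors."""
    return NormalDist().inv_cdf(1.0 / (2.0 * (1.0 - eps)))


def classical_interval(y, z):
    """Mean +/- z * S_n / sqrt(n); z approximates the t quantile (assumption)."""
    half = z * stdev(y) / math.sqrt(len(y))
    centre = mean(y)
    return centre - half, centre + half


def bias_widened_interval(y, z, eps, sigma0):
    """Median-centred interval widened by the bias bound B * sigma0.

    Illustration only: sigma0 is assumed known, and sqrt(pi/2) * sigma0 / sqrt(n)
    is the asymptotic standard deviation of the median under the uncontaminated
    normal model.
    """
    half = (median_max_bias(eps) * sigma0
            + z * math.sqrt(math.pi / 2.0) * sigma0 / math.sqrt(len(y)))
    centre = median(y)
    return centre - half, centre + half


def coverage(interval_fn, n=200, eps=0.05, reps=400, theta=0.0, seed=1):
    """Fraction of simulated contaminated samples whose interval contains theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        lo, hi = interval_fn(contaminated_sample(n, eps, rng, theta=theta))
        hits += lo < theta < hi
    return hits / reps


if __name__ == "__main__":
    z = NormalDist().inv_cdf(0.975)  # nominal level 0.95
    print("classical:", coverage(lambda y: classical_interval(y, z)))
    print("bias-widened:", coverage(lambda y: bias_widened_interval(y, z, 0.05, 1.0)))
```

In this setting the classical interval should fall far short of the nominal coverage, while the widened interval stays near or above it at the cost of a length that no longer shrinks to zero, which matches the trade-off between Parts 1 and 2 of Definition 1.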
English conclusion
It is easy to see that when the data may contain outliers or other departures from the assumed model, classical inference methods can be seriously affected and might yield confidence levels much lower than the nominal values. In this paper we propose robust confidence intervals for the slope of simple linear regression models. These intervals combine robust non-parametric confidence intervals for location models with bias corrections to control the minimum coverage level even in the case of contaminated samples. Our approach can be applied to any consistent estimator of the slope and intercept for which maximum bias curves are tabulated. Earlier proposals in the literature (see Adrover et al. (2004)) required estimators that are $\sqrt{n}$-normal over the entire contamination neighbourhood, and also involved the estimation of bias bounds, which introduces further variability in the confidence intervals and affects their coverage levels. In addition, note that to use the approach discussed in this paper one needs to estimate neither the scale parameter of the errors $\sigma_0$ nor the asymptotic standard deviation of the regression estimator. Although our derivation is asymptotic in nature, our simulation studies suggest that this approach works well for small samples. In particular, note from Table 5 and Table 6 that these new robust confidence intervals maintain coverage levels much closer to the nominal one than the previous proposals without sacrificing length. Furthermore, in most cases in these tables, the new approach yields higher coverage levels with shorter intervals. Finally, we extend these ideas to the hypothesis testing setup and derive robust procedures that maintain the level of the test over the whole contamination neighbourhood.