Sensitivity analysis of constrained linear L1 regression: perturbations to response and predictor variables
Article code | Publication year | Pages in English article |
---|---|---|
25786 | 2005 | 4-page PDF |
Publisher : Elsevier - Science Direct
Journal : Computational Statistics & Data Analysis, Volume 48, Issue 4, 1 April 2005, Pages 779–802
English abstract
The active set framework of the reduced gradient algorithm is used to develop a direct sensitivity analysis of linear L1 (least absolute deviations) regression with linear equality and inequality constraints on the parameters. We investigate the effect on the L1 regression estimate of a perturbation to the values of the response or predictor variables. For observations with nonzero residuals, we find intervals for the values of the variables for which the estimate is unchanged. For observations with zero residuals, we find the change in the estimate due to a small perturbation to the variable value. The results provide practical diagnostic formulae. They quantify some robustness properties of constrained L1 regression and show that it is stable, but not uniformly stable. The level of sensitivity to perturbations depends on the degree of collinearity in the model and, for predictor variables, also on how close the estimate is to being nonunique. The results are illustrated with numerical simulations on examples including curve fitting and derivative estimation using trigonometric series.
English introduction
Consider a linear model $y = X\beta + \varepsilon$, where $y$ is an $n \times 1$ response vector corresponding to the $n \times p$ design matrix $X$ of predictor variable values, $\beta$ is an unknown $p \times 1$ vector of parameters and $\varepsilon$ is an $n \times 1$ vector of random errors. For our purposes it will be convenient to write the model as a system of linear equations $y_i = x_i^T\beta + \varepsilon_i$, $i = 1, \ldots, n$, where $x_i^T$ is the $i$th row of $X$.
In many applications there are additional linear constraints that must be satisfied by some or all of the parameters, for example, positivity. In particular, biometric and econometric models of the form , with positive , $\beta_2$ and $\beta_3$, are of this type after a logarithmic transformation (see p. 444 in Judge et al., 1985). Constrained regression problems also arise naturally in the important areas of parametric (and nonparametric) curve and surface fitting, and in the estimation of solutions of ill-posed and inverse problems from noisy data (see Wahba, 1990). Here, extra information such as the value of the solution at some point leads to a linear equality constraint on the parameters. Extra information such as positivity, monotonicity, concavity or convexity of the solution leads to a set of linear inequality constraints on the parameters (see Wahba, 1982 and O'Leary and Rust, 1986). For notational simplicity we will write the linear equality constraints as $x_i^T\beta - y_i = 0$, , and the inequality constraints as $x_i^T\beta - y_i \leqslant 0$, .
In the unconstrained case, it is usual to estimate $\beta$ using least squares (L2) regression. For the constrained problem, restricted or constrained least squares regression (as well as other approaches) have been used (see Knautz, 2000). However, as is well known, the least squares method is not robust; it is not optimal for error distributions with long tails and the estimates are overly sensitive to outliers. Over the past 25 years there has been growing interest in the method of least absolute deviations or L1 regression as an alternative to least squares regression.
For the linear model with linear constraints above, the L1 regression estimate of $\beta$ is the solution to the problem (denoted LL1)
$$\min_{\beta} \sum_{i=1}^{n} \left| x_i^T\beta - y_i \right| \tag{1.1A}$$
subject to the equality constraints
$$x_i^T\beta - y_i = 0 \tag{1.1B}$$
and the inequality constraints
$$x_i^T\beta - y_i \leqslant 0, \tag{1.1C}$$
with $i$ ranging over the respective constraint index sets, where we assume that .
An important advantage of L1 regression over L2 regression is its robustness. For the unconstrained problem (denoted UL1), it is well known that the L1 regression estimator can resist a few large errors in the data $y$. In fact (see Lemma 3.1 or Bloomfield and Steiger, 1983), the optimal solution (regression estimate) to UL1 is completely unaffected by a perturbation of $y$ that maintains the same signs of the residuals. Bloomfield and Steiger (1983, Section 2.3) also derived the generalized influence function for L1 regression, which shows its robustness with respect to $y_i$, but lack of robustness with respect to $x_i$. See Huber (1987) for further discussion of the L1 approach in robust estimation.
In this paper we analyse the sensitivity of the constrained L1 regression estimate to general perturbations in the data, both in the response $y$ and in the row vectors $x_i^T$. The results show that the constrained estimate is also robust with respect to some large perturbations in $y$. Furthermore they show that if the estimate is unique, then it is stable, in that it depends continuously on the data $y$ and $x_i^T$. However, the stability is not uniform but depends on the degree of collinearity in the model and, for $x_i^T$, also on how close the estimate is to being nonunique. This is consistent with the findings of Ellis (1998) who characterised the singular set in unconstrained L1 regression.
It is well known that the UL1 and LL1 problems can be formulated as linear programming (LP) problems (see e.g. Arthanari and Dodge, 1981). Efficient simplex-type methods have been developed to solve these LP problems including the algorithm of Barrodale and Roberts (1978).
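The invariance of the UL1 estimate under sign-preserving perturbations of $y$ can be checked on a toy problem. The following is a minimal sketch and not the paper's method: instead of an LP or reduced gradient solver, it brute-forces simple L1 regression by trying every line through two data points, which suffices because an L1 optimum with at least $p$ zero residuals exists when the design matrix has full column rank. The function name `l1_line_fit` and the data are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations


def l1_line_fit(xs, ys):
    """Brute-force L1 fit of y = b0 + b1 * x (hypothetical helper).

    Tries every line through two data points with distinct x values and
    returns the (b0, b1) with the smallest sum of absolute residuals.
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # such a pair does not define a regression line
        b1 = (Fraction(ys[j]) - Fraction(ys[i])) / (Fraction(xs[j]) - Fraction(xs[i]))
        b0 = Fraction(ys[i]) - b1 * xs[i]
        loss = sum(abs(b0 + b1 * x - y) for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, b0, b1)
    return best[1], best[2]


# Made-up data; the last response is an outlier lying above the fitted line.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 4, 8, 20]
fit = l1_line_fit(xs, ys)

# Push the outlier much further away on the same side of the line, so the
# sign of its residual is unchanged.
ys_perturbed = [1, 3, 4, 8, 1000]
fit_perturbed = l1_line_fit(xs, ys_perturbed)

print(fit)                    # (Fraction(1, 2), Fraction(5, 2))
print(fit == fit_perturbed)   # True
```

The enumeration costs on the order of $n^3$ operations and is only meant for tiny examples; exact rational arithmetic is used so that the comparison of the two fits is not blurred by rounding.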
Other methods including interior point algorithms have also been developed (see Portnoy and Koenker, 1997; Shi and Lukas, 2002 and the references there). With the LP formulation of the L1 regression problem it is possible to derive sensitivity results using the usual LP sensitivity analysis (i.e. post optimality analysis) based on the simplex method. This approach was used by Narula and Wellington (1985) for UL1 to find an interval for each data value $y_i$ for which the regression estimate is unchanged and to determine the effect of deleting an observation. In a similar way, Narula and Wellington (2002) find an interval for each predictor variable value for which the regression estimate is unchanged. Corresponding results for the case of simple linear L1 regression are derived in Narula et al. (1993). Independently, Dupačová (1992) developed a sensitivity analysis for the LP formulation of UL1 which considers the effect of perturbing the response $y$, perturbing the row vectors $x_i^T$, and adding or deleting an observation.
We extend the results of Narula and Wellington (1985, 2002) and Dupačová (1992) by deriving a direct sensitivity analysis for the general LL1 problem. The analysis is based on the active set framework of the reduced gradient algorithm (RGA), as developed in Shi and Lukas (2002). Using the corresponding terminology, we will call a model equation $y_i = x_i^T\beta + \varepsilon_i$ an active equation at some point $\beta$ if the residual $r_i(\beta) \equiv x_i^T\beta - y_i$ equals 0 and inactive otherwise, and similarly for the constraints. Note that the residual here has the opposite sign to the usual definition. In this framework, many of the results can be easily visualised geometrically in terms of the movement of hyperplanes defined by the equations and constraints. Note that the results themselves are independent of the algorithm used to find the solution. Also the results are quantitative with computable formulae, making them useful in practice for diagnostic purposes.
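The active and inactive terminology can be made concrete with a few lines of code. This is a minimal sketch under stated assumptions: the helper name `classify_equations`, the tolerance `tol`, the data and the point `beta` are all illustrative and are not taken from the paper. The residual uses the sign convention $r_i(\beta) = x_i^T\beta - y_i$ from the text.

```python
def classify_equations(X, y, beta, tol=1e-9):
    """Split equation indices into active (zero residual) and inactive.

    Residuals follow the convention r_i(beta) = x_i^T beta - y_i.
    `tol` is an assumed numerical threshold for treating a residual as zero.
    """
    residuals = [
        sum(a * b for a, b in zip(row, beta)) - yi
        for row, yi in zip(X, y)
    ]
    active = [i for i, r in enumerate(residuals) if abs(r) <= tol]
    inactive = [i for i, r in enumerate(residuals) if abs(r) > tol]
    return residuals, active, inactive


# Made-up design matrix (intercept column plus one predictor) and responses.
X = [[1, 0], [1, 1], [1, 2], [1, 3], [1, 4]]
y = [1, 3, 4, 8, 20]
beta = [0.5, 2.5]  # hypothetical point at which the equations are classified

residuals, active, inactive = classify_equations(X, y, beta)
print(residuals)  # [-0.5, 0.0, 1.5, 0.0, -9.5]
print(active)     # [1, 3]
print(inactive)   # [0, 2, 4]
```

Geometrically, the active equations are the hyperplanes passing through `beta`; here there are exactly two of them in a two-parameter model.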
We do not assume the design matrix has full rank, but we assume throughout that the optimal solution is nondegenerate (see Definition 2.1).
In Section 2 we describe the RGA framework including appropriate optimality conditions. An optimal solution to (1.1) occurs at a special kind of point determined by a basis of the vectors $x_i$ (possibly augmented). We call such a point and its associated basis matrix a base point and base matrix, respectively (see Definition 2.1).
In Section 3 we investigate the effect on the optimal solution of a perturbation to the responses $y_i$, $i = 1, 2, \ldots, n$. In Lemma 3.1, for each inactive equation at the solution, we find an interval for $y_i$ for which the solution remains unchanged. The intervals agree with those of Narula and Wellington (1985, 2002) and the sufficiency condition of Dupačová (1992, Eq. (13)). We also show by counterexample that these interval conditions are not necessary for the solution to remain unchanged. In Lemma 3.3, for each active equation, we derive the change (error) in the solution due to a sufficiently small perturbation in $y_i$. The result allows one to decide which of the responses for the active equations has the greatest marginal influence on the L1 regression estimate. Theorem 3.1 considers the effect of an arbitrary (sufficiently small) perturbation $\Delta y$ to the data vector $y$ and gives a bound on the L1 norm of the error in the solution, showing that the solution is stable.
The results in Section 3 are illustrated in Section 5 using numerical simulations with the well-known stack loss data from Brownlee (1965, p. 454) and the three problems of curve fitting, and the estimation of first and second derivatives using the parametric form of a trigonometric series. The estimation of derivatives is important in many areas, in particular in the analysis of human growth curves (Gasser et al., 1984; Eubank, 1988) and pharmacokinetic data (Song et al., 1995).
Numerical differentiation is an ill-posed problem which leads to a collinear (ill-conditioned) design matrix (see Anderssen and Bloomfield, 1974). Such problems were in fact a major motivation for the sensitivity analysis in this paper. The numerical simulations show that the bound derived in Theorem 3.1 is quite useful in assessing the accuracy of the L1 estimate under small perturbations in the data.
In Section 4 we consider perturbations to the row vectors $x_i^T$ in the model equations. In Theorem 4.1 and Corollary 4.1 we consider the inactive equations and find an interval for each element of $x_i^T$ such that the optimal solution to (1.1) is unchanged. In Theorem 4.2, for each active equation, we find the change (error) due to a perturbation in the row vector, showing that the solution is stable. These results again provide useful diagnostic information in deciding which elements have the greatest influence on the solution. In Section 5, numerical simulations with the stack loss model are used to illustrate the results.
Most of the sensitivity results of this paper are contained in the thesis by Shi (1997) where some more examples can be found. Results about perturbations to the constraints and the addition or deletion of observations are derived in Lukas and Shi (2004). This includes the calculation of an L1 version of the Cook distance to determine the influence of each observation on the L1 regression estimate.
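The role of the active equations and of collinearity can be illustrated in the simplest setting. The sketch below is not the paper's Lemma 3.3 or Theorem 4.2. It assumes an unconstrained two-parameter problem, a nondegenerate solution with exactly two active equations, and perturbations small enough that the active set does not change. Under those assumptions the estimate solves $B\beta = y_B$, where the rows of the (hypothetical) base matrix $B$ are the active $x_i^T$, so a change in an active response moves $\beta$ by a column of $B^{-1}$, and a change in an active row changes $B$ itself. All names and numbers are illustrative.

```python
from fractions import Fraction as F


def solve_2x2(B, rhs):
    """Solve B @ beta = rhs for a 2x2 matrix B by Cramer's rule."""
    (a, b), (c, d) = B
    det = a * d - b * c
    if det == 0:
        raise ValueError("base matrix is singular")
    return [(rhs[0] * d - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]


# Assumed base matrix (rows of the two active equations) and their responses.
B = [[F(1), F(1)], [F(1), F(3)]]
y_B = [F(3), F(8)]
beta = solve_2x2(B, y_B)


def marginal_change(k):
    """Change in beta per unit change in the k-th active response,
    i.e. the k-th column of the inverse of B."""
    unit = [F(1) if i == k else F(0) for i in range(2)]
    return solve_2x2(B, unit)


# L1 norm of the marginal change: larger means a more influential response.
influence = [sum(abs(v) for v in marginal_change(k)) for k in range(2)]


def change_from_row_perturbation(h, eps):
    """L1 norm of the change in beta when one element of an active row
    moves by eps; the two active rows are (1, 1) and (1, 1 + h)."""
    base = [[F(1), F(1)], [F(1), F(1) + h]]
    perturbed = [[F(1), F(1)], [F(1), F(1) + h + eps]]
    before = solve_2x2(base, y_B)
    after = solve_2x2(perturbed, y_B)
    return sum(abs(p - q) for p, q in zip(after, before))


eps = F(1, 100)
well_separated = change_from_row_perturbation(F(2), eps)
nearly_collinear = change_from_row_perturbation(F(1, 10), eps)

print(beta)              # [Fraction(1, 2), Fraction(5, 2)]
print(influence)         # [Fraction(2, 1), Fraction(1, 1)]
print(well_separated)    # 5/201
print(nearly_collinear)  # 100/11
```

In this toy case the same perturbation of 0.01 changes the estimate by about 0.025 when the active rows are well separated and by about 9.1 when they are nearly collinear, which is the kind of non-uniform stability the text describes. Closeness to nonuniqueness, the other factor mentioned for predictor perturbations, is not captured by this sketch.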