Least-trimmed squares (LTS) estimation is a robust solution for regression problems. On the one hand, it can achieve any given breakdown value by setting a proper trimming fraction. On the other hand, it has √n-consistency and asymptotic normality under some conditions. In addition, the LTS estimator is regression, scale, and affine equivariant. In practical regression problems, we often need to impose constraints on slopes. In this paper, we describe a stable algorithm to compute the exact LTS solution for simple linear regression with constraints on the slope parameter. Without constraints, the overall complexity of the algorithm is O(n² log n) in time and O(n²) in storage. According to our numerical tests, constraints can reduce the computing load substantially. To achieve stability, we design the algorithm in such a way that we can take advantage of well-developed sorting algorithms and software. We illustrate the algorithm with some examples.
Statisticians routinely apply regression analysis to fit models to observations. To deal with outliers, we seek robust and resistant regression procedures. A number of perspectives on the definition of robustness exist in the literature. For example, Huber (1964) studied robustness from the point of view of minimax variance. Hampel (1971, 1974) proposed the influence function as an asymptotic tool for studying robustness. The breakdown point is another important notion in robust analysis. Donoho and Huber (1983) defined a finite-sample version of the breakdown point. Consider the classical linear model
y = Xβ + ε,  (1)
where y = (y_1, …, y_n)ᵀ, ε = (ε_1, …, ε_n)ᵀ, β = (β_1, …, β_p)ᵀ, and X = (x_ij), i = 1, …, n, j = 1, …, p. For a given parameter vector β0, we define the residuals by r(β0) = y − Xβ0. The least-squares estimator minimizes the sum of squared residuals over β. The breakdown value of the least-squares estimator is 1/n → 0 as n → ∞. On the other hand, the highest possible breakdown point is 50%. One solution that reaches this bound is the least median of squares (LMS) estimator, cf. Rousseeuw (1984), which minimizes the median of the squared residuals: min_β med_i r_i²(β). Unfortunately, the asymptotic efficiency of LMS is unsatisfactory because its convergence rate is only of the order n^(−1/3). Another robust solution is the least-trimmed squares (LTS) estimator, which takes as its objective function the sum of the h smallest squared residuals; see Rousseeuw (1984). We denote the squared residuals in ascending order by r²(β)_(1) ⩽ r²(β)_(2) ⩽ ⋯ ⩽ r²(β)_(n). Then the LTS estimate of coverage h is obtained by minimizing ∑_{i=1}^{h} r²(β)_(i) over β.
This definition implies that the n − h observations with the largest squared residuals do not affect the estimate. The LTS estimator is regression, scale, and affine equivariant; see Rousseeuw and Leroy (1987, Lemma 3, Chapter 3). In terms of robustness, we can roughly achieve a breakdown point of ρ by setting h = [n(1 − ρ)] + 1. In terms of efficiency, √n-consistency and asymptotic normality similar to those of M-estimators hold for LTS under some conditions; see Víšek (1996, 2000) for example. Despite these good properties, the computation of LTS remains a problem.
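To make the definition concrete, here is a minimal Python sketch of the objective (the names lts_objective and coverage are ours, for illustration only):

```python
import numpy as np

def lts_objective(beta, X, y, h):
    """Sum of the h smallest squared residuals, i.e. sum_{i=1}^h r^2(beta)_(i)."""
    r2 = (y - X @ beta) ** 2        # squared residuals r_i^2(beta)
    return np.sort(r2)[:h].sum()    # trim the n - h largest ones

def coverage(n, rho):
    """Coverage h = [n(1 - rho)] + 1 targeting a rough breakdown point rho."""
    return int(np.floor(n * (1.0 - rho))) + 1
```

For instance, n = 100 and ρ = 0.25 give h = 76, so the 24 observations with the largest squared residuals are ignored by the objective.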
The problem of computing the LTS estimate of β is equivalent to searching for the size-h subset(s) whose least-squares solution achieves the minimum of trimmed squares. The total number of size-h subsets in a sample of size n is the binomial coefficient C(n, h). A full search through all size-h subsets is infeasible unless the sample size is small. Several ideas have been proposed to compute approximate solutions. First, instead of an exhaustive search, we can randomly sample size-h subsets. Second, Rousseeuw and Van Driessen (1999) proposed the so-called C-step technique (C stands for “concentration”). That is, having selected a size-h subset, we apply the least-squares estimator to it. Next, for the estimated regression coefficients, we evaluate the residuals of all observations. Then the size-h subset consisting of the observations with the smallest squared residuals is selected. This step can be iterated starting from any subset. In the case of estimating a location parameter, Rousseeuw and Leroy (1987, pp. 171–172) described a procedure to compute the exact LTS solution. Rousseeuw and Van Driessen (1999) applied this idea to adjust the intercept in the regression case. The idea of subset search is also relevant for LMS; see Hawkins (1993). Hössjer (1995) described an algorithm for computing the exact LTS estimate of simple linear regression, which requires O(n² log n) computations and O(n²) storage. A refined algorithm, which requires O(n² log n) computations and O(n) storage, was also sketched. However, Hössjer remarked that the refined algorithm is not stable.
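The C-step is easy to state in code. The following is a minimal sketch under our own naming (c_step, iterate_c_steps); it is not the full algorithm of Rousseeuw and Van Driessen (1999), which adds random starts and other refinements:

```python
import numpy as np

def c_step(X, y, subset):
    """One concentration step: least squares on the current size-h subset,
    then keep the h observations with the smallest squared residuals."""
    beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
    r2 = (y - X @ beta) ** 2
    return np.argsort(r2)[:len(subset)]

def iterate_c_steps(X, y, subset, max_iter=50):
    """Iterate C-steps until the subset stabilizes. Each step cannot increase
    the trimmed sum of squares, but the limit is only a local optimum."""
    subset = np.asarray(subset)
    for _ in range(max_iter):
        new_subset = c_step(X, y, subset)
        if set(new_subset) == set(subset):
            break
        subset = new_subset
    return subset
```

Because C-steps converge only to a local optimum, the procedure is restarted from many random subsets in practice; this is what makes it an approximate method.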
In this paper, we describe an algorithm that computes the exact LTS solution for simple linear regression with constraints on the slope. The idea is to divide the range of the slope into regions in such a way that within each region the order of the residuals is unchanged. As a result, the search for the LTS solution within each region becomes a problem of linear complexity. Hössjer (1995) considered a similar idea from the perspective of dual plots and absolute dual plots. In comparison, we treat constraints directly by checking the Kuhn–Tucker conditions, and we handle ties of slopes by defining equivalence classes. Without constraints, the overall complexity of the algorithm is O(n² log n) in time and O(n²) in storage. In practical regression problems, we often need to impose constraints on slopes. According to our numerical tests, these constraints can reduce the computing load substantially. To achieve stability, we design the algorithm in such a way that we can take advantage of well-developed sorting algorithms and software.
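To convey the region-splitting idea, the sketch below (our simplified reading, not the exact algorithm of this paper) enumerates the candidate slopes at which the ordering can change. Viewing the partial residuals e_i(b) = y_i − b·x_i as lines in b, the ordering of the e_i(b) changes only where two lines cross, i.e. at the pairwise slopes b_ij = (y_i − y_j)/(x_i − x_j); a slope constraint [b_lo, b_hi] discards crossings outside the interval, which is one way to see why constraints reduce the computing load:

```python
import numpy as np

def slope_breakpoints(x, y, b_lo=-np.inf, b_hi=np.inf):
    """Pairwise slopes where two partial-residual lines e_i(b) = y_i - b*x_i
    cross; between consecutive breakpoints the ordering of the e_i(b),
    and hence of the residuals, stays fixed."""
    n = len(x)
    bps = []
    for i in range(n):
        for j in range(i + 1, n):
            if x[i] != x[j]:                  # equal x's give parallel lines
                b = (y[i] - y[j]) / (x[i] - x[j])
                if b_lo <= b <= b_hi:         # constraint prunes whole regions
                    bps.append(b)
    return np.unique(bps)                     # sorted; O(n^2) in general
```

Within each region, the ordering is fixed and the remaining search reduces to a one-dimensional, location-type LTS problem over contiguous blocks of the sorted e_i(b), which is where the linear complexity per region comes from.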
We arrange the material in this paper as follows. In Section 2, we describe the main results. In Section 3, we first briefly describe the color-correction problem in DNA sequencing that motivates our work, and then apply our algorithm to a simulated example. In Section 4, we discuss some relevant issues.