The prediction of a response random interval-valued set from an explanatory one has been examined in previous developments. These developments have considered an interval arithmetic-based linear model between the random interval-valued sets and a least squares regression analysis. The least squares approach involves a generalized L2-metric between interval data; this metric weights squared distances between data locations (mid-points/centers) and squared distances between data imprecisions (spreads/radii). As a consequence, estimators of the parameters in the linear model depend on the choice of the weights in the metric. To investigate a suitable choice of the weights in the generalized mid/spread metric, a theoretical conclusion is first obtained. The impact of varying the weights is then discussed by means of a Monte Carlo simulation study.
In investigating the relationship between random elements, regression analysis makes it possible to look for the causal effect of one (or several) random element(s) upon another. Regression techniques have long been relevant to many fields. Most regression methods assume that the random elements involved can be formalized as real-valued random variables.
However, there are many practical situations in which the attributes involved take interval values rather than real values. Many instances of the usual sources of interval-valued data can be found in Lubiano [32], Ferson et al. [14], [15] and [16], Billard and Diday [3], D’Urso and Giordani [11], Kreinovich et al. [28], and Chuang [8]. These sources include intermittent measurements, censoring, data binning, cyclical fluctuations, ranges, and so on. The statistical analysis of these data is therefore especially interesting in real-life applications.
The problem of linear regression analysis with interval data has been studied from different perspectives and in different frameworks (see, for instance, Diamond [9], Lubiano [32], Billard and Diday [2], Gil et al. [18], [19] and [20], Manski and Tamer [33], Montenegro [34], De Carvalho et al. [10], and Lima Neto and De Carvalho [30] and [31]).
A least squares approach for an interval arithmetic-based linear model has recently been carried out (see González-Rodríguez et al. [24] and [23], Gil et al. [21], Blanco et al. [4], and Blanco-Fernández et al. [6]). This approach involves essential and distinctive features, such as the following:
• The approach is based on the usual interval arithmetic to formalize the linear relationship between the response and explanatory random elements. Consequently, this approach looks jointly at the location and the imprecision characterizing interval data, instead of treating them separately.
• The so-called t-vector function, or mid/spread characterization of the nonempty compact intervals, makes it possible to identify interval data with certain $\mathbb{R}^2$-valued data. This identification allows us to induce a generalized metric between intervals, as well as the model in the probabilistic setting for interval-valued random elements and the associated relevant summary measures of its distribution.
• The least squares methodology is based on the above-mentioned arithmetic and generalized metric.
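The three ingredients listed above can be illustrated with a minimal sketch. It assumes the mid/spread representation t(A) = (mid A, spr A) and a distance of the form d_θ(A, B)² = (mid A − mid B)² + θ (spr A − spr B)² with a weight θ > 0. This specific form of the metric, and the names `Interval` and `d_theta`, are illustrative assumptions and are not notation taken from the cited papers.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Interval:
    """Nonempty compact interval stored by its mid-point and spread (radius)."""
    mid: float
    spr: float

    def __post_init__(self):
        if self.spr < 0:
            raise ValueError("spread must be nonnegative")

    @classmethod
    def from_bounds(cls, lo, hi):
        # Mid/spread characterization of [lo, hi].
        return cls((lo + hi) / 2.0, (hi - lo) / 2.0)

    @property
    def bounds(self):
        return (self.mid - self.spr, self.mid + self.spr)

    def __add__(self, other):
        # Minkowski sum: mids add and spreads add.
        return Interval(self.mid + other.mid, self.spr + other.spr)

    def scale(self, a):
        # Product by a scalar: the mid is scaled by a, the spread by |a|.
        return Interval(a * self.mid, abs(a) * self.spr)


def d_theta(A, B, theta=1.0):
    """Assumed generalized mid/spread L2-type distance with weight theta > 0."""
    return math.sqrt((A.mid - B.mid) ** 2 + theta * (A.spr - B.spr) ** 2)
```

For instance, `Interval.from_bounds(1.0, 3.0).scale(-2.0)` has bounds (-6.0, -2.0): the location changes sign while the imprecision stays nonnegative, which is why location and imprecision are handled jointly and not as two unrelated real-valued regressions.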
Estimators of the parameters involved in the linear model have been obtained and analyzed under general conditions [24], [23], [21], [4], [5] and [6]. The estimators depend on the metric between interval data that is used to formalize the least squares approach. This metric generalizes the well-known metric in [1] (see also Trutschnig et al. [35] for a related detailed discussion).
As outlined by Gil et al. [20] and Montenegro [34], the (equivalent) mid-point/spread expression of this metric has been crucial in interpreting and determining estimators of the parameters of the linear regression problem (see Gil et al. [20]), and in performing tests under the linear model assumption (cf. González-Rodríguez et al. [24] and [23], Blanco et al. [4]). Kulpa [29] indicated the interest of the mid-point/spread tandem in some other implications from interval arithmetic. In a different approach to regression between interval-valued data, its importance has also been pointed out later by other authors (see, for instance, De Carvalho et al. [10] and Lima Neto and De Carvalho [30] and [31]).
In Section 2 of this paper, preliminary concepts and results are presented. Section 3 recalls the regression problem between two interval-valued random elements, and the interval arithmetic-based linear model along with the associated parameter estimators. In Section 4, a theoretical search for a suitable choice of the metric, on the basis of the mean square error of the estimators, is first carried out. The conclusions from the theoretical development are then corroborated by an empirical sensitivity analysis of the estimators in Monte Carlo simulations of relevant representative situations. Finally, some concluding remarks and future directions are discussed.
It has been shown that the parameter estimators of the interval arithmetic-based regression model for random intervals vary depending on the weights assigned to the distances between mids and between spreads. It has also been formally proved that the MSE of the estimator of the rate a is related to the relative variability of the mids with respect to the spreads. In some cases the MSE is almost constant, and the choice of θ is then almost irrelevant. Nevertheless, when the difference between the variance of the spreads and the variance of the mids is large, the choice of θ is crucial to obtain more efficient estimates (in terms of lower squared error).
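The dependence of the MSE on θ described above can be explored with a small simulation. The sketch below is not the estimator analyzed in the paper: it assumes a ≥ 0, drops the constraints of the full interval-valued estimator, and uses a closed-form slope that minimizes the sample variance of the mid residuals plus θ times that of the spread residuals. The distributions, sample sizes, and function names are illustrative assumptions.

```python
import numpy as np


def slope_estimate(mid_x, spr_x, mid_y, spr_y, theta):
    """Simplified least squares slope for Y = aX + error, assuming a >= 0.

    Minimizes Var(mid_y - a*mid_x) + theta * Var(spr_y - a*spr_x) over a,
    ignoring the constraints of the full interval-valued estimator.
    """
    cov_mid = np.cov(mid_x, mid_y)[0, 1]
    cov_spr = np.cov(spr_x, spr_y)[0, 1]
    num = cov_mid + theta * cov_spr
    den = np.var(mid_x, ddof=1) + theta * np.var(spr_x, ddof=1)
    return max(0.0, num / den)


def mse_by_theta(thetas, a=1.0, sd_mid_err=1.0, sd_spr_err=0.1,
                 n=50, reps=2000, seed=0):
    """Monte Carlo MSE of the simplified slope estimate for each weight theta."""
    rng = np.random.default_rng(seed)
    sq_err = np.zeros(len(thetas))
    for _ in range(reps):
        # Assumed data-generating process (hypothetical, for illustration only).
        mid_x = rng.normal(0.0, 1.0, n)
        spr_x = rng.chisquare(1, n)
        mid_e = rng.normal(0.0, sd_mid_err, n)
        spr_e = np.abs(rng.normal(0.0, sd_spr_err, n))  # spreads are nonnegative
        # Mid/spread form of the interval linear model Y = aX + error.
        mid_y = a * mid_x + mid_e
        spr_y = abs(a) * spr_x + spr_e
        for j, th in enumerate(thetas):
            est = slope_estimate(mid_x, spr_x, mid_y, spr_y, th)
            sq_err[j] += (est - a) ** 2
    return sq_err / reps
```

Calling `mse_by_theta([0.1, 1.0, 10.0])` returns one Monte Carlo MSE per weight. Under these assumed settings, where the error mids are far more variable than the error spreads, a larger θ should give a smaller MSE; making `sd_mid_err` and `sd_spr_err` comparable is a simple way to look at the nearly flat case.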
The empirical results suggest considering θ as a tuning parameter. Specifically, the quotient of the estimated variabilities of the mids and spreads of the error (or a correction suggested for general situations) has shown convenient theoretical and empirical support.
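One way to read the quotient rule is as a two-step plug-in procedure, sketched below under strong assumptions. The pilot weight, the use of residual sample variances as estimates of the error variabilities, and the function name `plug_in_theta` are all hypothetical; the correction mentioned for general situations is not reproduced here.

```python
import numpy as np


def plug_in_theta(mid_x, spr_x, mid_y, spr_y, pilot_theta=1.0):
    """Hypothetical two-step rule: ratio of residual variances of mids and spreads."""
    # Step 1: pilot slope from the weighted mid/spread least squares criterion.
    num = np.cov(mid_x, mid_y)[0, 1] + pilot_theta * np.cov(spr_x, spr_y)[0, 1]
    den = np.var(mid_x, ddof=1) + pilot_theta * np.var(spr_x, ddof=1)
    a_pilot = num / den
    # Step 2: estimate the variability of the error mids and error spreads
    # from the residuals of the mid and spread components.
    var_mid_err = np.var(mid_y - a_pilot * mid_x, ddof=1)
    var_spr_err = np.var(spr_y - abs(a_pilot) * spr_x, ddof=1)
    return var_mid_err / var_spr_err
```

The returned value can then be used as the weight θ in a second least squares fit. Whether one pilot step is enough, or the procedure should be iterated, is left open in this sketch.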
An immediately related future direction is the extension to the multiple linear regression problem with interval data, following the ideas for the interval arithmetic-based model in this paper. On the basis of the studies in this paper, the extension to multiple regression would be rather straightforward, but the difficulty of such a study lies in the lack of theoretical developments. These developments will be rather complex and will require the use of computational statistical techniques. This is at present an open problem, for which a very introductory analysis has been made in García-Bárzana et al. [17] and which is to be examined in depth in the future.
Another future direction is the development of an empirical sensitivity analysis of the choice of θ in testing independence, based on the power of the corresponding test. Finally, since the metric (see Trutschnig et al. [35]) and the regression problem (see D’Urso et al. [12] and [13] and González-Rodríguez et al. [22]) have also been studied for fuzzy values, it would be worthwhile to examine whether the discussion and conclusions in this paper carry over to that setting.