Download English ISI article No. 24657
Article title

On the possibilistic approach to linear regression models involving uncertain, indeterminate or interval data
Article code: 24657
Year of publication: 2013
Length of the English article: 22 pages (PDF)
Source

Publisher: Elsevier - ScienceDirect

Journal : Information Sciences, Volume 244, 20 September 2013, Pages 26–47

Keywords
Interval data, Uncertain data, Possibilistic regression, Computational complexity
Article preview

Abstract

We consider linear regression models where both input data (the observations of independent variables) and output data (the observations of the dependent variable) are affected by loss of information caused by uncertainty, indeterminacy, rounding or censoring. Instead of real-valued (crisp) data, only intervals are available. We study a possibilistic generalization of the least squares estimator, the so-called OLS-set for the interval model. Investigation of the OLS-set allows us to quantify whether the replacement of real-valued (crisp) data by interval values can have a significant impact on our knowledge of the value of the OLS estimator. We show that in the general case, very elementary questions about properties of the OLS-set are computationally intractable (assuming P ≠ NP). We also focus on restrictions of the general interval linear regression model to the crisp input case. Taking advantage of the fact that in the crisp input – interval output model the OLS-set is a zonotope, we design both exact and approximate methods for its description. We also discuss special cases of the regression model, e.g. a model with repeated observations.
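
The zonotope property mentioned in the abstract can be made plausible with a short derivation. The sketch below is not taken from the paper. It assumes the OLS choice $Q = (X^T X)^{-1} X^T$ with a crisp design matrix $X$ of full column rank and $n$ observations. The symbols $y_c$, $y_\Delta$, $t$ and $e_i$ (the $i$th standard basis vector) are notation introduced here for illustration only.

```latex
% Assumed notation (not from the source): centre and radius of the output intervals
y_c = \tfrac{1}{2}\,(\underline{y} + \overline{y}), \qquad
y_\Delta = \tfrac{1}{2}\,(\overline{y} - \underline{y}) \ge 0 .

% Every admissible y is the centre plus a scaled deviation t from the unit cube
y \in [\underline{y}, \overline{y}]
\iff
y = y_c + \operatorname{diag}(y_\Delta)\, t \quad \text{for some } t \in [-1, 1]^n .

% Applying the crisp linear map Q gives a centre plus a sum of scaled generators
\{\, Q y : y \in [\underline{y}, \overline{y}] \,\}
= \Bigl\{\, Q y_c + \sum_{i=1}^{n} t_i \,(y_\Delta)_i \, Q e_i \;:\; t \in [-1, 1]^n \Bigr\} .
```

The right-hand side is the image of a cube under an affine map, which is one standard way of defining a zonotope, here with centre $Q y_c$ and generators $(y_\Delta)_i \, Q e_i$. When $X$ itself is an interval matrix, $Q$ depends nonlinearly on $X$ and this argument no longer applies.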

Introduction

Consider the linear regression model

$$y = X\beta + \varepsilon, \tag{1}$$

where $y$ denotes the vector of observations of the dependent variable, $X$ denotes the design matrix of the regression model, $\beta$ denotes the vector of unknown regression parameters and $\varepsilon$ is the vector of disturbances. For the purposes of this paper, we do not need to make any special assumptions on the probabilistic properties of $\varepsilon$. We just assume that a linear estimator can be used for the estimation of $\beta$, i.e. an estimator of the form

$$\hat{\beta} = Qy, \tag{2}$$

where $Q$ is a matrix. In particular, we shall concentrate on the Ordinary Least Squares (OLS) estimator, which corresponds to the choice $Q = (X^T X)^{-1} X^T$ in (2). (As is well known, this estimator is a “good” estimator e.g. when the disturbances are independent and identically distributed with zero mean and finite variance.) Nevertheless, the theory is also applicable to other linear estimators, such as the Generalized Least Squares (GLS) estimator, which corresponds to the choice $Q = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1}$ in (2), where $\Omega$ is either a known or an estimated covariance matrix of $\varepsilon$. Other examples include estimation methods which first exclude outliers and then apply OLS or GLS. These estimators are often used in the analysis of contaminated data.

Throughout the paper, the symbol $n$ stands for the number of observations and the symbol $p$ stands for the number of regression parameters, as is usual in statistics. We shall treat $X$ and $y$ as constants representing observed values of the independent variables and the dependent variable, respectively. Then the tuple $(X, y)$ is called data for the regression model (1).

1.1. Interval data in the linear regression model

We shall study the situation when the data $(X, y)$ cannot be observed directly.
Instead of $y_i$ and $X_{ij}$, only intervals of the form $[\underline{y}_i, \overline{y}_i]$ and $[\underline{X}_{ij}, \overline{X}_{ij}]$ are available, where it is guaranteed that for all $i \in \{1, \dots, n\}$ and $j \in \{1, \dots, p\}$,

$$y_i \in [\underline{y}_i, \overline{y}_i] \quad \text{and} \quad X_{ij} \in [\underline{X}_{ij}, \overline{X}_{ij}],$$

where $y_i$ denotes the $i$th element of $y$ and $X_{ij}$ denotes the $(i, j)$th element of $X$. The replacement of real-valued (crisp) data by intervals is henceforth referred to as “censoring”. In some literature, this process is also called “trimming”, “uncertaintification” or “intervalization”.

1.2. Motivation

Inclusion of interval data in linear regression models is suitable for modeling a variety of real-world problems. For example:

• The data $(X, y)$ have been interval-censored. This is often the case with medical, epidemiologic or demographic data: only interval-censored data are published, while the exact individual values are kept secret.
• Data are rounded. If we store data using data types of restricted precision, then instead of exact values we are only guaranteed that the true value is in an interval of width $2^{-d}$, where $d$ is the number of bits of the data type used to represent the non-integer part. For example, if we store data as integers (i.e., $d = 0$), then we know only the interval $[\tilde{y} - 0.5, \tilde{y} + 0.5]$ instead of the exact value $y$, where $\tilde{y}$ is $y$ rounded to the nearest integer. This application is important in the theory of reliable computing.
• The data are uncertain or unstable. For that reason it might be inappropriate to describe them in terms of fixed values $(X, y)$ only.
• Categorical data may sometimes be interpreted as interval data; for example, credit rating grades can be understood as intervals of credit spreads over the risk-free yield curve.
• In econometric regression models, it is often the case that varying quantities are represented by their average or median values. For example, if the exchange rate for a period of 1 year should be included in the regression model, usually the average rate of that year is taken. However, it might be more appropriate to regard the exchange rate as an interval inside which the variable changes.
• Sometimes we use interval predictions as data in regression models. For example, consider a predictor of future inflation (an econometric model or a panel of experts, say), which is assumed to form inflation expectations. The predictions are intervals. Then another model, such as a consumption model or a capital expenditure model, uses the predicted inflation expectations as a regressor. Thus, the model has to be able to work with an interval regressor.

More applications of interval data in econometrics are found in [7]. Applications in information sciences can be found in [11]; see also applications in ergonomics [10], optimization and operational research [15], [37], [42] and [71], speech learning [45] and pattern recognition [39] and [43].

A variety of methods for estimating regression parameters in a regression with interval data have been developed. They are studied in statistics [8], [22], [36], [41], [44], [49], [55] and [76], where robust regression methods have also been proposed [32] and [50], in fuzzy theory [24], [29], [30], [72], [73] and [74], as well as in computer science [12], [31] and [34]. An algebraic treatment of least squares methods for interval data has been considered in [5] and [18].
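
The rounding scenario and the linear estimator (2) can be combined in a small numeric sketch. It is an illustration only and not a method from the paper. It assumes a crisp design matrix $X$ of full column rank and interval outputs, and it computes just the tightest axis-aligned box around the set of OLS estimates, not a full description of that set. The function names (`ols_matrix`, `rounded_to_intervals`, `ols_set_hull`) and the toy data are hypothetical. Exact rational arithmetic is used so that no additional rounding enters the example.

```python
from fractions import Fraction


def ols_matrix(X):
    """Return Q = (X^T X)^{-1} X^T for a crisp matrix X given as a list of rows.

    Assumes X has full column rank, so that X^T X is invertible.
    """
    n, p = len(X), len(X[0])
    # Augmented system [X^T X | X^T]; Gauss-Jordan turns the left block into I,
    # which leaves Q in the right block.
    aug = [
        [sum(Fraction(X[k][i]) * Fraction(X[k][j]) for k in range(n)) for j in range(p)]
        + [Fraction(X[k][i]) for k in range(n)]
        for i in range(p)
    ]
    for col in range(p):
        pivot = next(r for r in range(col, p) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def rounded_to_intervals(y_stored, d=0):
    """Turn stored values with d fractional bits into intervals of width 2**-d."""
    half_width = Fraction(1, 2 ** (d + 1))
    return [(Fraction(v) - half_width, Fraction(v) + half_width) for v in y_stored]


def ols_set_hull(X, y_intervals):
    """Component-wise bounds of {Q y : y in the given box}, for crisp X.

    Each coordinate of Q y is linear in y, so its minimum over the box takes
    the lower endpoint where the coefficient is non-negative and the upper
    endpoint where it is negative (and vice versa for the maximum).
    """
    hull = []
    for row in ols_matrix(X):
        lo = sum(q * (a if q >= 0 else b) for q, (a, b) in zip(row, y_intervals))
        hi = sum(q * (b if q >= 0 else a) for q, (a, b) in zip(row, y_intervals))
        hull.append((lo, hi))
    return hull


# Toy data (invented for illustration): intercept column plus one regressor,
# outputs stored as integers, i.e. d = 0 and each y_i known only up to +-0.5.
X_demo = [[1, 0], [1, 1], [1, 2]]
y_demo = rounded_to_intervals([1, 3, 5], d=0)

for name, (lo, hi) in zip(["intercept", "slope"], ols_set_hull(X_demo, y_demo)):
    print(f"{name} in [{lo}, {hi}]")
# intercept in [1/3, 5/3]
# slope in [3/2, 5/2]
```

The unrounded-looking data lie exactly on the line $y = 1 + 2x$, yet rounding to integers alone already leaves the slope uncertain by $\pm 0.5$. The box is exact in each coordinate but generally larger than the set itself, since the extreme values of different coordinates need not be attained by the same $y$.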

Conclusion

We have studied several properties of the set OLS(X, y) in the interval input – interval output model. It turned out that even very elementary questions about the set OLS(X, y) are computationally intractable. This negative result motivates a study of special cases, where we can hope that the situation is better. We devoted our effort to the practically important case where the input data X are crisp. Then the set OLS(X, y) has better geometric and algorithmic properties. In particular, various descriptions of the set OLS(X, y) can be constructed efficiently, provided the number of regression parameters is low compared to the number of observations (which is a typical case in data analysis). Formally, we stated the results in the form that if the number of regression parameters is fixed, then many tasks can be solved in polynomial time. We also dealt with some special regression models, such as models with repeated observations, where we can achieve further speedup. Finally, we turned our attention back to the general case of interval input – interval output models from a practical point of view. For practical purposes, a variety of methods for finding interval enclosures for the set OLS(X, y) are available. Nevertheless, we have constructed some elementary examples showing that the methods can provide highly redundant (and hence practically useless) results. The main drawback is that the methods rely on relaxation as described in Section 6.1. We expect that the special form of dependence, either in the system (11) or in the system (12), could be further analyzed and utilized to improve the known enclosure methods. Moreover, not only interval enclosures but also other types of enclosures (such as ellipsoidal enclosures) could be successful. We also expect that ongoing research will bring further improvements. It might be possible to improve Theorem 18 as discussed in Section 4.3.
We also expect that further applications of the RRR metaalgorithm of Section 5 could be found, in particular in polyhedral geometry. We also think that the methods of Sections 4 and 5 are suitable for implementation in software for the analysis of interval data.