Centre and Range method for fitting a linear regression model to symbolic interval data
|Article code||Year of publication||English article||Word count|
|24256||2008||16-page PDF||8905 words|
Publisher: Elsevier - ScienceDirect
Journal: Computational Statistics & Data Analysis, Volume 52, Issue 3, 1 January 2008, Pages 1500–1515
This paper introduces a new approach to fitting a linear regression model to symbolic interval data. Each example of the learning set is described by a feature vector, for which each feature value is an interval. The new method fits a linear regression model on the mid-points and ranges of the interval values assumed by the variables in the learning set. The prediction of the lower and upper bounds of the interval value of the dependent variable is accomplished from its mid-point and range, which are estimated from the fitted linear regression model applied to the mid-point and range of each interval value of the independent variables. The assessment of the proposed prediction method is based on the estimation of the average behaviour of both the root mean square error and the square of the correlation coefficient in the framework of a Monte Carlo experiment. Finally, the approaches presented in this paper are applied to a real data set and their performance is compared.
Predicting the behaviour of a (dependent) variable in relation to other (independent) variables that are thought to be responsible for the variability of the former is an important task in data analysis, pattern recognition, data mining, machine learning, etc. The classical regression model is used to predict the values of a dependent quantitative variable in relation to the values of independent quantitative variables. To fit this model to the data, it is necessary to estimate a vector $\beta$ of parameters from the data vector $Y$ and the model matrix $X$, assumed to have full rank $p$. Estimation by the least squares method does not require any probabilistic hypothesis on the variable $Y$; the method consists of minimising the sum of the squared residuals. A detailed study of linear regression models for usual quantitative data can be found in Draper and Smith (1981), Montgomery and Peck (1982) and Scheffé (1959), among others.
In regression analysis of usual data, the items are usually represented as a vector of quantitative measurements for which each column represents a variable. In practice, however, this model is too restrictive to represent complex data. In order to take into account the variability and/or uncertainty inherent to the data, variables must assume sets of categories or intervals, possibly even with frequencies or weights. Data of this type have been studied mainly in symbolic data analysis (SDA), which is a domain in the area of knowledge discovery and data management related to multivariate analysis, pattern recognition and artificial intelligence. The aim of SDA is to provide suitable methods (clustering, factorial techniques, decision trees, etc.) for managing aggregated data described by multi-valued variables, for which the cells of the data table contain sets of categories, intervals or weight (probability) distributions (Bock and Diday, 2000).
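As a point of reference for the classical setting described above, the least squares estimate of the parameter vector can be sketched in a few lines of NumPy. This is only an illustration: the function name `fit_least_squares` and the toy numbers are assumptions made for the example, not material from the paper.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return the beta that minimises the sum of squared residuals ||y - X beta||^2.

    X is the model matrix (assumed to have full column rank), y the data vector.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Hypothetical toy data: an intercept column plus one quantitative predictor.
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

beta = fit_least_squares(X, y)
residuals = y - X @ beta  # at the minimum these are orthogonal to the columns of X
```

No distributional assumption on `y` is needed to compute `beta`; the fit is purely the solution of an optimisation problem, which is the viewpoint the paper carries over to interval data.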
As mentioned above, the items are usually represented as a vector of quantitative measurements. However, due to recent advances in information technologies, it is now common to record interval data. In the framework of SDA, interval data appear when the observed values of the variables are intervals from the set of real numbers $\mathbb{R}$. Interval data arise in situations such as recording monthly interval temperatures in meteorological stations, daily interval stock prices, etc. Another source of interval data is the aggregation of large databases into a reduced number of groups, the properties of which are described by symbolic interval variables. Therefore, tools for symbolic interval data analysis are very much required.
Several approaches have been introduced to analyse symbolic interval data. Regarding univariate statistics, Bertrand and Goupil (2000) and Billard and Diday (2003) introduced central tendency and dispersion measures suitable for symbolic interval data. De Carvalho (1995) proposed histograms for symbolic interval data. Factorial methods for analysing symbolic interval data have also been considered in SDA. Cazes et al. (1997) and Lauro and Palumbo (2000) introduced principal component analysis methods suitable for symbolic interval data. Palumbo and Verde (2000) and Lauro et al. (2000) generalised factorial discriminant analysis (FDA) to symbolic interval data. Regarding classification, Ichino et al. (1996) introduced a symbolic classifier as a region-oriented approach for symbolic interval data. Rasson and Lissoir (2000) presented a symbolic kernel classifier based on dissimilarity functions suitable for symbolic interval data. Périnel and Lechevallier (2000) proposed a tree-growing algorithm for classifying symbolic interval data.
SDA provides a number of clustering methods for symbolic data. These methods differ with regard to the type of symbolic data considered, their cluster structures and/or the clustering criteria considered.
With hierarchical clustering methods, an agglomerative approach has been introduced that forms composite symbolic objects using a join operator whenever mutual pairs of symbolic objects are selected for agglomeration based on minimum dissimilarity (Gowda and Diday, 1991) or maximum similarity (Gowda and Diday, 1992). Ichino and Yaguchi (1994) defined generalised Minkowski metrics for mixed feature variables and presented dendrograms obtained from the application of standard linkage methods to data sets containing numeric and symbolic feature values. Chavent (1998) proposed a divisive clustering method for symbolic data that simultaneously furnishes a hierarchy of the symbolic data set and a monothetic characterisation of each cluster in the hierarchy. Guru et al. (2004) introduced agglomerative clustering algorithms based on similarity functions that are multi-valued and non-symmetric. Regarding partitioning clustering algorithms for symbolic interval data, Bock (2002) proposed several clustering algorithms for symbolic data described by interval variables, and presented a sequential clustering and updating strategy for constructing a self-organising map (SOM) to visualise symbolic interval data. Chavent and Lechevallier (2002) proposed a dynamic clustering algorithm for interval data where the class representatives are defined by an optimality criterion based on a modified Hausdorff distance. Souza and De Carvalho (2004) presented partitioning clustering methods for interval data based on (adaptive and non-adaptive) city-block distances. More recently, De Carvalho et al. (2006) proposed an algorithm using an adequacy criterion based on adaptive Hausdorff distances.
This paper addresses linear regression models for predicting symbolic interval data. Billard and Diday (2000) presented the first approach to fitting a linear regression model to symbolic interval data sets from an SDA perspective.
Their approach consists of fitting a linear regression model to the mid-points of the interval values assumed by the symbolic interval variables in the learning set and applying this model to the lower and upper bounds of the interval values of the independent symbolic interval variables in order to predict the lower and upper bounds of the interval value of the dependent variable, respectively.
This paper introduces a Centre and Range approach to fitting a linear regression model to symbolic interval data. The probabilistic assumptions underlying linear regression theory for classical data will not be considered in the case of symbolic data (symbolic interval variables), as this remains an open research topic. Thus, the problem will be investigated as an optimisation problem, in which we seek to minimise a predefined criterion. In Table 1, we show the criteria and models that represent the three approaches presented in this paper.
Table 1. Criteria and corresponding models presented in this paper
|Criterion||Corresponding univariate model|
|$\sum_{i=1}^{n} (\varepsilon_i^c)^2$||$y_{Li} = \beta_0 + \beta_1 a_i + \varepsilon_{Li}$ and $y_{Ui} = \beta_0 + \beta_1 b_i + \varepsilon_{Ui}$|
|$\sum_{i=1}^{n} \left((\varepsilon_i^L)^2 + (\varepsilon_i^U)^2\right)$||$y_i^L = \beta_0^L + \beta_1^L a_i + \varepsilon_i^L$ and $y_i^U = \beta_0^U + \beta_1^U b_i + \varepsilon_i^U$|
|$\sum_{i=1}^{n} \left((\varepsilon_i^c)^2 + (\varepsilon_i^r)^2\right)$||$y_i^c = \beta_0^c + \beta_1^c x_i^c + \varepsilon_i^c$ and $y_i^r = \beta_0^r + \beta_1^r x_i^r + \varepsilon_i^r$|
The first method (Billard and Diday, 2000) is based on the minimisation of the mid-point error, since $(\varepsilon_{Li} + \varepsilon_{Ui})/2 = \varepsilon_i^c$. The lower and upper bounds of the dependent variable are predicted, respectively, from the lower and upper bounds of the independent variable using the same vector of parameters $\beta$. The second approach (Billard and Diday, 2002) fits two independent linear regression models on the lower and upper bounds of the intervals, respectively, and minimises $\sum_{i=1}^{n} (\varepsilon_i^L)^2 + \sum_{i=1}^{n} (\varepsilon_i^U)^2$.
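The first two approaches can be sketched for a single independent interval variable as follows. This is a minimal illustration under stated assumptions, not the authors' code: `a` and `b` are taken to hold the lower and upper bounds of the independent variable, the function names are hypothetical, and the toy data are invented so that the fitted coefficients are easy to check.

```python
import numpy as np

def _design(x):
    """Prepend an intercept column to a 1-D predictor."""
    return np.column_stack([np.ones(len(x)), x])

def _ols(x, y):
    """Ordinary least squares for y = beta_0 + beta_1 * x."""
    beta, *_ = np.linalg.lstsq(_design(x), y, rcond=None)
    return beta

def fit_centre(a, b, y_low, y_up):
    """Centre method: a single beta fitted on the interval mid-points."""
    return _ols((a + b) / 2.0, (y_low + y_up) / 2.0)

def predict_centre(beta, a, b):
    """Apply the same beta to the lower and to the upper bounds."""
    return _design(a) @ beta, _design(b) @ beta

def fit_minmax(a, b, y_low, y_up):
    """MinMax method: two independent fits, one per bound."""
    return _ols(a, y_low), _ols(b, y_up)

def predict_minmax(betas, a, b):
    beta_low, beta_up = betas
    return _design(a) @ beta_low, _design(b) @ beta_up

# Hypothetical toy intervals [a_i, b_i] and [y_low_i, y_up_i].
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 4.0, 7.0, 7.0])
y_low = 1.0 + 2.0 * a
y_up = 3.0 + 2.0 * b

beta_c = fit_centre(a, b, y_low, y_up)
centre_low, centre_up = predict_centre(beta_c, a, b)
betas_mm = fit_minmax(a, b, y_low, y_up)
minmax_low, minmax_up = predict_minmax(betas_mm, a, b)
```

On these toy data the Centre fit has zero mid-point error, yet its bound predictions are each off by one unit in opposite directions, which is consistent with the identity $(\varepsilon_{Li} + \varepsilon_{Ui})/2 = \varepsilon_i^c$; the MinMax fit recovers both bounds exactly because each bound has its own parameters.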
The third approach considers the minimisation of the sum of the squared mid-point errors plus the sum of the squared range errors, and the reconstruction of the interval bounds is based on the mid-point and range estimates.
In order to show the usefulness of these approaches, the lower and upper bounds of the interval values of an interval-valued variable that is linearly related to a set of independent interval-valued variables will be predicted for independent data sets according to each method. The assessment of the proposed methods will be based on the estimation of the average behaviour of the root mean square error and the square of the correlation coefficient in the framework of a Monte Carlo experiment.
This paper is organised as follows: Section 2.1 presents the Centre (Billard and Diday, 2000) and the MinMax (Billard and Diday, 2002) methods from an optimisation perspective. Section 2.2 presents the Centre and Range approach to fitting a linear regression model to interval-valued data. To show the usefulness of the Centre and Range approach, Section 3 describes the framework of the Monte Carlo simulations and presents experiments with synthetic and real interval-valued data sets. Finally, Section 4 gives the concluding remarks.
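The Centre and Range idea, fitting one regression on mid-points and another on ranges and then rebuilding the bounds, can be sketched for one independent variable as below. Several details are assumptions made for this illustration rather than statements from the text above: the range is taken as upper minus lower bound, the bounds are rebuilt as mid-point minus or plus half the range, and all function names and toy data are hypothetical.

```python
import numpy as np

def _design(x):
    """Prepend an intercept column to a 1-D predictor."""
    return np.column_stack([np.ones(len(x)), x])

def _ols(x, y):
    """Ordinary least squares for y = beta_0 + beta_1 * x."""
    beta, *_ = np.linalg.lstsq(_design(x), y, rcond=None)
    return beta

def fit_centre_range(a, b, y_low, y_up):
    """Fit separate regressions on mid-points and on ranges."""
    beta_c = _ols((a + b) / 2.0, (y_low + y_up) / 2.0)  # mid-point model
    beta_r = _ols(b - a, y_up - y_low)                  # range model (assumed range = upper - lower)
    return beta_c, beta_r

def predict_centre_range(betas, a, b):
    """Rebuild interval bounds from the predicted mid-point and range."""
    beta_c, beta_r = betas
    mid = _design((a + b) / 2.0) @ beta_c
    rng = _design(b - a) @ beta_r
    return mid - rng / 2.0, mid + rng / 2.0

# Hypothetical toy data generated from exact mid-point and range relations.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 4.0, 7.0, 7.0])
y_mid = 2.0 + 2.0 * (a + b) / 2.0
y_rng = 1.0 + 0.5 * (b - a)
y_low, y_up = y_mid - y_rng / 2.0, y_mid + y_rng / 2.0

betas = fit_centre_range(a, b, y_low, y_up)
pred_low, pred_up = predict_centre_range(betas, a, b)
```

Because the two fits are unconstrained least squares problems, nothing in this sketch forces the predicted range to be non-negative, so a predicted lower bound could in principle exceed the predicted upper bound on other data.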
Conclusion
This paper presented the Centre and Range method (CRM) for fitting a linear regression model to interval-valued data. The method uses the information contained in the mid-points and ranges of the intervals, based on a predefined criterion. The assessment of the proposed prediction method was based on the average behaviour of the root mean square error and the square of the correlation coefficient in the framework of a Monte Carlo simulation. The synthetic symbolic interval data sets were constructed with (and without) dependencies between the values of the mid-points and ranges of the interval values. The performance of the proposed approach was also measured on a cardiological symbolic interval data set. The comparison between the CRM and the Centre (CM) and MinMax methods demonstrated the importance of taking the range information into account in models for predicting symbolic interval data. For the synthetic symbolic interval data sets, both those with mid-points and ranges generated independently and those with a dependence relation between the simulated mid-points and ranges, the Monte Carlo simulations clearly demonstrated the superiority of the CRM over the other approaches, and the CM exhibited the worst performance. For the cardiological symbolic interval data set, the CRM had the best prediction performance, whereas the CM had the worst.
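The two assessment measures named above can be computed for predicted interval bounds roughly as follows. This is a sketch under assumptions: scoring the lower and upper bounds separately, the function names, and the toy numbers are choices made for illustration and are not taken from the text; in a Monte Carlo setting these scores would be averaged over many simulated data sets.

```python
import numpy as np

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    diff = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def squared_correlation(observed, predicted):
    """Square of the Pearson correlation coefficient."""
    return float(np.corrcoef(observed, predicted)[0, 1] ** 2)

def interval_scores(y_low, y_up, pred_low, pred_up):
    """Score lower and upper bounds separately (an assumed convention)."""
    return {
        "rmse_lower": rmse(y_low, pred_low),
        "rmse_upper": rmse(y_up, pred_up),
        "r2_lower": squared_correlation(y_low, pred_low),
        "r2_upper": squared_correlation(y_up, pred_up),
    }

# Hypothetical observed and predicted bounds for four test intervals.
y_low = np.array([1.0, 2.0, 3.0, 4.0])
y_up = np.array([2.0, 4.0, 5.0, 7.0])
pred_low = np.array([1.0, 2.0, 3.0, 5.0])
pred_up = y_up.copy()

scores = interval_scores(y_low, y_up, pred_low, pred_up)
```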