Online algorithms for computing mean and variance of interval data, and their use in intelligent systems
Article code | Publication year | English paper length
---|---|---
5531 | 2007 | 11-page PDF

Publisher : Elsevier - Science Direct
Journal : Information Sciences, Volume 177, Issue 16, 15 August 2007, Pages 3228–3238
English abstract
When we have only interval ranges $[\underline{x}_i,\overline{x}_i]$ of sample values $x_1,\ldots,x_n$, what is the interval $[\underline{V},\overline{V}]$ of possible values for the variance $V$ of these values? There are quadratic-time algorithms for computing the exact lower bound $\underline{V}$ on the variance of interval data, and for computing $\overline{V}$ under reasonable, easily verifiable conditions. The problem is that in real life, we often make additional measurements. In traditional statistics, if we have a new measurement result, we can modify the value of the variance in constant time. In contrast, previously known algorithms for processing interval data required that, once a new data point is added, we start from the very beginning. In this paper, we describe new algorithms for statistical processing of interval data, algorithms in which adding a data point requires only O(n) computational steps.
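To make the contrast concrete, here is a minimal sketch (not the paper's algorithm) of the classical constant-time online update mentioned above: for ordinary point-valued data, keeping running sums of $x_i$ and $x_i^2$ lets us update the mean and the population variance in O(1) per new measurement. The class name and sample values are purely illustrative.

```python
# Minimal sketch (not the paper's algorithm): classical online update of
# mean and population variance in O(1) per new point, via running sums.
# (A numerically careful version would use Welford's update instead.)
class RunningStats:
    """Keeps n, sum of x, and sum of x^2; each update is O(1)."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_x2 = 0.0

    def add(self, x: float) -> None:
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x

    @property
    def mean(self) -> float:
        return self.sum_x / self.n

    @property
    def variance(self) -> float:
        # population variance V = (1/n) * sum (x_i - mean)^2
        return self.sum_x2 / self.n - self.mean ** 2


stats = RunningStats()
for x in [0.1, 0.25, 0.3]:   # made-up sample values
    stats.add(x)
print(stats.mean, stats.variance)
```

For interval-valued inputs, no analogous constant-time update was previously available; the algorithms described in the paper bring the per-update cost for the bounds $\underline{V}$ and $\overline{V}$ down to O(n).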
English introduction
1.1. Let us start with a big picture

Before we describe the specific problem that we solve in this paper, let us first describe how, in our view, this problem fits into the big picture of information processing in intelligent systems. Readers who are familiar with this big picture and/or who are only interested in our technical results can skip this subsection.

One of the main specific features of information processing in intelligent systems is that in such systems, we often have very limited knowledge. As a result, processing of imprecise information is necessary in intelligent systems. A typical example is the processing of linguistic information, i.e., information represented by experts in terms of words from a natural language. This information can be modeled, e.g., by fuzzy sets (see, e.g., [16] and [25]). For such modeling, when an expert states that a value is, say, small but not very small, we describe this expert information in terms of an appropriate fuzzy set.

A particular case of such a statement is when an expert states that the actual value is between, say, 0.1 and 0.3. After such a statement, the only information about the actual (unknown) value of the desired quantity is that it belongs to the interval [0.1, 0.3] – and each interval (and, more generally, each set) can be viewed as a particular example of the more general concept of a fuzzy set. Since the knowledge about each quantity is represented in such a form, it is necessary to develop inference procedures for such observations. Mathematical analysis of this problem is therefore crucial for designing intelligent systems.

In this paper, we analyze an important particular case of such set-valued data: specifically, we investigate the computational aspects of processing interval-valued data. Let us now describe our problem and its motivation in more detail.

1.2. Why data processing?

In intelligent systems, there are at least two sources of information about physical quantities: measurements and expert estimates. In many real-life situations, we are interested in the value of a physical quantity y that is difficult or impossible to measure directly and difficult for experts to estimate. Examples of such quantities are the distance to a star and the amount of oil in a given well. Since we cannot measure or estimate the value y of the desired physical quantity directly, a natural idea is to measure or estimate y indirectly: we find some easier-to-measure or easier-to-estimate quantities $x_1,\ldots,x_n$ which are related to y by a known relation $y = f(x_1,\ldots,x_n)$. For example, to find the resistance R, we measure or estimate the current I and the voltage V, and then use the known relation R = V/I to estimate the resistance as $\widetilde{R}=\widetilde{V}/\widetilde{I}$. This relation may be a simple functional transformation, or a complex algorithm (e.g., for the amount of oil, a numerical solution to an inverse problem). It is worth mentioning that in the vast majority of these cases, the function $f(x_1,\ldots,x_n)$ that describes the dependence between physical quantities is continuous.

In such cases, to estimate y, we first measure or estimate the values of the quantities $x_1,\ldots,x_n$, and then we use the results $\widetilde{x}_1,\ldots,\widetilde{x}_n$ of these measurements or estimates to compute an estimate $\widetilde{y}$ for y as $\widetilde{y}=f(\widetilde{x}_1,\ldots,\widetilde{x}_n)$.
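As a deliberately simple illustration of this indirect-estimation step, the following sketch applies the known relation R = V/I to measured values; the function name and numeric values are made up for illustration.

```python
# Illustrative sketch: indirect estimation y~ = f(x~_1, ..., x~_n),
# here with f being the known relation R = V / I. Values are made up.
def estimate_resistance(v_measured: float, i_measured: float) -> float:
    """Estimate the resistance R from measured voltage V and current I."""
    return v_measured / i_measured


v_tilde, i_tilde = 5.0, 0.2        # measurement results (volts, amperes)
r_tilde = estimate_resistance(v_tilde, i_tilde)
print(r_tilde)                      # 25.0 ohms
```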
Comment. In this paper, for simplicity, we consider the case when the relation between the $x_i$ and y is known exactly; in practical situations, we often only know an approximate relation between the $x_i$ and y.

1.3. Why interval computations? From probabilities to intervals

Neither measurements nor estimates are 100% accurate, so in reality, the actual value $x_i$ of the i-th quantity can differ from the result $\widetilde{x}_i$ obtained by measurement or by estimation. Because of these measurement (estimation) errors $\Delta x_i \stackrel{\text{def}}{=} \widetilde{x}_i - x_i$, the result $\widetilde{y}=f(\widetilde{x}_1,\ldots,\widetilde{x}_n)$ of data processing is, in general, different from the actual value $y = f(x_1,\ldots,x_n)$ of the desired quantity y [29]. It is desirable to describe the error $\Delta y \stackrel{\text{def}}{=} \widetilde{y} - y$ of the result of data processing. To do that, we must have some information about the errors of direct measurements and/or estimates.

What do we know about the errors $\Delta x_i$ related to expert estimation? Often, an expert can provide bounds $\underline{x}_i$ and $\overline{x}_i$ for the estimated quantity $x_i$. Then, the actual (unknown) value of $x_i$ belongs to the interval $\mathbf{x}_i=[\underline{x}_i,\overline{x}_i]$. Often, these bounds come in the form of an unsigned error estimate $\Delta_i$ of the expert's estimation accuracy: for example, an expert may say that the actual fish population in a lake is 50,000 ± 20,000. In this case, $\widetilde{x}_i = 50{,}000$ and $\Delta_i = 20{,}000$, so $\underline{x}_i=\widetilde{x}_i-\Delta_i$ and $\overline{x}_i=\widetilde{x}_i+\Delta_i$.

Comment. For readers who may be interested in how the above description is related to fuzzy sets, here is an explanation. Often, in addition to (or instead of) the bounds, an expert can provide bounds that contain $x_i$ with a certain degree of confidence (not necessarily represented by a probability). Often, we know several such bounding intervals corresponding to different degrees of confidence. Such a nested family of intervals is also called a fuzzy set, because it turns out to be equivalent to the more traditional definition of a fuzzy set [6], [16], [23], [24] and [25] (if a traditional fuzzy set is given, then different intervals from the nested family can be viewed as α-cuts corresponding to different levels of uncertainty α).

What do we know about the errors $\Delta x_i$ of direct measurements? First, the manufacturer of the measuring instrument must supply us with an upper bound $\Delta_i$ on the measurement error. If no such upper bound is supplied, no accuracy is guaranteed, and the corresponding "measuring instrument" is practically useless. When such a bound is supplied and we perform a measurement and get a measurement result $\widetilde{x}_i$, we know that the actual (unknown) value $x_i$ of the measured quantity belongs to the interval $\mathbf{x}_i=[\underline{x}_i,\overline{x}_i]$, where $\underline{x}_i=\widetilde{x}_i-\Delta_i$ and $\overline{x}_i=\widetilde{x}_i+\Delta_i$.

In many practical situations, we not only know the interval $[-\Delta_i, \Delta_i]$ of possible values of the measurement or estimation error; we also know the probabilities of different values $\Delta x_i$ within this interval. This knowledge underlies the traditional engineering approach to estimating the error of indirect measurement, in which we assume that we know the probability distributions for the measurement errors $\Delta x_i$.
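A minimal sketch of this step, with made-up numbers: a measurement result $\widetilde{x}$ together with a guaranteed error bound $\Delta$ yields the interval of possible actual values $[\widetilde{x}-\Delta,\widetilde{x}+\Delta]$.

```python
# Minimal sketch (made-up numbers): a measurement result x~ with a guaranteed
# error bound Delta yields the interval of possible actual values
# [x~ - Delta, x~ + Delta].
from typing import List, Tuple

def to_interval(x_tilde: float, delta: float) -> Tuple[float, float]:
    return (x_tilde - delta, x_tilde + delta)

# The fish-population example from the text: 50,000 +/- 20,000.
print(to_interval(50_000, 20_000))   # (30000, 70000)

# A sample of such measurements becomes a list of intervals [x_i_lower, x_i_upper].
readings = [(1.0, 0.1), (1.2, 0.05), (0.9, 0.2)]   # (x~_i, Delta_i) pairs, made up
intervals: List[Tuple[float, float]] = [to_interval(x, d) for x, d in readings]
```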
In practice, we can determine the desired probabilities of different values of $\Delta x_i$ by comparing the results of measuring with this instrument (or the results of expert estimation) with the results of measuring the same quantity by a standard (much more accurate) measuring instrument. Since the standard measuring instrument is much more accurate than the one used, the difference between these two measurement results is practically equal to the measurement error; thus, the empirical distribution of this difference is close to the desired probability distribution of the measurement error. There are two cases, however, when this determination is not done:

• First is the case of cutting-edge measurements, e.g., measurements in fundamental science. When the Hubble telescope detects the light from a distant galaxy, there is no "standard" (much more accurate) telescope floating nearby that we can use to calibrate the Hubble: the Hubble telescope is the best we have.

• Second is the case of many commercial measuring instruments. In this case, in principle, every sensor can be thoroughly calibrated, but sensor calibration is so costly – usually costing 10 times more than the sensor itself – that manufacturers rarely do it.

In both cases, we have no information about the probabilities of the $\Delta x_i$; the only information we have is the upper bound on the measurement or estimation error. Therefore, after we perform a measurement and get a measurement result $\widetilde{x}_i$, the only information that we have about the actual value $x_i$ of the measured quantity is that it belongs to the interval $\mathbf{x}_i=[\widetilde{x}_i-\Delta_i,\widetilde{x}_i+\Delta_i]$.

In such situations, the only information that we have about the actual value of $y = f(x_1,\ldots,x_n)$ is that y belongs to the range $\mathbf{y}=[\underline{y},\overline{y}]$ of the function f over the box $\mathbf{x}_1 \times \cdots \times \mathbf{x}_n$:

$$\mathbf{y}=[\underline{y},\overline{y}]=\{f(x_1,\ldots,x_n) \mid x_1\in\mathbf{x}_1,\ldots,x_n\in\mathbf{x}_n\}.$$

For continuous functions $f(x_1,\ldots,x_n)$, this range is an interval. The process of computing this interval range based on the input intervals $\mathbf{x}_i$ is called interval computation; see, e.g., [13], [14], [15] and [22].

Comment. When, instead of a single interval, we have several intervals corresponding to different levels of confidence, we must perform interval computations on each level [6], [16], [23], [24] and [25].

1.4. Interval computations techniques: brief reminder

Historically, the first method for computing an enclosure for the range is the method which is sometimes called "straightforward" interval computations. This method is based on the fact that, inside the computer, every algorithm consists of elementary operations (arithmetic operations, min, max, etc.). For each elementary operation f(a, b), if we know the intervals $\mathbf{a}$ and $\mathbf{b}$ for a and b, we can compute the exact range $f(\mathbf{a},\mathbf{b})$. The corresponding formulas form the so-called interval arithmetic. For example,

$$[\underline{a},\overline{a}]+[\underline{b},\overline{b}]=[\underline{a}+\underline{b},\overline{a}+\overline{b}];\qquad [\underline{a},\overline{a}]-[\underline{b},\overline{b}]=[\underline{a}-\overline{b},\overline{a}-\underline{b}],$$

$$[\underline{a},\overline{a}]\cdot[\underline{b},\overline{b}]=[\min(\underline{a}\cdot\underline{b},\underline{a}\cdot\overline{b},\overline{a}\cdot\underline{b},\overline{a}\cdot\overline{b}),\ \max(\underline{a}\cdot\underline{b},\underline{a}\cdot\overline{b},\overline{a}\cdot\underline{b},\overline{a}\cdot\overline{b})].$$

In straightforward interval computations, we repeat the computations forming the expression for f (or, more generally, a program for computing f) step by step, replacing each operation on real numbers by the corresponding operation on intervals. It is known that, as a result, we get an enclosure $\mathbf{Y} \supseteq \mathbf{y}$ for the desired range. In some cases, this enclosure is exact.
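The interval-arithmetic formulas above, and the straightforward step-by-step evaluation scheme, can be sketched in a few lines. This is a minimal illustration with made-up interval values, not a full interval library and not the paper's code.

```python
# Minimal sketch of the interval-arithmetic formulas quoted above
# (addition, subtraction, multiplication on intervals [lo, hi]).
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other: "Interval") -> "Interval":
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other: "Interval") -> "Interval":
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))


# Straightforward interval computations: evaluate f(a, b) = a * b + a
# step by step, replacing each real operation with its interval version.
a = Interval(1.0, 2.0)      # made-up input intervals
b = Interval(-1.0, 3.0)
enclosure = a * b + a        # guaranteed to contain the true range of f
print(enclosure)             # Interval(lo=-1.0, hi=8.0)
```

Note the excess width already visible here: since a·b + a = a·(b+1) with a in [1, 2] and b+1 in [0, 4], the true range is [0, 8], while the step-by-step enclosure is [-1, 8]; this is the kind of overestimation discussed next.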
In more complex cases (see examples below), the enclosure has excess width. There exist more sophisticated techniques for producing a narrower enclosure, e.g., the centered form method. However, for each of these techniques, there are cases when we get excess width. The reason for such excess width is that, as shown in [18] and [32], the problem of computing the exact range is NP-hard even for polynomial functions $f(x_1,\ldots,x_n)$ (actually, even for quadratic functions f).

In this paper, we analyze a specific class of interval computations problems – when the algorithm $f(x_1,\ldots,x_n)$ is one of the traditional statistical data processing algorithms, such as computing the mean or the variance of the population sample $x_1,\ldots,x_n$.

1.5. From the statistical viewpoint, this problem is a particular case of robust statistics

Interval uncertainty means that we do not know the exact probability distribution for the measurement or estimation error; instead, we only know that this distribution belongs to a known collection of distributions – namely, to the collection of all probability distributions that are non-zero only on the given interval. Situations when we only know a collection of distributions are described by robust statistics (see, e.g., [12]), and our problem of estimating the population variance is in line with the problems traditionally solved by robust statistics: many known algorithms in the area of robust statistics also return a guaranteed robust estimate for the population mean and population variance, which holds for a collection of distributions.

One may expect that these problems have already been solved in robust statistics. However, while robust statistics does have a lot of useful and interesting results about guaranteed bounds on the mean for many classes of distributions, the problem of how to actually compute guaranteed bounds on the population variance has not (as far as we have been able to determine) yet been solved.

Comment. In this paper, we solve a very specific problem related to a combination of interval and probabilistic uncertainty. For a more general context and for other practical problems related to such a combination, see, e.g., [2], [3], [4], [5], [7], [9], [10], [17], [19], [20], [21], [23], [28], [30], [31], [33] and [34] and the references therein.
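To make the problem of bounding the population variance concrete, here is a naive, illustrative sketch (definitely not the paper's algorithm): it samples points inside the box $\mathbf{x}_1\times\cdots\times\mathbf{x}_n$ and evaluates the population variance at each, which yields only an inner approximation of the range $[\underline{V},\overline{V}]$, with no guarantee of covering the exact bounds; the interval data are made up.

```python
# Naive illustration (not the paper's algorithm): sample points inside the box
# x_1 x ... x x_n and evaluate the population variance, to get an *inner*
# approximation of the range [V_lower, V_upper] of possible variance values.
import itertools
import random

def population_variance(xs):
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n

intervals = [(0.9, 1.1), (1.4, 1.6), (0.4, 0.8)]   # made-up interval data

samples = []
# all endpoint combinations (corners of the box) ...
for corner in itertools.product(*intervals):
    samples.append(population_variance(corner))
# ... plus random points inside the box
for _ in range(10_000):
    point = [random.uniform(lo, hi) for lo, hi in intervals]
    samples.append(population_variance(point))

print(min(samples), max(samples))   # inner approximation of [V_lower, V_upper]
```

The paper's contribution, by contrast, is exact bounds (under the stated conditions) that can be maintained with only O(n) work per newly added interval.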