Maximum entropy and least squares error minimizing procedures for estimating missing conditional probabilities in Bayesian networks
| Article code | Publication year | English article length |
|---|---|---|
| 28623 | 2008 | 20-page PDF |
Publisher: Elsevier - Science Direct (ScienceDirect)
Journal: Computational Statistics & Data Analysis, Volume 52, Issue 7, 15 March 2008, Pages 3583–3602
English Abstract
Conditional probability tables (CPTs) in Bayesian networks often contain missing values. The problem is very common and occurs when scenarios that are observed in the real world are absent from the training data. Current approaches to the problem are restrictive in that they assume particular probability distributions for estimating the missing values. Recently, maximum entropy (ME) approaches have been used to learn features of probability distribution functions from observed data. ME approaches do not require any distributional assumptions and have been shown to work well for several non-parametric distributions. Both the ME and least squares (LS) error-minimizing approaches can be used to estimate missing CPT values in Bayesian networks. Applying them, however, requires researchers to solve difficult constrained non-linear optimization problems, which can be solved using genetic algorithms.
English Introduction
Missing conditional probability values in conditional probability tables (CPTs) are a very common problem in Bayesian network design (Pearl, 1988). The reasons for missing values include the high cost of data collection, data punching errors, longitudinal high-dimensional data (Formann, 2007), and unavailable data (Demirtas et al., 2007). The current literature on missing values in CPTs can be divided into two streams: the first concerns missing values in CPTs due to incomplete data, and the second concerns missing values in CPTs with complete data.

Incomplete data sets contain some unreported entries, which makes conditional probability estimation a difficult task. Earlier approaches to handling missing values included the Monte Carlo simulation approach suggested by Henrion (1987) and described in Pearl (1988), and other approaches summarized in Heckerman (1997). All of the approaches suggested by Heckerman (1997) are either computationally intractable or require knowledge of the expectation function or the probability density/distribution function. Recent approaches to estimating conditional probabilities from incomplete data sets include the expectation maximization procedure, Gibbs sampling, the Bound and Collapse approach (Ramoni and Sebastiani, 1998), and the robust Bayesian estimator (RBE) (Ramoni and Sebastiani, 2001). Some of these methods assume that missing data follow a certain pattern. The RBE approach, however, does not assume any particular missing data pattern and has polynomial computational complexity (Ramoni and Sebastiani, 2001). Two approaches, expectation maximization and Gibbs sampling, are known to get trapped in local minima and are not guaranteed to converge (Ramoni and Sebastiani, 1999). Since the RBE is a relatively new approach, no information on its convergence is available.

Missing conditional probabilities can also occur in complete data sets. This is likely to happen when a data set is small. For example, assume a data set containing three variables $A$, $B$, and $C$, each taking three discrete values. There are $3^3 = 27$ unique combinations of their values. Assuming that all unique combinations can legitimately occur in the real world, a complete data set of size 25 will contain a minimum of two missing combinations. While the data set might appear complete, when a Bayesian network is constructed from it, the missing combinations will lead to missing values in its conditional probability tables. Since the data set is complete, approaches used to estimate missing conditional probabilities from incomplete data sets cannot be used here. In the Bayesian network literature, this problem is sometimes called "filling in" missing conditional probabilities (Paris, 2005) or estimating conditional probabilities with incomplete information (Holmes et al., 1998). The problem of estimating missing conditional probabilities in a complete data set is known to be computationally complex, with multiple solutions (Paris, 2005). The solutions proposed for this problem assume a unique solution (Rhodes and Garside, 1996) or assume a uniform distribution for the unknown conditional probabilities (Paris, 2005). A procedure proposed by Paris (2005), the center of mass inference (CMI) procedure, provides a fast, computationally efficient, but trivial solution.
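The counting argument above is easy to verify directly. The following minimal sketch (the variable names and the sampled data are hypothetical, for illustration only) enumerates the $3^3 = 27$ combinations, draws a "complete" 25-row data set, and reports the combinations that never appear, i.e., the CPT entries that would be left undefined:

```python
# Sketch of the three-variable example: 3**3 = 27 combinations exist,
# so any 25-row data set must omit at least two of them.
import itertools
import random

combos = list(itertools.product(range(3), repeat=3))  # all (A, B, C) combinations

random.seed(0)
data = [random.choice(combos) for _ in range(25)]     # a "complete" 25-row data set

missing = set(combos) - set(data)
print(f"{len(combos)} combinations, {len(missing)} never observed")
# Each missing (a, b, c) leaves P(C = c | A = a, B = b) with no data to estimate it.
```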
All the procedures proposed to estimate missing conditional probabilities from complete data sets suffer from two drawbacks: (1) they assume a unique solution, and (2) they use a local search procedure to obtain that solution. In addition to these two common limitations, the procedures suffer from individual limitations. For example, the CMI procedure focuses only on estimating the missing conditional probabilities and does not consider the values of the known conditional probabilities. Additionally, the CMI procedure does not satisfy language invariance (Paris, 2005). The CMI procedure is, however, the most computationally efficient procedure.

In this paper, we focus on the problem of estimating missing conditional probabilities from a complete data set. We pose the problem as a set of non-linear mathematical programming problems with different objective functions. We show that these mathematical programming problems have multiple solutions. Unlike previous studies on estimating missing conditional probabilities from complete data sets, we use a global search genetic algorithm (GA) procedure to search for the best possible solution among the multiple solutions. GAs have gained popularity in the computational statistics literature, and recent research illustrates several successful applications of GAs to bank rating (Krink et al., 2007), variable selection in regression models (Kapetanios, 2007), and maximum likelihood estimation for the threshold vector error correction model (Yang et al., 2007).

We now give a general description of estimating the missing values in a CPT for a complete data set. Assume that $p = (p_1, \ldots, p_n)$ is the vector of actual conditional probabilities for a Bayesian network, where not all components of $p$ are known. Further, assume that a technique estimates the missing values of the CPT and provides a vector $p^* = (p_1^*, \ldots, p_n^*)$, an approximation of $p$ in which all components are known. Theoretically, the magnitude of the approximation error is given by $e = \sum_{i=1}^{n} |p_i - p_i^*|$. Since not all components of $p$ are known, the error $e$ is computed only over the known components. In our research, a technique that estimates all values of $p^* = (p_1^*, \ldots, p_n^*)$, minimizes $e$ over the known values of $p$, and satisfies all the laws of probability is considered a good estimator of $p$. Designing a technique that best minimizes $e$ requires certain assumptions about the distribution of the errors. If the errors are assumed to be normally, independently, and identically distributed, then the least squares (LS) technique may be the best estimator. However, if the error distribution has fatter tails than the normal distribution, the LS estimator will be inefficient (Wu and Stengos, 2005) and adaptive estimators are needed to establish the best estimator. Maximum entropy (ME) estimators are a set of general-purpose, partially adaptive, quasi-maximum likelihood estimators that nest most commonly used distributions (Wu and Stengos, 2005), including the normal and t distributions. In a recent study, Wu and Stengos (2005) found that ME estimators show a considerable degree of adaptiveness to different shapes of error distributions and compare well with other methods.
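To make the error measure concrete, here is a minimal sketch of $e$ computed only over the known components of $p$; the example vectors and the use of `None` to mark missing entries are illustrative assumptions, not the paper's notation:

```python
# e = sum over the known components i of |p_i - p_i*|.
def approximation_error(p, p_star):
    # Entries of p marked None are missing and are skipped.
    return sum(abs(pi, ) if False else abs(pi - psi)
               for pi, psi in zip(p, p_star) if pi is not None)

p      = [0.2, None, 0.5, None, 0.3]      # actual CPT entries; two are unknown
p_star = [0.25, 0.10, 0.45, 0.15, 0.30]   # a technique's completed estimate
print(approximation_error(p, p_star))     # |0.2-0.25| + |0.5-0.45| + |0.3-0.3| = 0.10
```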
In this paper, we propose two approaches to estimating the vector $p^*$. The approaches differ primarily in their assumptions about the distribution of the error $e$: the first uses least squares minimization of $e$, and the second uses the ME method. Both the LS and the ME approaches require the solution of a difficult constrained optimization problem. We use a GA-based solution procedure to solve these hard constrained optimization problems (see the sketch below). Our GA approach does not assume prior knowledge of expectation functions or probability distributions, and it considers the values of the known conditional probabilities while computing the values of the unknown conditional probabilities. The computational efficiency of our GA procedure is $O(\vartheta \log \vartheta)$.

The rest of the paper is organized as follows. In Section 2, we describe the principle of ME and relative entropy. In Section 3, we describe the ME and LS problem formulations for solving a problem of missing values in the CPT of a Bayesian network application from the software engineering literature. In Section 4, we propose a GA-based procedure for computing missing values in a CPT. In Section 5, we apply the GA procedure to compute missing probabilities for a software engineering problem from the Pendharkar et al. (2005) study. In Section 6, we provide theoretical and experimental results on the computational efficiency of our GA procedure. Finally, in Section 7, we conclude the paper with a summary.
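As a rough illustration only, and not the authors' formulation, the sketch below applies a GA of the kind described above to a toy constrained problem: two missing entries in a CPT row must be estimated so that the row still sums to one. The stand-in least squares objective, the penalty weight, and all GA parameters are assumptions made for this example:

```python
# Toy GA for filling two missing CPT entries under the row-sum constraint.
import random

KNOWN = [0.4, 0.1]                     # known conditional probabilities in one CPT row
N_MISSING = 2                          # entries to estimate
TARGET = 1.0 - sum(KNOWN)              # the missing entries must sum to this value

def fitness(cand):
    # Stand-in LS objective (distance from an equal split of TARGET) plus a
    # penalty term enforcing the law of total probability for the row.
    ls = sum((x - TARGET / N_MISSING) ** 2 for x in cand)
    penalty = 100.0 * abs(sum(cand) - TARGET)
    return ls + penalty

def mutate(cand, scale=0.05):
    # Perturb each gene, clipping to the legal probability range [0, 1].
    return [min(1.0, max(0.0, x + random.uniform(-scale, scale))) for x in cand]

def crossover(a, b):
    # Uniform crossover: each gene is taken from one of the two parents.
    return [random.choice(pair) for pair in zip(a, b)]

random.seed(1)
population = [[random.uniform(0.0, TARGET) for _ in range(N_MISSING)] for _ in range(50)]
for _ in range(200):
    population.sort(key=fitness)
    elite = population[:10]            # elitist selection of the fittest candidates
    population = elite + [mutate(crossover(random.choice(elite), random.choice(elite)))
                          for _ in range(40)]

best = min(population, key=fitness)
print([round(x, 3) for x in best], "row sum:", round(sum(KNOWN) + sum(best), 3))
```

The penalty approach used here is just one common way to handle an equality constraint inside a GA; the paper's own formulations minimize the LS or ME criterion over the known conditional probabilities.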