روش شبکه های بیزی برای یادگیری پارامتر چندگانه با استفاده از داده ها و قضاوت کارشناسی
کد مقاله | سال انتشار | تعداد صفحات مقاله انگلیسی |
---|---|---|
29306 | 2014 | 17 صفحه PDF |
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : International Journal of Approximate Reasoning, Volume 55, Issue 5, July 2014, Pages 1252–1268
چکیده انگلیسی
One of the hardest challenges in building a realistic Bayesian Network (BN) model is to construct the node probability tables (NPTs). Even with a fixed predefined model structure and very large amounts of relevant data, machine learning methods do not consistently achieve great accuracy compared to the ground truth when learning the NPT entries (parameters). Hence, it is widely believed that incorporating expert judgments can improve the learning process. We present a multinomial parameter learning method, which can easily incorporate both expert judgments and data during the parameter learning process. This method uses an auxiliary BN model to learn the parameters of a given BN. The auxiliary BN contains continuous variables and the parameter estimation amounts to updating these variables using an iterative discretization technique. The expert judgments are provided in the form of constraints on parameters divided into two categories: linear inequality constraints and approximate equality constraints. The method is evaluated with experiments based on a number of well-known sample BN models (such as Asia, Alarm and Hailfinder) as well as a real-world software defects prediction BN model. Empirically, the new method achieves much greater learning accuracy (compared to both state-of-the-art machine learning techniques and directly competing methods) with much less data. For example, in the software defects BN for a sample size of 20 (which would be considered difficult to collect in practice) when a small number of real expert constraints are provided, our method achieves a level of accuracy in parameter estimation that can only be matched by other methods with much larger sample sizes (320 samples required for the standard machine learning method, and 105 for the directly competing method with constraints).
مقدمه انگلیسی
Bayesian Networks (BNs) [1] and [2] are the result of a marriage between graph theory and probability theory, which enable us to model probabilistic and causal relationships for many types of decision-support problems. A BN consists of a directed acyclic graph (DAG) that represents the dependencies among related nodes (variables), together with a set of local probability distributions attached to each node (called a node probability table – NPT – in this paper) that quantify the strengths of these dependencies. BNs have been successfully applied to many real-world problems [3]. However, building realistic and accurate BNs (which means building both the DAG and all the NPTs) remains a major challenge. For the purpose of this paper, we assume the DAG model is already determined, and we focus purely on the challenge of building accurate NPTs. In the absence of any relevant data NPTs have to be constructed from expert judgment alone. Research on this method focuses on the questions of design, bias elimination, judgments elicitation, judgments fusion, etc. (see [4] and [5] for more details). At the other extreme NPTs can be constructed from data alone, whereby a raw dataset is provided in advance, and statistical based approaches are applied to automatically learn each NPT entry. In this paper we focus on learning NPTs for nodes with a finite set of discrete states. For a node with riri states and no parents, its NPT is a single column whose riri cells corresponding to the prior probabilities of the riri states. Hence, each NPT entry can be viewed as a parameter representing a probability value of a discrete distribution. For a node with parents, the NPT will have qiqi columns corresponding to each of the qiqi instantiations of the parent node states. Hence, such an NPT will have qiqi different riri-value parameter probability distributions to define or learn. Given sufficient data, these parameters can be learnt, for example using the relative frequencies of the observations [6]. However, many real-world applications have very limited relevant data samples, and in these situations the performance of pure data-driven methods is poor [7]; indeed pure data-driven methods can result in poor results even when there are large datasets [8]. In such situations incorporating expert judgment improves the learning accuracy [9] and [10]. It is the combination of (limited) data and expert judgment that we focus on in this paper. A key problem is that it is known to be difficult to get experts with domain knowledge to provide explicit (and accurate) probability values. Recent research has shown that experts feel more comfortable providing qualitative judgments and that these are more robust than their numerical assessments [11] and [12]. In particular, parameter constraints provided by experts can be integrated with existing data samples to improve the learning accuracy. Niculescu [13] and de Campos [14] introduced a constrained convex optimization formulation to tackle this problem. Liao [15] regarded the constraints as penalty functions, and applied the gradient-descent algorithm to search the optimal solution. Chang [16] and [17] employed constraints and Monte Carlo sampling technology to reconstruct the hyperparameters of Dirichlet priors. Corani [18] proposed the learning method for Credal networks, which encodes range constraints of parameters. Khan [19] developed an augmented Bayesian network to refine a bipartite diagnostic BN with constraints elicited from expert's diagnostic sequence. However, Khan's method is restricted to special types of BNs (two-level diagnostic BNs). Most of these methods are based on seeking the global maximum estimation over reduced search spaces. A major difference between the approach we propose in this paper and previous work is in the way to integrate constraints. We incorporate constraints in a separate, auxiliary BN, which is based on the multinomial parameter learning (MPL) model. Our method can easily make use of both the data samples and extended forms of expert judgment in any target BN; unlike Khan's method, our method is applicable to any BN. For demonstration and validation purposes, our experiments (in Section 4) are based on a number of well-known and widely available BN models such as Asia, Alarm and Hailfinder, together with a real-world software defects prediction model. To illustrate the core idea of our method, consider the simple example of a BN node (without parents) VA (“Visit to Asia?”) in Fig. 1. This node has four states, namely “Never”, “Once”, “Twice” and “More than twice” 1 and hence its NPT can be regarded as having four associated parameters P1P1, P2P2, P3P3 and P4P4, where each is a probability value of the probability distribution of node VA . Whereas an expert may find it too difficult to provide exact prior probabilities for these parameters (for a person entering a chest clinic) they may well be able to provide constraints such as: “P1>P2P1>P2”, “P2>P3P2>P3” and “P3≈P4P3≈P4”. These constraints look simple, but are very important for parameter learning with small data samples. Full-size image (20 K) Fig. 1. The overview of parameter learning with constraints and data. The constraints are: C1: P1>P2P1>P2, C2: P2>P3P2>P3 and C3: P3≈P4P3≈P4. The gray color nodes represent part of the MPL model. Figure options Fig. 1 gives an overview of how our method estimates the four parameters with data and constraints (technically we only need to estimate 3 of the parameters since the 4 parameters sum to 1). Firstly, for the NPT column of the target node (dashed callout in Fig. 1), our method generates its auxiliary BN, where each parameter is modeled as a separate continuous node (on scale 0 to 1) and each constraint is modeled as a binary node. The other nodes correspond to data observations for the parameters – the details, along with how to build the auxiliary BN model – are provided in Section 3. It turns out that previous state-of-the-art BN techniques are unable to provide accurate inference for the type of BN model that the auxiliary BN is. Hence, we describe a novel inference method and its implementation. With this method, after entering evidence, the auxiliary BN can be updated to produce the posterior distributions for the parameters. Finally, the respective means of these posterior distributions are assigned to entries in the NPT. A feature of our approach is that experts are free to provide arbitrary constraints on as few or as many parameters as they feel confident of. Our results show that we get greatly improved learning accuracy even with very few expert constraints provided. The rest of this paper is organized as follows: Section 2 discusses the background of parameter learning for BNs, and introduces related work of learning with constraints; Section 3 presents our model, judgment categories and inference algorithm; Sections 4 describes the experimental results and analysis; Section 5 provides conclusions as well as identifying areas for improvement.
نتیجه گیری انگلیسی
Purely data driven techniques (such as MLE and MAP) for learning the NPTs in BNs often provide inaccurate results even when the datasets are very large. That is why it is widely accepted that expert judgment should be used whenever it is available in order to supplement and improve the learning accuracy. However, reliable expert judgment is not only difficult to elicit, but also difficult to incorporate with existing data. We have described an automated method that addresses both of these concerns. It focuses on constraints that are easily described by experts and are incorporated into an extended version of a multinomial parameter learning model. This model is an auxiliary BN associated with each node whose NPT we wish to learn. Because the auxiliary BN is a hybrid model, meaning it contains a mixture of discrete and (non-Normally distributed) continuous nodes, previous state-of-the-art BN inference algorithms and tools do not provide the necessary mechanism for accurate inference given data in the presence of the constraints in the model. Hence, we have developed a novel inference procedure (and implementation) which exploits recent work on dynamic discretization algorithms for hybrid BNs. The resulting method, which we called MPL-C (multinomial parameter learning with constraints) was implemented (using the API of the BN software tool AgenaRisk) and evaluated experimentally against both pure data learning methods (namely MLE and MAP) as well as against the only relevant competing method that incorporates expert judgments with data (namely CO). We have conducted experiments on six standard BN models of varying size and complexity that come from a BN repository whose models are widely used for evaluating learning techniques. In addition, we conducted experiments on a real-world BN model that has been used by many technology companies worldwide (for predicting software defects). In all cases our objective was to compare the learnt NPTs and their predictions against the ‘ground truth’ model (so we were assuming the NPTs as defined in the repository and in relevant papers constituted the ground truth) using the K–L measure. In all of the experiments we generated sample datasets – of different sizes – based on the ‘ground truth’ model NPTs, and in all of the experiments we considered the impact of adding expert constraints. In one of the standard models (Asia) and in the software defects model we elicited a small number of constraints from real experts. In the other standard models we generated random synthetic constraints. The experiments demonstrate that, whereas the CO method clearly improves performance compared with conventional MLE and MAP algorithms when enough constraints are added, our MPL-C method achieves the best learning results in almost all experiment settings. MPL-C is especially accurate in comparison to the other methods in situations where the datasets are relatively small. Indeed, MPL-C needs much smaller datasets to achieve accurate learning results. Moreover, even a very small number of expert constraints dramatically improves accuracy under MPL-C. The practical implications of these results are very important: in most real-world situations (such as that in which the software defects prediction BN is used) the availability of relevant data is extremely limited; even when it is possible to get relevant data it may be too expensive or time consuming to do so. Our results show that in these very common situations a small dataset, together with even a small number of expert judgment constraints, can result in accurate models using MPL-C. Indeed the accuracy can be much greater than what is achievable with a very large dataset and no expert constraints. Hence, the experimental results suggest that our MPL-C method may represent a significant step forward in parameter learning of BNs in the very common situation where there is sparse data but a small amount of expert judgment. While the experimental results are extremely promising, there are obvious areas for improvement and future work. We highlight four of these: Making better use of the learnt distributions. The MPL-C model actually learns the full posterior distribution for each parameter of interest. However, because the parameter corresponds to a probability value in a discrete NPT of the original BN, we currently simply take the mean of the posterior distribution as the NPT parameter value. Hence, we are ‘throwing away’ all the other information in the learnt distributions (such as the variance and percentiles). To avoid this loss of information we could, in principle, add an associated continuous node to the discrete node in the original BN. However, there are major practical and computational challenges involved in doing this. Supporting broader types of constraints. In this paper we have considered only constraints that affect a single column of NPT values. However, experts may be able to provide constraints between parameters in different columns (for example: in the Asia BN, an expert could assert that the probability a patient has lung cancer given they smoke is greater than the probability given they do not smoke). They may even be able to provide constraints between parameters in different NPTs (for example, an expert may assert the probability a person is a smoker is much greater than the probability they visited Asia). Given the remarkable boost in accuracy achieved by incorporating the limited types of constraints considered in the paper, it is reasonable to conclude that incorporating these other types would also lead to greatly improved learning accuracy. Incorporating notion of trustworthiness about data and expert judgments. All of the NPT learning methods we have discussed assume that both the data and the expert judgments are reliable. In practice there will be ‘second level’ judgments about the trustworthiness of both. For example, some datasets will be known to be less representative than others, while some expert constraints will (even on the admission of the experts who provided them) be less trustworthy than others. Ideally, therefore, we need a “trustworthiness” mechanism to handle knowledge and data from different sources to increase the robustness of the method. Fully automating the method in BN toolsets. Ideally, the MPL-C method should be available as part of a standard BN tool. In other words, it should be possible within a single BN tool GUI to build a BN model, import both a dataset and a set of experts constraints relevant to the BN, and then have the NPTs in the BN automatically updated according to the MPL-C method calculations. Although we have implemented the MPL-C method using the API of a standard BN tool, its integration into that tool's GUI would be a major challenge.