الگوریتم های یادگیری ساختاری شبکه های بیزی مبتنی بر همبستگی جزئی تحت SEM خطی
کد مقاله | سال انتشار | تعداد صفحات مقاله انگلیسی |
---|---|---|
29139 | 2011 | 14 صفحه PDF |
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Knowledge-Based Systems, Volume 24, Issue 7, October 2011, Pages 963–976
چکیده انگلیسی
A new algorithm, the PCB (partial correlation-based) algorithm, is presented for Bayesian network structure learning. The algorithm effectively combines ideas from local learning with partial correlation techniques. It reconstructs the skeleton of a Bayesian network based on partial correlation and then performs a greedy hill-climbing search to orient the edges. Specifically, we make three contributions. First, we prove that in a linear SEM (simultaneous equation model) with uncorrelated errors, when the datasets are generated by linear SEM, subject to arbitrary distribution disturbances, we can use partial correlation as the criterion of the CI test. Second, we perform a series of experiments to find the best threshold value of the partial correlation. Finally, we show how partial correlation can be used in Bayesian network structure learning under linear SEM. The effectiveness of the method is compared with current state of the art methods on eight networks. A simulation shows that the PCB algorithm outperforms existing algorithms in both accuracy and run time.
مقدمه انگلیسی
Bayesian networks (BN) are widely used to represent probabilistic relationship among random variables. They have been successfully applied to many domains such as medical diagnosis, gene data analysis, and hardware troubleshooting, rare event predictions, scenario analysis [28], [4] and [6]. Learning the structure of a Bayesian network from a dataset D is useful; unfortunately, it is an NP-hard problem [5]. Consequently, many heuristic techniques have been proposed. One of the most basic search algorithms is a local greedy hill-climbing search over all DAG structures. The size of the search space of the greedy search is a super-exponential function of the number of variables. One approach is to place constraints on the search to improve its efficiency, as in the K2 algorithm [7], the SC algorithm [10], the MMHC algorithm [25], and the L1MB algorithm [18]. One drawback of the K2 algorithm is that it requires a total variable ordering. The SC algorithm is based on the idea of local learning and uses a two-phase framework including a Restriction step and a Search step. In the Restriction step, the SC algorithm uses mutual information to find a set of potential neighbors (parents and children) for each node and achieves fast learning by restricting the search space. One drawback of the SC algorithm is that it only allows a variable to have a maximum of up to k parents. However, a common parameter k for all nodes will have to sacrifice either efficiency or quality of reconstruction [25]. The MMHC algorithm uses the max–min parents–children (MMPC) algorithm to identify a set of potential neighbors [25]. Our experiments show that the MMHC algorithm has high accuracy, but one drawback of it is that it requires conditional independency tests on exponentially large conditioning sets. Therefore, the MMHC algorithm is very slow on high dimensional complex networks. The L1MB algorithm uses L1 techniques to learn DAG structure and uses the LARS algorithm [9] to find a set of potential neighbors [18]. The L1MB algorithm has good time performance. However, the L1MB algorithm evaluates the effects of a set of variables, not a single variable. The method can describe the correlation between a set of variables and a variable but not the correlation between two variables. It is not reasonable to use this method to select potential neighbors, and our experiments show that the L1MB algorithm has low accuracy. In fact, many algorithms, such as the K2, SC, PC [22], TPDA [3], and MMHC, can be implemented efficiently with discrete variables but are not directly applicable to continuous variables. Some algorithms including the SC, PC, and TPDA, have been designed for discrete variables. Even though they can be used for continuous variables, our experiments show that they have many structural errors. The L1MB algorithm has been designed for continuous variables. However, it uses L1 regression to find a set of potential neighbors for one variable once. However, it cannot precisely capture the causal relation between two variables, so the selection of potential neighbors is somewhat unreasonable, and our experiments show that its accuracy is relatively low. In this paper, we propose a new heuristic technique using local learning, and we show experimentally that it outperforms several existing approaches for continuous and binary variables. The correlations among multiple correlative variables are complex. To a certain extent, there is a correlation between any two variables, but that correlation is affected by the other correlative variables. The simple correlation method does not consider these influences, so it cannot reveal the true correlation between two variables. The true correlation between two variables can be obtained only after the influences of the other correlative variables are removed. The partial correlation method can eliminate the influences of other correlative variables by holding them unchanged in the analysis and thus reveal the true correlation between the two variables of interest [27]. For example, the findings in [27] suggest that the simple correlation coefficient between NmF2 and h(100) is affected by other influence factors and therefore cannot reveal the true correlation between NmF2 and h(100); the partial correlation method can eliminate the influences of F107, Ap and the seasonal variation factors and can thus reveal the true correlation between the two variables of interest by eliminating the influences of the other correlative variables. Partial correlation has been widely used to describe the relative importance of variables in multivariate regression analysis, and it has been successfully applied to many fields such as medicine [13] and [26], economics [23], and geology [27]. In causal discovery, it has been used (as transformed by Fisher’s z [17]) as a continuous replacement for CI tests in the PC algorithm. Pellet et al. introduced the partial-correlation-based CI test into causation discovery, on the assumption that the data follow a multivariate Gaussian distribution for continuous variables [14]. However, when the data do not follow a multivariate Gaussian distribution, can partial correlation be used as a CI test? Our first contribution is to prove that partial correlation can be used as the criterion for a CI test under the linear simultaneous equation model (SEM), which includes the multivariate Gaussian distribution as a special case. Our second contribution is that we propose an effective algorithm, called PCB (partial correlation-based), which effectively combines ideas from local learning with partial correlation techniques. The PCB algorithm works in the continuous or binary variable settings under the assumption data generated by linear SEM. The computational complexity of PCB is O(3mn2 + n3)(where n is the number of variables and m is the number of cases). One advantage of PCB is its time performance. The time complexity of our PCB is bounded by a polynomial in the number of variables. Another advantage of the PCB algorithm is its quite high accuracy. The third advantage of the PCB algorithm is that it uses a relevance threshold to evaluate the correlation to alleviate the drawback of SC algorithm (common parameter k for all nodes), and we also find the best relevance threshold by a series of extensive experiments. Empirical results show that PCB outperforms the above existing algorithms in both accuracy and time performance. The remainder of the paper is structured as follows. In Section 2, we present the background of learning structure. In Section 3, we present the PCB algorithm and its computational complexity analysis. Some empirical results are presented and discussed in Section 4. Finally, we conclude this work and address some issues for future work in Section 5.
نتیجه گیری انگلیسی
The contributions of this paper are two-fold. (1) Partial correlation is a valid CI measure under the assumption that data follows a multivariate Gaussian distribution. In this paper, we prove that partial correlation can be used as a CI test under linear SEM, which includes the multivariate Gaussian distribution as a special case. We redefine the strong relevance and weak relevance. Based on a series of experiments, we find the best relevance threshold. (2) In this paper, we propose the PCB algorithm, by introducing partial correlation into the two-phase heuristic algorithm so as to make it able to deal with continuous and binary data. Theoretical analysis and empirical results show that the PCB algorithm performs better than the other existing algorithms on both accuracy and run time. Significant efforts will be required to extend our work. We have only proposed the best relevance threshold based on empirical results, and we intend to study the problem further from a theoretical perspective. We will also extend our algorithm to higher dimensions and larger datasets.