Download English ISI Article No. 29217
Persian Translation of the Article Title

Parallel globally optimal structure learning of Bayesian networks

English Title
Parallel globally optimal structure learning of Bayesian networks
Article Code: 29217 | Publication Year: 2013 | Length: 10 pages (PDF)
Source

Publisher: Elsevier - Science Direct

Journal: Journal of Parallel and Distributed Computing, Volume 73, Issue 8, August 2013, Pages 1039–1048

Translated Keywords
Bayesian networks - Graphical models - Machine learning - Structure learning - Parallel algorithm
English Keywords
Bayesian networks, Graphical models, Machine learning, Structure learning, Parallel algorithm
Article Preview

English Abstract

Given $n$ random variables and a set of $m$ observations of each of the $n$ variables, the Bayesian network structure learning problem is to learn a directed acyclic graph (DAG) on the $n$ variables such that the implied joint probability distribution best explains the set of observations. Bayesian networks are widely used in many fields including data mining and computational biology. Globally optimal (exact) structure learning of Bayesian networks takes $O(n^2 \cdot 2^n)$ time plus the cost of $O(n \cdot 2^n)$ evaluations of an application-specific scoring function whose run-time is at least linear in $m$. In this paper, we present a parallel algorithm for exact structure learning of a Bayesian network that is communication-efficient and work-optimal up to $O(\frac{1}{n} \cdot 2^n)$ processors. We further extend this algorithm to the important restricted case of structure learning with bounded node in-degree and investigate the performance gains achievable because of limiting node in-degree. We demonstrate the applicability of our method by implementation on an IBM Blue Gene/P system and an AMD Opteron InfiniBand cluster and present experimental results that characterize run-time behavior with respect to the number of variables, number of observations, and the bound on in-degree.
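To give a rough sense of the scale implied by these complexities (an illustrative calculation, not taken from the paper), the snippet below evaluates the stated $O(n^2 \cdot 2^n)$ dynamic-programming work and $O(n \cdot 2^n)$ score-evaluation counts for a few problem sizes, ignoring the constants hidden by the O-notation:

```python
# Illustrative only: evaluate the stated asymptotic counts for a few values of n.
for n in (16, 24, 33):
    dp_steps = n * n * 2 ** n       # O(n^2 * 2^n) dynamic-programming work
    score_evals = n * 2 ** n        # O(n * 2^n) scoring-function evaluations
    print(f"n={n:2d}: ~{dp_steps:.1e} DP steps, ~{score_evals:.1e} score evaluations")
```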

English Introduction

Bayesian networks (BNs) are graphical models which represent probabilistic relationships among interacting random variables of a given domain. In BNs, relationships between variables are depicted in the form of conditional independences and quantitatively assessed by conditional probability distributions [9]. BNs provide a holistic view of the interactions within a domain and can be learned from data. They have been successfully applied in medical and fault diagnosis, bioinformatics, e-commerce, user preference prediction, spam filtering, etc., and have been widely used in recent decades [2], [6], [11], [12] and [24].

Formally, a Bayesian network is defined to consist of two components. Let $P$ specify the joint probability distribution over some set of random variables $X = \{X_1, X_2, \ldots, X_n\}$, and let $N = (X, E)$ be a directed acyclic graph (DAG). In $N$, a node $X_j$ is called a parent of $X_i$ if an edge from $X_j$ to $X_i$ exists. A node $X_k$ is called a descendant of $X_i$ if there is a directed path from $X_i$ to $X_k$, and is termed a nondescendant of $X_i$ if no such path exists. The pair $(N, P)$ defines a Bayesian network if each variable in $X$ is independent of its nondescendants, given its parents, denoted by $I_P(X_i, ND(X_i) \mid Pa(X_i))$, where $ND(X_i)$ and $Pa(X_i)$ denote the nondescendants and parents of $X_i$ in $N$, respectively. This condition is known as the Markov assumption [16]. Given that $(N, P)$ satisfies the Markov assumption, and from the chain rule of probability, $P$ is decomposable into the following product form:

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i)).$$

When learning the structure of a BN from data, the Bayesian approach utilizes a statistically motivated scoring function to evaluate the posterior probability of a considered graph, given the data [4] and [10]. One example of such a scoring function is the Bayesian score. Suppose we are given $m$ observations of the system, each observation consisting of the value of each of the $n$ random variables in $X$. These observations can be viewed as an $n \times m$ matrix $D_{n \times m}$, with rows representing variables and columns representing observations. Then, the Bayesian score for a network $N$ given $D_{n \times m}$ is

$$\mathrm{Score}(N) = \log P(N \mid D) = \log P(D \mid N) + \log P(N) + C,$$

where $C$ is a constant. In this score, the two non-constant terms $\log P(D \mid N)$ and $\log P(N)$ refer respectively to the log-likelihood of $D$ given $N$ and the prior probability of $N$, usually taken to be uniform. To find the optimal network efficiently, it is crucial to choose a scoring function which decomposes into individual score contributions $s(X_i, Pa(X_i))$ of each of the variables in $X$ given its parents:

$$\mathrm{Score}(N) = \sum_{X_i \in X} s(X_i, Pa(X_i)).$$

Examples of Bayesian and information-theoretic scoring functions include the Bayesian Dirichlet (BD) [4], Bayesian Information Criterion (BIC) [20], and Minimum Description Length (MDL) [14] scoring functions. More generally, such scoring functions have been shown to, with high probability, assign superior scores to graphs which more precisely depict the dependences in the data [9]. BNs are said to be equivalent, and therefore indistinguishable, if they represent the same set of independences [8]. Finally, optimizing the scoring criterion facilitates the discovery of equivalent structures that best represent the observed data.
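As a concrete illustration of a decomposable score (a minimal sketch, not the paper's implementation; it assumes discrete data with uniform arity, and the names local_score and network_score are illustrative), a BIC-style local score $s(X_i, Pa(X_i))$ can be computed directly from the observation matrix $D_{n \times m}$ and summed over the variables:

```python
# A minimal sketch (not the paper's code) of a decomposable, BIC-style score:
# s(X_i, Pa(X_i)) is computed from an n x m matrix of discrete observations,
# and Score(N) is the sum of the per-variable contributions.
from collections import Counter
from math import log

import numpy as np

def local_score(D, i, parents, arity):
    """BIC-style local score of variable i given a tuple of parent indices.

    D     : n x m integer array; D[v, k] is the value of variable v in observation k.
    arity : number of states each variable can take (assumed uniform here).
    """
    n, m = D.shape
    # Group observations by the joint configuration of the parents.
    counts = {}
    for k in range(m):
        pa_cfg = tuple(D[p, k] for p in parents)
        counts.setdefault(pa_cfg, Counter())[D[i, k]] += 1
    loglik = 0.0
    for child_counts in counts.values():
        total = sum(child_counts.values())
        for c in child_counts.values():
            loglik += c * log(c / total)
    # Complexity penalty: (arity - 1) free parameters per parent configuration.
    num_params = (arity - 1) * (arity ** len(parents))
    return loglik - 0.5 * num_params * log(m)

def network_score(D, parent_sets, arity):
    """Score(N) = sum_i s(X_i, Pa(X_i)) for a DAG given as {i: tuple_of_parent_indices}."""
    return sum(local_score(D, i, pa, arity) for i, pa in parent_sets.items())

# Toy example: 4 binary variables, 100 observations, one candidate DAG.
rng = np.random.default_rng(0)
D = rng.integers(0, 2, size=(4, 100))
dag = {0: (), 1: (0,), 2: (0,), 3: (1, 2)}
print(network_score(D, dag, arity=2))
```

Any decomposable score with this interface, a per-variable contribution given a candidate parent set, can be plugged into the subset dynamic program sketched after the next passage.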
A major difficulty in BN structure learning is the super-exponential search space in the number of random variables. As reported by Robinson [19], for a set of $n$ variables there exist $\frac{n! \, 2^{\frac{n(n-1)}{2}}}{r \cdot z^n}$ possible DAGs, where $r \approx 0.57436$ and $z \approx 1.4881$. To reduce the search space, marginalization over node orders has been proposed [13], [18] and [21]. Here we briefly review the main idea. Consider the graph $N = (X, E)$ of a BN $(N, P)$. Due to its DAG structure, the random variables in $X$ can be ordered in such a way that for each variable $X_i \in X$ its parents $Pa(X_i)$ precede it in the specified ordering. Optimizing a scoring function for each variable $X_i \in X$ and all of its respective subsets of preceding elements allows us to discover the optimal network that respects the particular ordering. Further, to investigate all possible graphs we need to consider a total of $n!$ orderings of $X$. This yields a search space of ordered subsets of $X$ of size $n! \, 2^n$. However, as long as $Pa(X_i)$ precedes $X_i$, the order among the nodes in $Pa(X_i)$ is irrelevant. Thus, the search space can be reduced to the $2^n$ unordered subsets of $X$. Even reduced, the search space remains exponential and, in practice, exact structure learning without additional constraints has been achieved only for moderate-size domains and at high execution cost. While exact network structure learning, or globally optimal structure learning, has been shown to be NP-hard even with bounded node in-degree [3], constructing exact Bayesian networks without additional assumptions remains valuable nevertheless.

1.2. Contributions

In this paper, we present a parallel algorithm for exact Bayesian network structure learning with optimal parallel efficiency on up to $O(\frac{1}{n} \cdot 2^n)$ processors, regardless of the complexity of the scoring function used. We also investigate structure learning with user-specified bounded in-degree $d$, which is the maximum allowed in-degree (or equivalently, the number of parents) of any node in the network. We show that for $d < \frac{1}{3}n - \log_m n$ the asymptotic run-time complexity remains unaffected as a function of $d$. An important consequence of this result is that networks with a generous allowance of bounded in-degree can be learnt in the same time as networks where only constant in-degree is allowed. We also show that $d \ge \lceil \frac{n}{2} \rceil$ results in the same asymptotic run-time complexity as $d = n - 1$, where no restriction is placed on the in-degree. To assess the performance of our algorithms experimentally, we report results on two parallel systems: an IBM Blue Gene/P and an AMD Opteron cluster with InfiniBand interconnect. Experimental results demonstrate the validity of the theoretical predictions.

1.3. Related work

In the past, parallel algorithms have been developed for heuristics-based Bayesian network learning methods, particularly using meta-heuristics and sampling techniques [1] and [15]. Heuristics-based algorithms trade off optimality for the ability to learn larger networks. The presented work in exact Bayesian learning is intended to push the scale of networks for which optimal networks can be inferred. Our parallel algorithm is based on Ott et al.'s sequential algorithm [18], which takes $O(n 2^n)$ steps and has a run-time and space complexity of $O(n^2 2^n)$. To our knowledge, the presented work is the first parallel algorithm for exact learning, with an initial conference version published in 2009 [17].
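To make the subset-based formulation concrete, the following is a minimal sequential sketch of a dynamic program of this kind over the $2^n$ unordered subsets: the best network on a subset $S$ ends in some sink $X_i$ whose parents are chosen optimally from $S \setminus \{X_i\}$. This is an illustration under the assumption of an arbitrary decomposable local score (such as the BIC-style one above), not the authors' or Ott et al.'s code, and exact_structure_learning and its internal names are hypothetical:

```python
# Sketch of exact structure learning by dynamic programming over bitmask subsets.
# local_score(i, parents) must be any decomposable score; run-time is exponential.
from itertools import combinations

def exact_structure_learning(n, local_score):
    """Return (best total score, {variable: parent tuple}) over all DAGs on n variables."""
    # best_parents[i][mask]: best (score, parent set) for X_i with parents drawn from mask.
    best_parents = [dict() for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for cand in combinations(others, r):
                mask = sum(1 << j for j in cand)
                best = (local_score(i, cand), cand)
                # The best parent set within 'mask' is either 'cand' itself or the best
                # within one of its (already computed) subsets missing one element.
                for j in cand:
                    sub = mask & ~(1 << j)
                    if best_parents[i][sub] > best:
                        best = best_parents[i][sub]
                best_parents[i][mask] = best
    # opt[S] = (best score of a network on subset S, chosen sink, its parents)
    opt = {0: (0.0, None, ())}
    for mask in range(1, 1 << n):
        best = None
        for i in range(n):
            if mask & (1 << i):
                rest = mask & ~(1 << i)
                s, pa = best_parents[i][rest]
                cand = (opt[rest][0] + s, i, pa)
                if best is None or cand > best:
                    best = cand
        opt[mask] = best
    # Recover the DAG by repeatedly peeling the chosen sink off the full set.
    dag, mask = {}, (1 << n) - 1
    while mask:
        _, sink, pa = opt[mask]
        dag[sink] = pa
        mask &= ~(1 << sink)
    return opt[(1 << n) - 1][0], dag
```

With the earlier scoring sketch, this could be called as exact_structure_learning(4, lambda i, pa: local_score(D, i, pa, arity=2)). The exponentially large tables make such a sequential computation practical only for small $n$, which is precisely what motivates distributing the subsets across processors.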
In addition, the bounds established in this paper for run-time as a function of the maximum node in-degree are new in both sequential and parallel settings. In 2011, Tamada et al. [22] presented a different parallel algorithm for globally optimal structure learning, also based on Ott et al.'s sequential algorithm. Because of the shared objective, and as both parallel algorithms are based on the same sequential algorithm, we review that work in more detail here. Tamada et al.'s [22] parallel algorithm is based on an entirely different strategy from the one presented in this paper. Both approaches compute optimal networks for all $2^n$ subsets of $X$ and take advantage of the structure of optimal solutions through dynamic programming, as in Ott et al. [18]. Tamada et al. compute the optimal networks for all subsets of the same size in parallel. They exploit the observation that computation on a subset can be flexibly structured based on the availability of computation on smaller subsets contained within it, and communication savings can be realized by scheduling same-size subsets with significant overlaps on the same processor. To reduce communication, the algorithm presents a trade-off where calculations are sometimes performed redundantly, thus increasing both the overall work complexity and the space complexity of the parallel algorithm. As in Ott et al., they present run-time and space complexities as the number of steps, while we prefer to report run-time and space complexities in terms of actual time and space. Multiplication by $O(n)$ provides for correct translation from the number of steps to actual complexities. Tamada et al.'s algorithm has a work complexity of $O(n^{\sigma+1} 2^n)$ steps, which translates to a work and space complexity of $O(n^{\sigma+2} 2^n)$, for any integer $\sigma \ge 1$. Thus, the algorithm performs $O(n^{\sigma})$ more work and takes $O(n^{\sigma})$ more space asymptotically than the optimal sequential algorithm, or our efficient parallelization of the sequential algorithm. An important practical issue in Tamada et al.'s strategy is that even though increasing $\sigma$ appears to increase the run-time and space complexities, thus supporting in theory the minimum value of $\sigma = 1$, there is a significant reduction in communication with the increase in $\sigma$. Thus, they explore different values of $\sigma$ in a range of experiments and conclude that the total run-time is actually the smallest for $\sigma = 4$, while memory always increases as a function of $\sigma$. As this increase is quite drastic when moving from $\sigma = 3$ to $\sigma = 4$, and the corresponding decrease in measured run-time is marginal, they suggest $\sigma = 3$ as the optimal choice. Asymptotically, this choice translates to an $O(n^3)$ increase in run-time (not realized in practice, since $\sigma = 3$ is actually faster than $\sigma = 1$) and an $O(n^3)$ increase in space complexity (realized in practice) when compared to the sequential algorithm. In addition, their scaling results demonstrate good scaling up to 256 processors and acceptable scaling at 512.

In contrast, our parallel algorithm is a work-optimal parallelization of Ott et al.'s algorithm and does not limit concurrent computation to subsets of the same size only. In addition, it is also a space-optimal parallelization, both in terms of the total space complexity and the space complexity per processor, the latter within a small constant factor of 2.
As the underlying problem has exponential run-time, it should be possible to achieve scalable parallelization when utilizing a very large number of processors. Indeed, we demonstrate that an exponential number of processors can be used while maintaining optimal parallel efficiency (up to $O(\frac{1}{n} 2^n)$ processors) and demonstrate good scaling to 2048 processors even for a small problem size of $n = 24$. We also report on an $n = 33$ run, which pushes the scale of what is achievable for exact structure learning.

The remainder of this paper is organized as follows. In Section 2, we provide a brief overview of the Ott et al. sequential method for exact Bayesian network structure learning, upon which our parallel algorithm is based. Sections 3 and 4 contain our parallel algorithm, and proofs of correctness and run-time and space complexity analysis, respectively. In Section 5, we present scaling experiments on an IBM Blue Gene/P system and an AMD Opteron InfiniBand cluster by applying our algorithm to the problem of identifying gene regulatory networks, a problem that arises in systems biology. Structure learning with bounded in-degree is presented in Section 6, where we provide a characterization of the run-time complexity as a function of the pre-specified maximum node in-degree. We evaluate the empirical behavior of this algorithm in Section 7. Conclusions are presented in Section 8.

English Conclusion

Exact BN structure learning from data is a challenging problem due to its exponentially large combinatorial search space. In this paper, we presented a parallel algorithm for this problem that is work-optimal and scalable. Experimental results confirm that the algorithm achieves good efficiency in practice. We also investigated when a performance gain can be achieved by bounding node in-degree and evaluated the empirical behavior of the modified parallel algorithm.
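To illustrate why bounding the in-degree can pay off (an illustrative sketch, not the paper's analysis; parent_set_count is a hypothetical helper), a uniform in-degree bound $d$ shrinks the number of candidate parent sets per variable from $2^{n-1}$ to $\sum_{k=0}^{d} \binom{n-1}{k}$, and hence the number of scoring-function evaluations that must be performed:

```python
# Illustrative only: candidate parent sets per variable with and without an in-degree bound d.
from math import comb

def parent_set_count(n, d=None):
    """All subsets of the other n-1 variables, or only those of size at most d."""
    if d is None:
        return 2 ** (n - 1)
    return sum(comb(n - 1, k) for k in range(d + 1))

n = 24
for d in (2, 4, 8, None):
    print(f"d={d}: {parent_set_count(n, d):,} candidate parent sets per variable")
```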