Signal propagation in Bayesian networks and its relationship with intrinsically multivariate predictive variables
|Article code||Publication year||English article||Persian translation||Word count|
|29202||2013||17-page PDF||Available to order||8981 words|
Publisher: Elsevier - Science Direct
Journal: Information Sciences, Volume 225, 10 March 2013, Pages 18–34
A set of predictor variables is said to be intrinsically multivariate predictive (IMP) for a target variable if all properly contained subsets of the predictor set are poor predictors of the target but the full set predicts the target with great accuracy. In a previous article, the main properties of IMP Boolean variables were described analytically, including the introduction of the IMP score, a metric based on the coefficient of determination (CoD) as a measure of predictiveness with respect to the target variable. It was shown that the IMP score depends on four main properties: logic of connection, predictive power, covariance between predictors and marginal predictor probabilities (biases). This paper extends that work to a broader context, in an attempt to characterize properties of discrete Bayesian networks that contribute to the presence of variables (network nodes) with high IMP scores. We have found that there is a relationship between the IMP score of a node and its territory size, i.e., its position along a pathway with one source: nodes far from the source display larger IMP scores than those closer to the source, and longer pathways display larger maximum IMP scores. This appears to be a consequence of the fact that nodes with small territories are more likely to have highly covarying predictors, which leads to smaller IMP scores. In addition, a larger number of XOR and NXOR predictive logic relationships has a positive influence on the maximum IMP score found in the pathway. This work presents analytical results based on a network with a simple structure and an analysis of random networks constructed by computational simulations. Finally, results from a real Bayesian network application are provided.
Bayesian networks have been used to model systems composed of components that communicate by local interaction, i.e., systems in which each component depends directly on a small number of elements. Biological systems, for instance, have this property. Bayesian networks are mathematically defined in terms of probabilities and conditional independence properties and can be employed to infer direct “causal” influence (connections between variables). The concept of intrinsically multivariate predictive (IMP) variables was introduced in a previous article: a target variable strongly depends on a set of other variables, but this dependence is weak or absent when one considers properly contained subsets of those variables. The IMP score was introduced as a metric based on the coefficient of determination (CoD) as a measure of predictiveness with respect to the target variable. It was shown there that the IMP score of a target variable is affected by four properties: logic of prediction, predictive power, covariance between the predictors, and the marginal probabilities of each individual predictor. It was demonstrated that IMP variables (i.e., variables with a large IMP score) tend to occur for large predictive power, small correlation between predictors, and certain specific predictor logics: 2-minterm logics (XOR and NXOR) lead to larger IMP scores than 1- and 3-minterm logics (AND, OR, NOR, NAND, $x_1 \wedge \bar{x}_2$ and $x_1 \vee \bar{x}_2$). Based on these results, we hypothesized that a large proportion of nodes with XOR logic of prediction in a network could improve the chance of nodes with large IMP scores appearing. We show in this paper that this is indeed the case: the larger the number of XOR/NXOR logics in the network, the larger the maximum IMP score in the network.
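The relationship between predictor logic and IMP score described above can be illustrated with a minimal sketch. It rests on assumptions that are not stated in this excerpt: the CoD is taken as the relative reduction in Bayes error obtained by observing the predictors, and the IMP score as the CoD of the full predictor set minus the largest CoD among its proper, non-empty subsets. The function names (`bayes_error`, `cod`, `imp_score`, `make_joint`) and the noise model behind `power` are hypothetical and chosen only for illustration.

```python
from itertools import combinations, product


def bayes_error(joint, subset):
    """Minimum error when predicting y from the predictors indexed by `subset`.

    `joint` maps (x_tuple, y) -> probability, with Boolean x_i and y.
    """
    groups = {}
    for (x, y), p in joint.items():
        key = tuple(x[i] for i in subset)
        groups.setdefault(key, [0.0, 0.0])[y] += p
    # For each observed pattern, the best guess errs with the smaller mass.
    return sum(min(masses) for masses in groups.values())


def cod(joint, subset):
    """Coefficient of determination (assumed form): relative error reduction."""
    e0 = bayes_error(joint, ())  # error of the best constant guess
    if e0 == 0:
        return 0.0
    return (e0 - bayes_error(joint, subset)) / e0


def imp_score(joint, n):
    """Assumed IMP score: CoD of all n predictors minus the best proper subset."""
    full = tuple(range(n))
    proper = [s for k in range(1, n) for s in combinations(full, k)]
    return cod(joint, full) - max(cod(joint, s) for s in proper)


def make_joint(logic, biases, power=1.0):
    """Independent predictors with P(x_i = 1) = biases[i].

    Assumption: y equals logic(x) with probability `power`, else it is flipped.
    """
    joint = {}
    for x in product((0, 1), repeat=len(biases)):
        px = 1.0
        for xi, b in zip(x, biases):
            px *= b if xi else 1.0 - b
        y = int(logic(*x))
        joint[(x, y)] = px * power
        joint[(x, 1 - y)] = px * (1.0 - power)
    return joint


def xor(a, b):
    return a ^ b


def and_(a, b):
    return a & b


if __name__ == "__main__":
    for name, logic in (("XOR", xor), ("AND", and_)):
        score = imp_score(make_joint(logic, (0.7, 0.7)), 2)
        print(f"{name}: IMP score = {score:.4f}")
```

With both predictor biases set to 0.7 and no noise, this sketch gives a larger score for XOR than for AND, which is in line with the statement about 2-minterm logics. With unbiased predictors (0.5, 0.5) both logics reach the maximum score under these assumptions, so the comparison depends on the marginal probabilities as well as on the logic.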
The study of the IMP phenomenon can be useful in feature selection for pattern recognition, since it is one of the main reasons for the occurrence of the nesting effect. The nesting effect is a feature selection issue that occurs when some features that an algorithm includes in a partial subset solution are not present in the optimal solution and are never discarded, leading to a suboptimal solution. Another application of IMP is that it seems to be associated with variables that possess canalizing functions, an important concept in Systems Biology: canalizing genes play key roles in gene regulatory networks. Martins et al. showed that the DUSP1 gene, a canalizing gene that exerts control over a central, process-integrating signaling pathway, displays the largest number of IMP predictors in melanoma expression data. In addition, Bayesian networks are often applied to financial risk analysis in order to model conditional multivariate dependence among variables. In this paper, we analyze the intrinsically multivariate prediction phenomenon in networks with three or more nodes, an extension of the previous study, which considers only one target and its set of predictors (two or three). In particular, we analyze how the territory size of a target node (a graph-theoretical property defined in Section 4) impacts the probability of occurrence of IMP nodes in Bayesian networks with Boolean variables. We show that a target with a large territory can achieve larger IMP scores with its predictors than a target with a small territory. This finding is in agreement with the hypothesis, advanced in earlier work, that subsets with high IMP scores are more likely to be responsible for the regulation of several metabolic pathways or subsystems, as observed in microarray data analysis of melanoma experiments. We also show that the absolute value of the covariance between predictors is negatively correlated with the territory size.
It is worth mentioning that, although these results are given in the context of logical functions, they can be extended to other types of functions. In summary, this paper contributes to theoretical advances in the analysis of the intrinsically multivariate prediction phenomenon in the context of Bayesian networks. This work is organized as follows. Section 2 reviews fundamental concepts. Section 3 describes the network model used to analyze the behavior of the IMP score as a function of the territory size of a given target. Section 4 presents analytical results based on a network with a simple structure. In order to generalize the analytical results, Section 5 presents an analysis of the IMP score in random networks constructed by computational simulations, as well as a real example from the Bayesian Network Repository (http://www.cs.huji.ac.il/site/labs/compbio/Repository). Finally, conclusions are given in Section 6.
English conclusion
In this paper we have analyzed the intrinsically multivariate prediction phenomenon in networks with three or more nodes, an extension of the previous study, which considers only one target and its set of predictors. In order to measure the intrinsically multivariate predictiveness, we employed the IMP score, a metric based on the coefficient of determination (CoD). We have derived analytical formulas for the IMP score in terms of the territory size of a given node in a particular network structure following the Bayesian network model. In order to study more general Bayesian networks, we have conducted an experimental study based on simulated networks, and the results corroborate the analytical results obtained for the particular network structure. The main finding of this paper is that the IMP score between a target node and its predictors (the nodes directly connected with the target) tends to grow with the territory size. This tendency was also found in a real Bayesian network application. It is explained by the fact that the absolute value of the covariance between predictors of a given target decreases with the territory size of that target, which is consistent with the previously reported result that large absolute covariance tends to reduce the IMP score. Moreover, this trend is very relevant to biological networks, which usually have a large number of nodes and, hence, a high probability of containing nodes with large territories (e.g., gene networks with thousands of genes). Naturally, it is possible to find nodes with large IMP scores without large territories, but if there are more nodes with large territories, then large IMP scores become more likely, because a low covariance between the predictors (possibly caused by the large target territory, as shown in Fig. 15) tends to have a positive impact on their IMP score with respect to the target.
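The effect of predictor covariance on the IMP score can be illustrated in a single, deliberately simple case. The sketch below is not the paper's analysis. It assumes, as a working definition not given in this excerpt, that the IMP score is the CoD of the full predictor set minus the largest CoD of any proper subset, with the CoD taken as the relative reduction in Bayes error. The helper `correlated_joint` and its parameters are hypothetical. It builds two Boolean predictors with fixed marginals and a chosen covariance, then evaluates a noiseless AND target.

```python
from itertools import combinations


def bayes_error(joint, subset):
    """Minimum error when predicting y from the predictors indexed by `subset`."""
    groups = {}
    for (x, y), p in joint.items():
        key = tuple(x[i] for i in subset)
        groups.setdefault(key, [0.0, 0.0])[y] += p
    return sum(min(masses) for masses in groups.values())


def imp_score(joint, n):
    """Assumed IMP score: CoD of the full set minus the best proper-subset CoD."""
    e0 = bayes_error(joint, ())  # error of the best constant guess

    def cod(subset):
        return (e0 - bayes_error(joint, subset)) / e0 if e0 > 0 else 0.0

    full = tuple(range(n))
    proper = [s for k in range(1, n) for s in combinations(full, k)]
    return cod(full) - max(cod(s) for s in proper)


def correlated_joint(logic, cov, p1=0.5, p2=0.5):
    """Joint of (x1, x2, y) with Cov(x1, x2) = cov and y = logic(x1, x2)."""
    p11 = p1 * p2 + cov
    px = {
        (1, 1): p11,
        (1, 0): p1 - p11,
        (0, 1): p2 - p11,
        (0, 0): 1.0 - p1 - p2 + p11,
    }
    if min(px.values()) < -1e-12:
        raise ValueError("covariance not feasible for these marginals")
    return {(x, int(logic(*x))): p for x, p in px.items()}


def and_(a, b):
    return a & b


if __name__ == "__main__":
    for cov in (0.0, 0.1, 0.2):
        score = imp_score(correlated_joint(and_, cov), 2)
        print(f"cov = {cov:.1f}: IMP score = {score:.4f}")
```

In this example the score falls as the positive covariance grows, because each predictor alone becomes more informative about the target. The size and even the direction of the effect depend on the logic, the sign of the covariance and the predictor biases, so the sketch shows one instance of the tendency rather than a general rule.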
The asymptotic IMP score was found to depend on the number of XOR logics and the predictive power in the network. This suggests a kind of capacity of the network, the maximum (asymptotic) IMP score, from which classes of networks and their properties may be investigated. For example, from Fig. 5, different networks (with different predictive power and number of XOR gates) that converge to the same IMP score may be found. This work presented theoretical advances in the analysis of the intrinsically multivariate prediction concept in Bayesian networks. Based on the results presented here, methods could be proposed to find intrinsically multivariate predicted variables in Bayesian networks. Such methods could be guided to look for variables with large territories or XOR logics. The development of a technique to discover IMP variables based on the theoretical findings of this paper can be considered for future work.