یادگیری افزایشی از شبکه های بیزی حفاظت از حریم خصوصی
کد مقاله | سال انتشار | تعداد صفحات مقاله انگلیسی |
---|---|---|
29220 | 2013 | 11 صفحه PDF |
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Applied Soft Computing, Volume 13, Issue 8, August 2013, Pages 3657–3667
چکیده انگلیسی
Bayesian Networks (BNs) have received significant attention in various academic and industrial applications, such as modeling knowledge in image processing, engineering, medicine and bio-informatics. Preserving the privacy of sensitive data, owned by different parties, is often a critical issue. However, in many practical applications, BNs must train from data that gradually becomes available at different period of times, on which the traditional batch learning algorithms are not suitable or applicable. In this paper, an algorithm based on a new and efficient version of Sufficient Statistics is proposed for incremental learning with BNs. The standard K2K2 algorithm is also modified to be utilized inside the incremental learning algorithm. Next, some secure building blocks such as secure comparison, and factorial, which are resistant against colluding attacks and could be applied securely over public channels like internet, are presented to be used inside the main protocol. Then a privacy-preserving protocol is proposed for incremental learning of BNs, in which the structure and probabilities are estimated incrementally from homogeneously distributed and gradually available data among two or multi-parties. Finally, security and complexity analysis along with the experimental results are presented to compare with the batch algorithm and to show its performance and applicability in real world applications.
مقدمه انگلیسی
Bayesian Networks are probabilistic graphical models [24] that are trained to represent the relationship between variables from a dataset [28]. Medical diagnosis applications, fraud detection systems, and financial networks widely utilize such networks to create models and make decision according to the probabilistic independencies among the variables of the underlying databases [29]. For instance, Fenton and Neil in [10] show the successful application of Bayesian Networks in risk management, and Ji et al. in [16] use Bayesian Networks for optimization problems. Also, Yang et al. in [17] utilize Bayesian Networks for optimization in heterogeneous computing environments. According to the privacy regulations such as Freedom of Information and Protection of Privacy Act (FIPPA) [39] in Canada, or the Health Insurance Portability and Accountability Act (HIPAA) [40] in the United States, individual's private and sensitive data must be secured when protocols are applied on data used to train BNs. To create the BNs structure and parameters using training data which is securely shared among two or more parties, they cannot simply present their own private data to each other, or even to a third party to run a learning algorithm on the whole data. Therefore, privacy-preserving protocols are needed to apply in these situations. In many practical applications, BNs must be trained using data that becomes available at different points in time. The traditional techniques for training BNs (e.g. K2K2 algorithm) are batch in nature, and are not suitable for training on data that arrives incrementally. To obtain a high level of performance, using a batch technique would involve accumulating all training data in memory, and recreating a new BN from scratch using all cumulative data. The time and memory complexity of retraining on all data would be prohibitive in applications with large amounts of training data. For instance, selling records in Walmart, as a chain of large stores, are gradually growing everyday and it is not reasonable to store all data and run the data mining and machine learning algorithms on all data every time a block of new data becomes available. This is an ubiquitous scenario in many different fields such as healthcare systems, government applications and so on. Therefore, incremental learning is needed to efficiently update BNs on new data, in terms of data storage and processing time. In this paper, BNs structure is incrementally constructed each time a block of new training data is available by updating the sufficient statistics of the existing network structure, and a new structure is created accordingly. Note that by updating sufficient statistics, the probability table of each node could also be computed. After reviewing the existing techniques for incremental learning of BNs, first an improved version of sufficient statistics, in terms of required storage and search time, is proposed followed by a specialized K2K2 algorithm to be applied in our incremental learning algorithm for BNs. After presenting that algorithm, a privacy-preserving protocol and required secure building blocks, such as secure comparison, and factorial, are proposed along with their security and complexity analysis and experimental results. Fig. 1 illustrates the contribution of the paper and the relations of its components. Inside the main protocol of privacy-preserving incremental BNs, the new incremental learning algorithm is used to update the BNs structure by using an improved version of sufficient statistics and calling the specialized K2K2 algorithm. Also secure building blocks are utilized inside the protocol to maintain the privacy of the parties involved. These building blocks, which are indicated as a bold box in Fig. 1, are previously proposed by the same authors in [32] and [33]. Full-size image (17 K) Fig. 1. Overview of the proposed privacy-preserving incremental Bayesian network learning. Figure options As a summary, each time a block of new data becomes available, Incremental BNs algorithm is called with the ordered list of nodes, sufficient statistics of the previous data, the set of previous candidate lists of parents, and the new data. Its output is a directed acyclic graph for the BNs and the new updated sufficient statistics. Inside this algorithm, the specialized K2K2 algorithm is called when needed with the current node, its parents set, sufficient statistics of that node, its predecessors, and the set of candidate parents lists as the inputs. The output of this algorithm will be the new parents set of the node and its updated candidate parents lists. Each time we need to compute the score function in those two algorithms, secure building blocks are used to securely compute this function without revealing private data of each party to the others. With this privacy-preserving protocol, it is assumed that data is homogenously shared and owned by several parties, while these parties want to keep their sensitive data private. The protocol is also secure against colluding attacks and could be run over public channels. We show the applicability and efficiency of the proposed protocol by testing it against several different datasets, various number of parties involved and reasonable encryption key sizes, 512, 1024 and 2048 bits, to keep the privacy of the protocol strong. Also, the comparison of the final results of the incremental learning with the batch algorithm, in terms of efficiency and accuracy, shows its great promise to be applied in real world applications. The rest of the paper is organized as follows. In Section 2, the BN is briefly introduced, along with a brief survey of training algorithms and privacy-preserving BNs. An improved version of sufficient statistics, an incremental algorithm for learning BN structure and a modified K2K2 [9] algorithm which could exploit that algorithm presented in Section 3. A privacy-preserving protocol for Incremental Bayesian Networks is presented in Section 4, followed by the experimental results in Section 5.
نتیجه گیری انگلیسی
Incremental learning algorithm for stream and online data to construct Bayesian Networks is considered in this paper when data is privately and horizontally shared among two or more parties, by improvement on computation of Sufficient Statistics and providing an incremental and efficient version of K2K2 algorithm. Secure building blocks, such as secure product comparison and factorial, which are resistant to colluding attacks and are secure in dedicated as well as public channels are also proposed to utilize in the main protocol to preserve the privacy of the parties involved. Experimental results indicate that the efficiency will significantly increase by using the new version of incremental K2K2 algorithm an improved computation of Sufficient Statistics. Applying the proposed algorithms and protocols on real-world applications, in which privacy and efficiency are crucial factors and data is gradually growing, would be suitable to show the applicability of these methods and to find obstacles and new directions to improve these types of privacy-preserving protocols.