اجرای FPGA بهینه سازی ازدحام ذرات برای یادگیری شبکه های بیزی
کد مقاله | سال انتشار | تعداد صفحات مقاله انگلیسی |
---|---|---|
29273 | 2013 | 15 صفحه PDF |
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Computers & Electrical Engineering, Volume 39, Issue 8, November 2013, Pages 2454–2468
چکیده انگلیسی
Using a Bayesian network (BN) learned from data can aid in diagnosing and predicting failures within a system while achieving other capabilities such as the monitoring of a system. However, learning a BN requires computationally intensive processes. This makes BN learning a candidate for acceleration using reconfigurable hardware such as field-programmable gate arrays (FPGAs). We present a FPGA-based implementation of BN learning using particle-swarm optimization (PSO). This design thus occupies the intersection of three areas: reconfigurable computing, BN learning, and PSO. There is significant prior work in each of these three areas. Indeed, there are examples of prior work in each pair among the three. However, the present work is the first to study the combination of all three. As a baseline, we use a prior software implementation of BN learning using PSO. We compare this to our FPGA-based implementation to study trade-offs in terms of performance and cost. Both designs use a master–slave topology and floating-point calculations for the fitness function. The performance of the FPGA-based version is limited not by the fitness function, but rather by the construction of conditional probability tables (CPTs). The CPT construction only requires integer calculations. We exploit this difference by separating these two functions into separate clock domains. The FPGA-based solution achieves about 2.6 times the number of fitness evaluations per second per slave compared to the software implementation.
مقدمه انگلیسی
A Bayesian network (BN) [1] is a graphical model that expresses probabilistic relationships between variables. BNs can be used to model complex real-world systems. If the underlying relationship between variables in a system is not yet known, the structure of a BN can be learned from observed data. BN learning uses a function that assigns a score to each candidate network, a definition of the search space, and a search algorithm for optimization [2]. The approach to finding the best structure for a BN is an NP-Hard problem [3] and, therefore, is computationally expensive, because the number of possible structures is super-exponential in the number of variables. Because of this problem, a number of search strategies have been introduced including deterministic heuristics [4], genetic algorithms (GAs) [5], and particle-swarm optimization (PSO) [6]. Salman et al. [7] show that PSO outperforms GA with respect to convergence rate and score of a network. In this paper, we focus on PSO. PSO is a method for the optimization of multidimensional continuous nonlinear functions. Each potential solution to the function is represented by a position in a predefined search space. PSO explores the search space using position and velocity equations based on previous knowledge and randomness. These equations give the swarm a diverse response. Sahin et al. [2] and Yavuz et al. [8] combine BN learning with PSO for fault diagnosis. Using a master–slave topology, the slave calculates how well a potential BN solution matches the data set using a fitness function, and the master then moves the position of the solution within the search space. Calculating the fitness of a potential BN solution requires a significant amount of time for processing in relation to the rest of the BN learning algorithm. Hardware programming in field-programmable gate arrays (FPGAs) can dramatically improve the speed of an implementation by taking advantage of processes that can be computed in parallel. FPGAs are a type of reconfigurable hardware that can achieve better performance than software while allowing more flexibility than hardware [9]. A FPGA can be re-programmed multiple times by altering configuration bits. This allows a significant amount of flexibility to support many applications [10] and [11]. There has been prior work, in particular, in improving BN learning [12] and PSO [13] performance on FPGAs. Pournara et al. [12] show that the performance of BN learning can be increased significantly, but the size of the network learned is constrained by the size of the FPGA used. Asadi et al. [14] and [15] present a reconfigurable system for BN learning that achieves four orders of magnitude speedup over software for real-world networks. Their implementation relies on a pre-processing step that uses supervised learning to greatly reduce the search space. However, Linderman et al. [16] present an implementation using both FPGAs and graphics-processing units (GPUs) that achieves comparable performance to the FPGA system of [14] at a fraction of the cost. Reynolds et al. [13] use two FPGAs to decrease the time it takes a PSO algorithm to complete a set number of iterations, where one FPGA is used for position and velocity equations to update the swarm, and the other calculates the fitness. Evaluating the update equations of the swarm takes little time compared to evaluating the fitness function. Reynolds et al. [13] state that using chips that are not identical could greatly increase the speed of execution of the PSO even further by giving the fitness function more resources and better hardware than the update equations. In this present work, a hardware design of BN learning using PSO as the search algorithm is implemented on a FPGA. The starting point for the FPGA-based implementation is the software implementation of Sahin et al. [2] and [8]. Sahin et al. take advantage of the parallel nature of PSO by using a master–slave topology in which the slaves work independently of each other and perform the fitness computations necessary to determine how well a BN fits a given data set. Each slave is considered a particle in the swarm. The master makes the decision to keep or reject each returned score and stores the best score of each slave. The particle swarm update equations are evaluated, and the new positions are sent to each slave. This master–slave topology is retained in the hardware design. It takes a significant amount of time to process each network and calculate the fitness so that the master can make a decision and start the next iteration. For this reason, the slave is first placed into hardware, and for completeness, the master is also implemented. The rest of this paper is organized as follows. Section 2 gives a brief description of BNs and BN learning. Section 3 describes PSO. The baseline software implementation is described in Section 4. Section 5 describes the hardware design and implementation on a FPGA. Section 6 presents the test case, methodology, and results. Section 7 presents our conclusions.
نتیجه گیری انگلیسی
The translation of software to hardware using a FPGA can speed up many applications. In this case, PSO with BN learning is developed on a FPGA and compared to a software implementation run on a GNU/Linux cluster consisting of 24 nodes. The hardware adopts a master–slave topology similar to that of the software implementation. However, because of limited hardware resources, only one slave along with one master can be synthesized. For this reason, the fitness computations for each particle’s position are computed in serial using a single slave. Floating-point units are used for fitness and velocity computations. The use of floating-point significantly decreases the clock rate of the design. To speedup those parts of the design where floating-point computations are not needed, separate clock domains are created and implemented. When comparing the FPGA implementation to the software version run on the GNU/Linux cluster, a 2.59 times improvement in fitness evaluations per second per slave is observed. This result is promising and shows the potential impact a FPGA can have on this problem. Future work on this project will consist of implementing various methods to increase parallel processing and speed up the execution of the PSO swarm, including pipelining the floating-point functional units. By not pipelining the functional units, the clock speed is decreased significantly for the entire design and the counting process becomes a much larger bottleneck. An alternative method of replacing the floating-point units with fixed-point units could be implemented to increase the speed of execution, as well. While this would decrease the overhead cost in execution time for smaller data sets, the counting process needs to be modified for faster execution with large data sets. Ultimately, optimizing the way in which the data are handled will benefit both cases. Future work will also consist of making the design more robust by eliminating the limitation of having the training data stored in advance in BRAMs and allowing the hardware to obtain information, such as the number of nodes and maximum states, by reading in the training data from an external source. Once the data are read, the hardware could reconfigure itself to learn a BN from those data. The execution time could potentially be reduced further by parallelizing the slaves onto separate FPGAs or using a FPGA with a large enough area to fit multiple slaves such that more than one fitness can be computed at once. For every slave added in parallel, the execution time would potentially be reduced by a factor equal to the number of slaves, although this is usually difficult to achieve in practice. A network of smaller, more affordable FPGAs could be constructed and tailored for PSO where one small FPGA is assigned as the master and a network of larger FPGAs are used as slaves. This could be more cost-effective than using a cluster of GNU/Linux nodes and allow for fine-grain parallelism within each FPGA as well as coarse-grain parallelism across multiple FPGAs, resulting in a much faster, and more cost effective, BN learning process using PSO.