Enhancing data parallelism for Ant Colony Optimization on the Graphics Processing Unit
|Article ID||Publication year||English paper||Persian translation||Word count|
|7844||2013||10-page PDF||available to order||7120 words|
Publisher: Elsevier - Science Direct
Journal : Journal of Parallel and Distributed Computing, Volume 73, Issue 1, January 2013, Pages 42–51
Graphics Processing Units (GPUs) have evolved into highly parallel and fully programmable architectures over the past five years, and the advent of CUDA has facilitated their application to many real-world problems. In this paper, we deal with a GPU implementation of Ant Colony Optimization (ACO), a population-based optimization method which comprises two major stages: tour construction and pheromone update. Because of its inherently parallel nature, ACO is well suited to GPU implementation, but it also poses significant challenges due to irregular memory access patterns. Our contribution within this context is threefold: (1) a data-parallelism scheme for tour construction tailored to GPUs, (2) novel GPU programming strategies for the pheromone update stage, and (3) a new mechanism called I-Roulette to replicate the classic roulette wheel while improving GPU parallelism. Our implementation leads to factor gains exceeding 20x for each of the two stages of the ACO algorithm as applied to the TSP when compared to its sequential counterpart running on a similar single-threaded high-end CPU. Moreover, an extensive discussion of different implementation paths on GPUs shows how to handle connected graph components in parallel. This, in turn, suggests a broader area of inquiry, where algorithm designers may learn to adapt similar optimization methods to GPU architectures.
Ant Colony Optimization (ACO) is a population-based search method inspired by the behavior of real ants. It may be applied to a wide range of problems, many of which are graph-theoretic in nature. It was first applied to the Traveling Salesman Problem (TSP) by Dorigo et al. In essence, simulated ants construct solutions to the TSP in the form of tours. The artificial ants are simple agents which construct tours in a parallel, probabilistic fashion. They are guided in this task by simulated pheromone trails and heuristic information. Pheromone trails are a fundamental component of the algorithm, since they facilitate indirect communication between agents via their environment, a process known as stigmergy. For additional details about these processes, we refer the reader to the literature.

ACO algorithms are population-based; that is, a collection of agents "collaborates" to find an optimal (or at least satisfactory) solution. Such approaches are suited to parallel processing, but their success strongly depends on the nature of the particular problem and the underlying hardware available. Several parallelization strategies have been proposed for the ACO algorithm on shared- and distributed-memory architectures.

The Graphics Processing Unit (GPU) is a topic of significant interest in high-performance computing. For applications with abundant parallelism, GPUs deliver higher peak computational throughput than latency-oriented CPUs, thus offering a tremendous potential performance uplift on massively parallel problems. Of particular relevance to us are attempts to parallelize the ACO algorithm on GPUs. Until now, these approaches have focused on accelerating the tour construction step performed by each ant through task-based parallelism, with pheromone deposition left on the CPU.
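The probabilistic, pheromone-guided tour construction described above can be sketched sequentially. The sketch below follows the standard ACO formulation, where the attractiveness of an edge is `tau^alpha * eta^beta` with `eta = 1/distance`; all names and the default weights are illustrative, not taken from the paper:

```python
import random

def construct_tour(dist, tau, alpha=1.0, beta=2.0, rng=random):
    """Build one ant's TSP tour with classic roulette-wheel selection.

    dist : symmetric distance matrix (list of lists)
    tau  : pheromone matrix of the same shape
    """
    n = len(dist)
    start = rng.randrange(n)
    tour = [start]
    visited = {start}
    while len(tour) < n:
        i = tour[-1]
        # Attractiveness of each unvisited city: tau^alpha * (1/dist)^beta.
        cities = [j for j in range(n) if j not in visited]
        weights = [(tau[i][j] ** alpha) * ((1.0 / dist[i][j]) ** beta)
                   for j in cities]
        # Roulette wheel: sample a city proportionally to its weight by
        # walking the cumulative sum until it passes a random threshold.
        total = sum(weights)
        r = rng.random() * total
        acc = 0.0
        for j, w in zip(cities, weights):
            acc += w
            if acc >= r:
                tour.append(j)
                visited.add(j)
                break
    return tour
```

The cumulative-sum walk is what makes the classic wheel awkward on a GPU: it is inherently sequential per selection, which is part of what motivates the I-Roulette alternative the paper proposes.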
In this paper, we present the first fully developed ACO algorithm for the Traveling Salesman Problem (TSP) on GPUs, where both stages are parallelized: tour construction and pheromone update. A data-parallelism approach, which is better suited to the GPU parallelism model, is used to enhance performance in the first stage, and several GPU design patterns are evaluated for the parallelization of the second stage. Our major contributions include the following:

1. To the best of our knowledge, this is the first data-parallelism scheme on GPUs for the ACO tour construction stage. Our design proposes two different types of virtual ants: queen ants (associated with CUDA thread blocks) and worker ants (associated with CUDA threads).
2. We introduce I-Roulette (Independent Roulette), a method that replicates classic roulette-wheel selection while improving GPU parallelism.
3. We discuss the implementation of the pheromone update stage on GPUs, using either atomic operations or other GPU alternatives.
4. We offer an in-depth analysis of both stages of the ACO algorithm for different instances of the TSP. Several GPU parameters are tuned to reach speed-up factors of up to 21× for the tour construction stage and 20× for the pheromone update stage.
5. The solution accuracy obtained by our GPU algorithms is compared to that of the reference sequential code, extended using TSPLIB.

The rest of the paper is organized as follows. We briefly introduce Ant Colony Optimization for the TSP and NVIDIA's Compute Unified Device Architecture (CUDA) in Section 2. In Section 3 we present GPU designs for the main stages of the ACO algorithm. Our experimental methodology is outlined in Section 4, before we describe the performance evaluation of our algorithm in Section 5. Other parallelization strategies for the ACO algorithm are described in Section 6, before we summarize our findings and conclude with suggestions for future work.
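The abstract does not spell out how I-Roulette works internally, but the name "Independent Roulette" and the parallelism claim suggest the following shape: each candidate city independently draws a uniform random number, scales it by that city's attractiveness, and the largest product wins. On a GPU this maps to one thread per candidate plus a max-reduction, with no sequential cumulative sum. The sketch below is our reading, not the authors' code:

```python
import random

def i_roulette_next(weights, rng=random):
    """Pick the next city I-Roulette style: every candidate draws an
    independent uniform random number scaled by its weight, and the
    candidate with the largest product is selected.

    weights : {city: attractiveness} for the unvisited cities only
    """
    best_city, best_score = None, float("-inf")
    for city, w in weights.items():
        # Each draw is independent, so on a GPU all candidates can be
        # scored in parallel and combined with a max-reduction.
        score = rng.random() * w
        if score > best_score:
            best_city, best_score = city, score
    return best_city
```

Note that the selection distribution this induces is not identical to the classic wheel's, which is presumably one reason the authors validate solution quality against the sequential reference.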
Conclusion
Ant Colony Optimization (ACO) belongs to the family of population-based metaheuristics and has been successfully applied to many NP-complete problems. We have demonstrated that the task parallelism used by previous implementation efforts does not fit well on GPU architectures, and, to overcome this issue, we provide an alternative approach based on CUDA and data parallelism. This approach enhances GPU performance by increasing parallelism and avoiding warp divergence, and, combined with an alternative selection procedure better suited to GPUs, leads to performance gains of more than 20× compared to sequential CPU code executed on a multicore machine. For the pheromone update stage, we are the first to present a complete GPU implementation, including a set of strategies oriented toward avoiding atomic instructions. We identify potential tradeoffs and investigate several alternatives to sustain gains over 20× in this stage as well. An extensive validation process is also carried out to guarantee the quality of the solution provided by the GPU, and accuracy issues concerning floating-point precision are investigated. ACO on GPUs is still at a relatively early stage, and we acknowledge that we have tested a relatively simple variant of the algorithm. But, with many other types of ACO algorithm still to be explored, this field seems to offer a promising and potentially fruitful area of research. On the hardware side, even higher accelerations can be expected as problem sizes keep growing and larger device memory becomes available. Moreover, we anticipate that the benefits of our approach will increase on future GPU generations endowed with thousands of cores, possibly grouped into GPU clusters, where parallelism is called to play a decisive role in lifting performance to unprecedented levels.
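To make the atomic-instruction issue mentioned above concrete, here is a minimal sequential sketch of the standard Ant System pheromone update (evaporation, then a deposit of `q / tour_length` on every edge of every tour; parameter names and defaults are illustrative). The deposit loop is naturally a scatter, many ants writing to the same matrix cell, which is exactly where a naive GPU port needs atomic additions:

```python
def update_pheromone(tau, tours, tour_lengths, rho=0.5, q=1.0):
    """Standard Ant System pheromone update on a symmetric TSP instance.

    tau          : pheromone matrix (list of lists), modified in place
    tours        : list of closed tours, each a list of city indices
    tour_lengths : total length of each corresponding tour
    """
    n = len(tau)
    # Evaporation: embarrassingly parallel (one thread per cell on a GPU).
    for i in range(n):
        for j in range(n):
            tau[i][j] *= (1.0 - rho)
    # Deposit, written as a scatter: several ants may hit the same edge,
    # so a direct GPU port of this loop would need atomic additions.
    for tour, length in zip(tours, tour_lengths):
        deposit = q / length
        for a, b in zip(tour, tour[1:] + tour[:1]):  # edges of the closed tour
            tau[a][b] += deposit
            tau[b][a] += deposit  # symmetric instance
    return tau
```

One atomic-free alternative, consistent with the strategies the conclusion alludes to, is to invert this into a gather: one thread per matrix cell sums the contributions of all ants whose tours use that edge, trading redundant reads for contention-free writes.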