تحقیقات عملکرد الگوریتم های ژنتیکی بر روی کارت های گرافیکی
|کد مقاله||سال انتشار||مقاله انگلیسی||ترجمه فارسی||تعداد کلمات|
|8150||2013||15 صفحه PDF||سفارش دهید||10300 کلمه|
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Swarm and Evolutionary Computation, Available online 15 April 2013
Genetic algorithms are one of the most adaptable optimization algorithms. Due to their inherent parallelism they seem well suited for the execution on massively parallel hardware such as graphics processing units. In this paper we put this claim to the test by performing comprehensive experiments. We try to find out how well graphics processing units are suited for the task and what parts of genetic algorithms should be executed on them. We focus especially on the new Fermi generation of Nvidia graphics chips. While it is imperative the fitness function be effectively parallelizable on the GPU, because it is the most computational expensive task of the algorithm, results indicate that if this is the case, speedups of several orders of magnitude are possible compared to conventional multi-core CPUs. Our findings also suggest that, starting with the Fermi architecture, all parts of a genetic algorithm should be carried out on the graphics card instead of only part of it.
Evolutionary algorithms (EAs), especially genetic algorithms (GAs)  and , are a popular form of metaheuristic optimization algorithms. GAs maintain a population of individuals, also called candidate solutions. The internal representation of such a candidate solution is called a chromosome or genotype. The most popular representations are in form of strings of floats, integers or bits (the so-called genes). A fitness function is used to assign each individual a fitness value, indicating the quality of a certain candidate solution. In general, the algorithm repeats the following steps iteratively until a termination condition, for example reaching a fitness threshold, is met: (i) Based on their quality, parent individuals are chosen from the population. (ii) During variation, the crossover operator derives a number of offspring from the parent population; then, the resulting offspring undergo mutation, which introduces random change into their genotypes. (iii) The newly created offspring are evaluated using the fitness function and, depending on the survivor selection strategy, replace less promising individuals to make up the next generation. Although GAs are very efficient for some kind of optimization problems, they can still be resource-demanding for large problems, evoking the need for efficient parallelization. A vast number of different parallelization strategies can be found in the literature  and , the most intuitive of which is the master–worker model: a master instance runs a GA and manages the population as a whole. Whenever a (parallelizable) computational expensive task arises, the master divides this task and distributes the fractions to the workers. It could come to mind to parallelize every stage in the evolutionary cycle because most tasks can be done in parallel: • Because solutions in the initial population are seeded independently from each other, the task can be performed in parallel. • Each fitness evaluation takes only one individual into account, thus the fitness of the entire population can be computed in parallel. • Since most selection schemes work probabilistically and perform random draws from the population, and even encourage multiple draws of the same individual, the draws can be done in parallel. • Variation operators (crossover and mutation) work on a finite set of genotypes (two in the case of two-parent-crossover, one in the case of mutation) and there exist no dependencies between them, so they can be performed in parallel. While the first implementations of parallel GAs were designed for multi-core CPUs, clusters, and grids (e.g. ParadisEO ), with today's graphics processing unit (GPU) there exists another parallel architecture, offering a high degree of parallelism and huge amounts of computational power. As a consequence, it seems natural to employ GPUs for parallel GAs. In 2005, Wong et al. deduced that GAs are only partly suited for GPUs . This was concluded because the crossover operator examined did not operate on each gene independently, but on the whole chromosome, which could not be implemented efficiently on early GPU architectures. In 2009, Maitre et al. decided to take another look at running GAs in the master–worker model on the GPU . They found that a master–worker scheme, in which the GPU only evaluated the fitness function, had never been studied, perhaps due to the overhead of data transfer between CPU and GPU required by such a model. Their results, using benchmark functions and a real-world problem of material science, yielded a speedup of around 100 for their master–worker algorithm. They extended EASEA (EAsy Specification for Evolutionary Algorithms) , a language designed specifically to help non-expert programmers to make use of EAs, to produce code which can be directly compiled using the CUDA architecture. GPU-realizations of other GA parallelization schemes than the master–worker scheme can be found in the literature ,  and . Most of the work up to now (e.g. ,  and ) focuses on implementing EAs on older GPUs; even using the previous generations' graphics hardware, some claim to achieve speedups higher than 1000 when comparing their GPU implementations to normal CPUs. Speedups of this magnitude can only be achieved by comparing optimized GPU code to poor CPU implementations. By examination of the existing work it is revealed that all use sequential implementations, which were in no way optimized to take advantage of modern multi-core CPUs (i.e. no multi-threading or vectorization). The main contribution of our work consists firstly in an evaluation of the most recent Fermi GPU architecture and secondly in a fair comparison of this new architecture to recent multi-core CPUs by using an optimized CPU implementation. In this work we present comprehensive experimental results to obtain answers to the following questions: • Are GPUs suited for the master–worker scheme? • How can GAs benefit from the current Fermi architecture of Nvidia GPUs in comparison to the previous G200 architecture? • What speedup can be gained with the GPU in comparison to today's high-end multi-core CPUs? • Which parts of the algorithm should be run on the GPU and which on the CPU? For this purpose we ran experiments for two test problems. The first one is a mathematical optimization problem often used for benchmarking optimization algorithms: the minimization of the Weierstrass function. The second one is a combinatorial problem which has many real-world applications: the traveling salesman problem (TSP). We compare the performance of the individual parts of GAs for these two problems on the following architectures: • AMD Opteron 2435 CPU using only one core. • AMD Opteron 2435 CPU using all six cores. • Intel Xeon X5650 CPU using only one core. • Intel Xeon X5650 CPU using all six cores. • Nvidia G200 GPU. • Nvidia Fermi GPU. This paper is organized as follows. The next section briefly describes the G200 and Fermi architectures of GPUs. In 3 and 4, the experiments for the Weierstrass function and the TSP are explained and their results are presented respectively. The last section gives a conclusion of the work.
نتیجه گیری انگلیسی
The ultimate goal of this work was to examine genetic algorithms for their suitability to be run on today's GPUs. In all of our use cases, the operators associated with evolutionary search, especially the standard operators, showed promising results, when parallelized on the GPU, which is why we can make the general recommendation to run genetic algorithms on the GPU. In particular the improvements of the Fermi architecture compared to older architectures affect the performance of parallel genetic algorithms very well: while for some operators (e.g. seeding, selection) the old G200 architecture was slower than the CPU, the new Fermi chipset was at least able to match or in most cases even best today's multi-core CPUs. Population seeding, parent selection, fitness evaluation, and mutation, all operate on a single individual. Since there are no dependencies between individuals in these operators, the tasks can be considered embarrassingly parallel. They can be easily implemented on the GPU architecture by using an individual-to-thread-mapping. The same applies to the crossover operator, the only difference being that it handles two individuals per instance. We have also shown that operations pertaining to population management can easily be implemented using the Thrust library. Thus, we consider running all parts of the algorithm on the GPU as advisable. However, we can only give a general recommendation, because, each genetic algorithm has to embed operators that are problem-dependent. We have seen in the real-world application, that the final success of parallelization is highly dependent on these operators: there was no way to find an appropriate solution for the divergence problem in the TSP greedy 2opt operator, thereby rendering the algorithm unfit for the GPU. Thus, we conclude with the following two statements: (i) the population-based approach of evolutionary algorithms makes them very easy to parallelize, and the parallelization is generally very effective; however (ii) before implementing an algorithm, the fitness function and operators embedding expert knowledge should come under thorough scrutiny. This is necessary to determine whether there is a way to map them effectively onto the GPU architecture to obtain good results, because they make up most of the algorithm runtime.