بهره برداری چند هدفه از خط لوله موازی با استفاده از خوشه بندی ، تکرار و تقلید در سیستم های چند هسته ای جاسازی شده
کد مقاله | سال انتشار | تعداد صفحات مقاله انگلیسی |
---|---|---|
20478 | 2013 | 12 صفحه PDF |
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Journal of Systems Architecture, Volume 59, Issue 10, Part C, November 2013, Pages 1083–1094
چکیده انگلیسی
With the popularity of mobile device, people require more computing power to run emerging applications. However, the increase in power consumption is a major problem because power is quite limited in embedded systems. Our goal is to consider power consumption along with latency and throughput. We proposed a heuristic algorithm, called Parallel Pipeline Latency Optimization for high performance embedded systems (PaPiLO), based on clustering, replication and duplication, to minimize latency under power and throughput constraints. Experimental results show our method can get 15% latency reduction and 10% improvements for random task graphs and MPEG-2 decoder, respectively.
مقدمه انگلیسی
Nowadays, embedded systems have become ubiquitous in our daily life. Since embedded processors need more computing power to process emerging applications, including network, digital signal processing, and multimedia, embedded multicore processors were introduced. For example, Freescale 8641D (dual-core), ARM Cortex-A9 MPCore (supporting 2 to 4 cores), Plurality’s Hypercore processor (capable of supporting 16 to 256 cores), and Tilera’s TilePro64 processor (64 cores) are state-of-the-art processors [2]. It is believed that embedded multicore systems will become a mainstream in the near future. A recent example is that the launch of the first quad-core smartphone was officially announced. From the programmers’ perspective, parallel programming plays a key role in the multicore era. That is, using the appropriate parallel design patterns for the target application can affect the performance of multicore processors. Three common parallel programming paradigms investigated in previous studies include data parallelism, task parallelism, and pipeline parallelism [3], [4], [5] and [6]. Data parallelism takes a given data set and applies the same task to data items in parallel. Task parallelism takes a sharing data set and applies different tasks to each data item of the data set in parallel. Since it is hard to exploit only pure data parallelism or pure task parallelism, pipeline parallelism is considered as an emerging paradigm which uses data and task parallelism in a specific data flow pattern. Pipelined workflows can be classified to single-path and multi-path pipelined workflows. As shown in Fig. 1, a multi-path pipelined workflow can be considered as a combination of several single-path pipelined workflows. Both of them illustrate the task dependence graphs, each node of which is represented as a stage. Compared to the single-path pipelined workflows, the multi-path pipelined workflows can exploit task parallelism naturally. However, multi-path pipelined workflows are more sophisticated to tackle than single-path ones since there are a larger combinatorial space to explore while searching for feasible and efficient solutions. In this paper, we assume our target applications as multi-path pipelined workflows. Full-size image (20 K) Fig. 1. The single-path and multi-path pipelined workflows. Figure options To measure performance of a given pipelined workflow, latency and throughput are typical performance related metrics. Latency is the maximum time a data item spends in the pipelined workflow, and throughput is the number of data items processed per time unit. To process data items in parallel, two basic approaches are commonly used. One approach uses data parallelism to process one single data item in parallel, that is, a data item is cut into small pieces of data, each of which is processed by the corresponding core. The other approach uses task and pipeline parallelism to process multiple data items in parallel. For task parallelism, the same data are processed by inter-independent tasks concurrently. As to pipeline parallelism, different segments of data are processed concurrently in multiple stages which are multiple tasks. The first approach can achieve minimal latency, while the second approach can increase the throughput but worsen the latency. Tradeoff between latency and throughput can be explored by using parallel loop (data parallelism) to reduce latency, parallel tasks (task parallelism) to increase throughput, and parallel pipeline (pipeline parallelism) to hide latency [7]. In addition, the increment of power consumption is also a significant problem, especially for embedded systems. In this paper, we investigate not only the performance optimizations but also the influence of power consumption on the above mentioned three types of parallelism for multi-path pipelined workflows. In order to meet performance and power consumption requirements for a given pipelined application, we exploit three parallel design patterns: clustering, replication, and duplication. Clustering can avoid heavy communication costs by mapping two or more communicating tasks to the same core. Since clustering does not require extra cores to process tasks in parallel, it can reduce the power consumption; however, it reduces the throughput and increases the latency due to the reduced task parallelism. Fig. 2 illustrates a motivating example, the core allocation, and the values of latency, throughput, and power consumption, without using any patterns. Fig. 3, Fig. 4 and Fig. 5 are examples using clustering, replication, and duplication, respectively. We have an application which can be represented as a pipelined workflow, two data items to be processed, and three cores. This application has three stages, namely t1,t2t1,t2 and t3t3, whose execution times are 12, 5 and 6 units of time, respectively, and has three communication links between each pair of consecutive stages, namely d1,2,d1,3d1,2,d1,3 and d2,3d2,3, whose communication times are 10, 7 and 3 units of time, respectively. To estimate the power consumption, we use a metric, denoted as α, which is defined as the ratio of the power consumed for inter-core communication and the power consumed for computation on a core. The value of α depends on the target platform. Here we assume the value of α is 0.5 for calculating the power consumption in the motivating examples. Full-size image (24 K) Fig. 2. A pipelined workflow example without using clustering, replication, and duplication. Figure options Full-size image (25 K) Fig. 3. A pipelined workflow example using clustering. Figure options Full-size image (26 K) Fig. 4. A pipelined workflow example using replication. Figure options Full-size image (24 K) Fig. 5. A pipelined workflow example using duplication. Figure options The core dispatch strategy of tasks is assume to allocate each cluster on an independent core. Therefore, as shown in Fig. 3, t1t1 is clustered with t2t2 to avoid the large communication cost in d1,2d1,2 so that it reduces the latency. Replication can be used to increase throughput by adding replicas of some tasks. Since throughput is limited by data transfer rate, tasks could be replicated to process the next data items if there are other available cores, and this would double the aggregate transfer rate. Take Fig. 4 for example, we replicate the cluster of t1t1 and t2t2 since there are an extra core C2C2. Different from replication, duplication can alleviate communication costs by allocating the copies of tasks to extra cores. However, a task and its duplicates receive and execute the same input data item. Fig. 5 shows an example that t1t1 is duplicated so that we can hide the communication cost of d1,3d1,3. Compared to the examples using clustering and replication, we get better latency while using duplication in this situation. However, although duplication can reduce latency, it may worsen the power consumption due to the execution of the redundant copies of tasks on extra cores. From the results, we can observe that compared to the original schedule in Fig. 2, clustering can lower the power consumption. Replication can improve the throughput but worsen the latency. Duplication can minimize the latency but consumes more power. Since these three parallel design patterns have their own advantages and disadvantages according to the specific metrics, it is critical to appropriately exploit them to achieve a multi-objective approach. As illustrated in the previous section, it is unfortunate that the goals of high performance and low power consumption are conflicting. Therefore, it is impossible for one single parallel design pattern to achieve both high performance and low power consumption. In fact, throughput and/or latency can be optimized by using more energy consumption to increase the frequency of cores, however, power consumption is quite limited for embedded systems. We advocate to find a multi-objective pipelined workflow schedule under a given power budget. In this paper, we target at optimizing three contradictory objectives for a given pipelined workflow application, namely latency, throughput, and power consumption. According to priority, power consumption is our primary concern, and throughput is the secondary concern. Without violating the primary and secondary constraints, we minimize the latency. Consequently, we make the following contributions in this paper: • We prioritize the objectives based on embedded multicore systems. • We adopt the parallel design patterns which are common in designing multicore applications for a programmer. • We propose a scheduling algorithm of minimizing latency under power and throughput constraints for applications in embedded multicore systems. The rest of the paper is organized as follows. In Section 2, we introduce related work and the current state-of-the-art in pipelined workflow scheduling. In Section 3, the task model and the system model are defined. Some assumptions are also given here. Section 4 is the core of this paper, we introduce our proposed heuristic method to achieve the multi-objective approach using clustering, replication, and duplication. Section 6 gives the experimental results. Finally, we conclude the paper and direct future work in Section 7.
نتیجه گیری انگلیسی
In this work, we proposed a heuristic method, called Parallel Pipeline Latency Optimization for high performance embedded systems (PaPiLO), to achieve a multi-objective approach for a pipelined workflow. By exploiting three basic parallel design patterns, namely clustering, replication, and duplication, we try to optimize the latency under the predetermined power and throughput constraints. Compared to two previous works, namely Workflow Mapping and Scheduling Heuristic (WMSH) and Throughput Constrained Latency Optimization (TCLO), and the modified version of TCLO, called Throughput Constrained Latency Optimization, with Power Constraint Consideration (TCLO-p), experimental results showed that our PaPiLO method can get 61% more feasible solutions than the TCLO method, and get around 15% latency reduction against the WMSH method, for a set of 179 random task graphs. Moreover, we defined a metric called Quality to validate our proposed method, and experimental results also showed that the PaPiLO method always exhibits the highest quality. In addition, we presented a real-world example, MPEG-2 decoder, and showed that PaPiLO can averagely get around 10% latency reduction against WMSH, TCLO, and TCLO-p. However, since we only consider the communication-aware power reduction in our proposed PaPiLO, other adaptive computing capabilities of multicore processors would provide a greater chance to satisfy stricter power requirements. For example, the Dynamic Voltage and Frequency Scaling (DVFS) technology is a good power/energy saving solution. In addition, with the