System-level performance analysis of multiprocessor system-on-chips by combining analytical model and execution time variation
|Paper code||Publication year||English paper||Persian translation||Word count|
|28487||2014||13-page PDF||Available to order||10001 words|
Publisher : Elsevier - Science Direct
Journal : Microprocessors and Microsystems, Volume 38, Issue 3, May 2014, Pages 233–245
As the impact of the communication architecture on performance grows in a Multiprocessor System-on-Chip (MPSoC) design, the need for performance analysis in the early stage in order to consider various communication architectures is also increasing. While a simulation is commonly performed for performance evaluation of an MPSoC, it often suffers from a lengthy run time as well as poor performance coverage due to limited input stimuli or their ad hoc applications. In this paper, we propose a novel system-level performance analysis method to estimate the performance distribution of an MPSoC. Our approach consists of two techniques: (1) analytical model of on-chip crossbar-based communication architectures and (2) enumeration of task-level execution time variations for a target application. The execution time variation of tasks is efficiently captured by a memory access workload model. Thus, the proposed approach leads to better performance coverage for an MPSoC application in a reasonable computation time than the simulation-based approach. The experimental results validate the accuracy, efficiency, and practical usage of the proposed approach.
With the ever-increasing complexity of embedded applications, system complexity is also growing, with an increasing number of processing elements on a single chip. Such a chip, which uses multiple processors, is called a Multiprocessor System-on-Chip (MPSoC). As a result, more communication requirements are imposed on on-chip networks, which, in turn, significantly affect the performance of an MPSoC. To cope with this complexity, designers need to perform a system-level performance analysis in the early stages to explore various design choices before system realization. Even though simulation-based approaches are popular for estimating on-chip communication performance, they often suffer from lengthy run times as well as poor performance coverage owing to limited input stimuli or their ad hoc application. Therefore, recent research on the performance analysis of embedded systems has focused on analytic or semi-formal methods to estimate the worst-case execution time (WCET). In particular, in MPSoC designs, the arbitration policy is a key parameter affecting the performance over various interconnection networks such as a bus, crossbar, or Network-on-Chip (NoC). Fixed priority arbitration is still a popular choice even though it may cause a starvation problem. Related works that have addressed the performance analysis problem on bounded or unbounded arbitration protocols usually focus on the worst-case delay for transferring a network packet or a task-level event stream. Thus, the use of such approaches for a bus transaction-level analysis may result in a severe overestimation, because every bus access is assumed to undergo the worst-case arbitration delay. As a result, they are unsuitable for soft real-time system design, where the average performance is a primary concern.
This paper proposes a system-level method to estimate the average performance distribution of an MPSoC with a bus matrix (also known as a crossbar switch)-based communication architecture deploying fixed priority arbitration. A bus matrix provides high throughput while preserving the simplicity of a shared bus abstraction. It is now widely accepted as the industrial de facto standard for on-chip communication in chip multiprocessors as well as in MPSoCs. A bus matrix has multiple master and slave ports that are connected via multiple internal buses. When any master port can be connected to any slave port, the bus matrix is usually referred to as a fully connected matrix. It allows accesses to different slaves to proceed in parallel; this results in higher performance than conventional shared bus architectures. However, there is a scalability issue with regard to the number of master and slave ports in a single bus matrix. One way to resolve the scalability problem is to partially connect the master and slave ports, thus avoiding resource wastage due to unused bus connections.
Packet-switched Network-on-Chip (NoC) architectures are becoming popular as a backbone communication infrastructure that connects processing subsystems and other hardware devices. They combine locally synchronous subsystems with a packet-switched network to build a globally asynchronous system. They may have various topologies such as mesh, ring, and tree, and the regularity of NoC can improve design productivity. The NoC architecture typically allows for a higher clock rate and provides higher communication bandwidth than the bus matrix architecture. However, it has the following disadvantages compared to the bus matrix architecture. First, it incurs the non-negligible overhead of additional buffers and control logic for converting memory transactions to packets, because many IPs are still provided with on-chip bus standards. Second, packetization/depacketization incurs additional delay. Finally, the latency for delivering packets to a destination over NoC is often more unpredictable than in bus matrix-based architectures because of the complicated transaction protocol of NoC. Hence, the bus matrix architecture is a viable on-chip interconnection scheme for systems with processors on the order of several tens. For large-scale systems with hundreds of processors, a typical on-chip interconnection is a combination of a bus matrix for each subsystem and NoC for backbone communication.
Given a target application and the underlying communication architecture, the proposed technique finds, by corner-case analysis, a wider range of performance distribution than a simulation-based approach, and in significantly less time. The proposed technique consists of two key parts: first, building an analytical model of the target system’s dynamic behavior, and second, systematically exploring, based on the model, as wide a range of performance variation as possible within an affordable computation time. The proposed analytical model of a bus matrix architecture is based on queuing theory and statistics of the memory access behavior of tasks. It is then integrated into a unified framework to enumerate the task-level execution time variation of a target application and thereby estimate the performance distribution with the underlying communication architecture. Because the execution time variation of tasks constitutes a huge search space of execution paths, we propose a scheme to reduce the search space by selecting a representative set of execution times for each task. In this scheme, the execution time of a task is defined by the memory access count and the access request interval. Experimental results validate the proposed technique. First, our analytical model robustly and accurately predicts the execution time of a target application on various bus matrix architectures.
In comparison with the simulation-based approach, the time taken for our analysis is an order of magnitude shorter. Furthermore, the estimated performance is 95% accurate on average. Second, the proposed technique finds a wider performance range than the simulation-based approach, with an analysis time that is shorter by orders of magnitude. Experiments over various bus matrix architectures show that the performance ranges obtained by the simulation-based approach lie within the range obtained by the proposed technique. The performance range gap between the two approaches is about 21% on average in terms of the worst/best execution time, which is an acceptable overestimation for practical use. However, it is worth noting that the proposed technique does not guarantee the worst-case performance. In the next section, we review related work and state our contributions. An overview of the proposed analysis framework is presented in Section 3. Section 4 explains the analytical model of on-chip communication architectures using queuing theory. Then, in Section 5, the system-level performance analysis technique based on the analytical model is introduced. Experimental results on the accuracy and efficiency of the proposed approach are provided in Section 6. Finally, Section 7 presents the conclusions and addresses future work.
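To give a feel for how queuing theory can yield an average arbitration delay under fixed priority arbitration, the following is a minimal sketch and not the model developed in the paper. It uses the textbook mean waiting time of a non-preemptive priority M/G/1 queue (Cobham's formula) for a single slave port, and it assumes Poisson request arrivals and a fixed transfer length per master. The names `Master`, `request_rate`, `service_cycles`, and `mean_waiting_cycles` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Master:
    name: str
    request_rate: float    # assumed: requests per cycle (1 / mean request interval)
    service_cycles: float  # assumed: fixed number of cycles per bus transfer


def mean_waiting_cycles(masters):
    """Mean arbitration wait per master at one slave port.

    `masters` must be ordered from highest to lowest priority.
    Textbook non-preemptive priority M/G/1 result:
        W_k = W0 / ((1 - sigma_{k-1}) * (1 - sigma_k))
    where W0 is the mean residual service time and sigma_k is the
    cumulative utilization of priorities 1..k.
    """
    # Mean residual service; for a fixed service time S, E[S^2] = S^2.
    w0 = sum(m.request_rate * m.service_cycles ** 2 for m in masters) / 2.0
    waits = {}
    sigma_prev = 0.0
    for m in masters:
        sigma = sigma_prev + m.request_rate * m.service_cycles
        if sigma >= 1.0:
            raise ValueError(f"slave port saturated at priority of {m.name}")
        waits[m.name] = w0 / ((1.0 - sigma_prev) * (1.0 - sigma))
        sigma_prev = sigma
    return waits


if __name__ == "__main__":
    port = [Master("cpu0", 0.02, 10), Master("dma", 0.01, 20)]
    for name, wait in mean_waiting_cycles(port).items():
        print(f"{name}: {wait:.2f} cycles")
```

Because the result is a mean rather than a bound, a sketch like this suits the average-performance setting discussed above; it says nothing about the worst-case arbitration delay.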
Conclusion
In this paper, we presented a novel system-level performance analysis technique for estimating the performance distribution of an MPSoC. We proposed an analytical on-chip communication architecture model and integrated it into the execution path enumeration framework. To model the variation in memory access traffic, selected combinations of request intervals and access counts are considered. The experimental results with the 6-ch DVR example and the synthetic applications showed that our analytical model is very accurate under various bus matrix architectures and on-chip communication traffic. The proposed execution path enumeration resulted in better performance coverage than a time-consuming random simulation approach while requiring orders of magnitude less computation time. As future work, we plan to extend the analytical model to other arbitration policies and more complicated interconnection networks such as cascaded bus matrices or NoCs. Optimizing the path enumeration to reduce the search space without sacrificing diversity also remains for future work.
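The idea of enumerating selected combinations of access counts and request intervals can be illustrated with a toy sketch. It is not the paper's framework: it assumes tasks run one after another, that each task has a small hand-picked set of representative (access count, request interval) pairs, and that every memory access costs a fixed latency in cycles. The names `task_time`, `performance_range`, and `access_latency` are hypothetical.

```python
from itertools import product


def task_time(access_count, request_interval, access_latency):
    """Assumed task model: each access waits one request interval,
    then pays a fixed access latency (all values in cycles)."""
    return access_count * (request_interval + access_latency)


def performance_range(task_variants, access_latency):
    """Enumerate every combination of per-task variants.

    `task_variants` holds one list per task; each entry is a
    representative (access_count, request_interval) pair.
    Returns (best_time, worst_time, number_of_paths).
    """
    totals = [
        sum(task_time(count, interval, access_latency)
            for count, interval in combo)
        for combo in product(*task_variants)
    ]
    return min(totals), max(totals), len(totals)


if __name__ == "__main__":
    variants = [
        [(100, 8), (150, 6)],             # task A: two representative variants
        [(50, 20), (80, 10), (60, 15)],   # task B: three representative variants
    ]
    best, worst, paths = performance_range(variants, access_latency=4)
    print(f"{paths} paths, execution time between {best} and {worst} cycles")
```

The number of enumerated paths is the product of the variant counts per task, which is why keeping only a representative set of execution times per task matters for the size of the search space.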