بهبود محاسبه WCET در حضور دستور العمل قابل قفل شدن حافظه نهان در عملکرد چند تکلیفی سیستم های زمان واقعی
|کد مقاله||سال انتشار||مقاله انگلیسی||ترجمه فارسی||تعداد کلمات|
|7269||2011||12 صفحه PDF||سفارش دهید||9950 کلمه|
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Journal of Systems Architecture, Volume 57, Issue 7, August 2011, Pages 695–706
In multitasking real-time systems it is required to compute the WCET of each task and also the effects of interferences between tasks in the worst case. This is very complex with variable latency hardware, such as instruction cache memories, or, to a lesser extent, the line buffers usually found in the fetch path of commercial processors. Some methods disable cache replacement so that it is easier to model the cache behavior. The difficulty in these cache-locking methods lies in obtaining a good selection of the memory lines to be locked into cache. In this paper, we propose an ILP-based method to select the best lines to be loaded and locked into the instruction cache at each context switch (dynamic locking), taking into account both intra-task and inter-task interferences, and we compare it with static locking. Our results show that, without cache, the spatial locality captured by a line buffer doubles the performance of the processor. When adding a lockable instruction cache, dynamic locking systems are schedulable with a cache size between 12.5% and 50% of the cache size required by static locking. Additionally, the computation time of our analysis method is not dependent on the number of possible paths in the task. This allows us to analyze large codes in a relatively short time (100 KB with 10651065 paths in less than 3 min).
Real-time systems require that tasks complete their execution before specific deadlines. Given hardware components with a fixed latency, the worst case execution time (WCET) of a single task could be computed from the partial WCET of each basic block of the task. However, in order to improve performance, current processors perform many operations with a variable duration. This is mainly due to speculation (control or data) or to the use of hardware components with variable latency. Branch predictors fall in the first category, whereas memory hierarchy and datapath pipelining belong to the second one. A memory hierarchy made up of one or more cache levels takes advantage of program locality and saves execution time and energy consumption by delivering data and instructions with an average latency of a few processor cycles. Unfortunately, the cache behavior depends on past references and it is required to know the previous accesses sequence in order to compute the latency of a given access in advance. Resolving these intra-task interferences is a difficult problem its own. Anyway, real-time systems usually work with several tasks which may interrupt each other at any time. This makes the problem much more complex, since the cost of inter-task interferences must also be identified and bounded. Furthermore, both these problems cannot be accurately solved independently, since the path that leads to the WCET of an isolated task may change when considering interferences. Cache-locking tackles the whole problem by disabling the cache replacement, so the cache content does not vary. Specifically, for an instruction cache, the instruction fetch hits and misses depend on whether each instruction belongs to a cached and locked memory line and not on the previous accesses. In this paper, we focus on the instruction fetch path and analyze several configurations of the memory architecture shown in Fig. 1. It consists of a line buffer (LB) and a lockable instruction cache (i-cache). The i-cache retains the fixed subset of instruction lines previously loaded by system software at task switches. It does not need fine-grained locking, but whole cache-locking. The LB has the size of a cache line and acts as a single-line cache memory regarding the tag, access latency and refill latency. The only difference with a conventional cache is that it prevents exploiting any temporal locality by invalidating its content on both i-cache hits and backward jumps to the currently buffered line. Instruction cache-locking can be implemented in several ways. One way is to design and synthesize a specific organization taking advantage of its control simplicity, since no replacement algorithm needs to be implemented and the refill happens always in bursts. Another way is making use of the locking capabilities present in many commercial processors devoted to the medium and high-end embedded market.2We propose a new method intended to minimize the instruction cache contribution on the WCET. Previous works studying cache-locking behavior do not distinguish between spatial and temporal locality, so it is not clear how either of these affects the WCET. In order to evaluate the importance of spatial locality we first analyze an instruction line buffer (LB) working alone. Second, to analyze the impact on the WCET of exploiting temporal locality, we add an instruction cache, managed under static and dynamic locking, to the LB. In static locking, memory lines are preloaded at system start-up and remain unchanged during the whole system lifetime. We use the Minimize Utilization (Lock-MU) method for selecting the memory lines to be locked into the instruction cache . In dynamic locking, the instruction cache is preloaded and locked in each context switch, so that there is one selection of memory lines per task. To obtain this selection of memory lines we propose Maximize Schedulability (Lock-MS), a new ILP-based method that considers both intra-task and inter-task interferences. That is, we get the selection of memory lines that provides the lowest overall execution cost (including preloading times) when used in a dynamic cache-locking multitasking system. We show how to model the system with easy to understand path-explicit constraints and then how to transform them into a compact model, which can be solved much faster. This paper is organized as follows. In Section 2, we review the background and related work. Section 3 presents our path-explicit method for selecting the lines to be locked into the instruction cache. The model compaction is described in Section 4. Section 5 shows experiments comparing several selection procedures, memory architectures and analysis times. Finally, Section 6 presents our conclusions.
نتیجه گیری انگلیسی
In this paper, we have proposed an ILP-based method (Lock-MS) to obtain the selection of memory lines to be loaded and locked into an instruction cache at each context switch in multitasking real-time systems. This selection takes into account the requirements of each single task and also the effects of interferences between tasks. Additionally, we show that the description of the model can be compacted when the number of possible paths grows. The combination of both the path-explicit and compact model is also possible, which is interesting for including high-level control flow information. To isolate the benefits of the instruction cache we analyze first three cacheless options: either fetching directly from the eSRAM, from a line buffer or from an ideal system providing single-cycle fetch. Compared to the direct eSRAM fetch, the line buffer reduces processor utilization by almost 50% at a very low cost, and we take this option as a suitable baseline for the following experiments. We also compare Lock-MS, which provides a per-task memory line selection, with an existing static locking approach (Lock-MU) for an instruction fetch architecture having a line buffer and a cache. Lock-MS performs better than Lock-MU for cache sizes under 40% of the code size, improving schedulability with small caches. With larger caches, most or all the highly accessed code can be cached and static-locking performs better, since it does not suffer from cache reloading penalties. Our results also show that our approach is not sensitive to the cache configuration (sets vs. ways) but to the total cache size, being able to successfully exploit direct mapped caches. In order to show that Lock-MS is computationally feasible we have designed a collection of synthetic tasks designed specifically to be large and difficult to analyze, using them in a set of large caches. We find that the solver obtains a very good solution in a very short time, and the remaining time is spent discarding very similar solutions. To assess the quality of the first integer solutions we use the real solution, showing that it is optimal in around 25% of cases and, in the rest of cases, no more than 0.45% larger than the real solution, which can be taken as an overestimation bound. Using the compact Lock-MS representation we show that the analysis time grows approximately in a quadratic way with respect to the problem complexity (essentially the task size), with the number of paths in the task having very little impact. This allows us to analyze large codes in a relatively short time (100 KB with 10651065 paths in less than 3 min). Finally, comparing Lock-MS with a worst-case conventional cache analysis, both options perform in a similar way in WCET. Regarding the response time in multitask, Lock-MS performs better and has higher predictability.