A decoding method for exploiting ILP in Java processors
|Article code||Year||English paper||Persian translation||Word count|
|20329||2008||22-page PDF||available to order||not calculated|
Publisher : Elsevier - Science Direct
Journal : Journal of Systems Architecture, Volume 54, Issue 7, July 2008, Pages 707–728
Java processors have been introduced to offer hardware acceleration for Java applications. They execute Java bytecodes directly in hardware. However, the stack nature of the Java virtual machine instruction set limits the achievable execution performance. In order to exploit instruction level parallelism and allow out of order execution, we must remove the stack completely. This can be achieved by recursive stack folding algorithms, such as OPEX, which dynamically transform groups of Java bytecodes into RISC like instructions. However, the decoding throughput obtained this way is limited. In this paper, we explore microarchitectural techniques to improve the decoding throughput of Java processors. Our techniques are based on the use of a predecoded cache that stores the folding results so that they can be reused. The ultimate goal is to exploit all available instruction level parallelism in Java programs by feeding a superscalar out of order backend core at a sustainable rate. With a predecoded cache of 2 × 2048 entries and a 4-way superscalar core, we obtain 4.8 to 18.3 times better performance than an architecture employing pattern based folding.
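To picture what folding does, the following is an illustrative sketch (not the paper's implementation; the instruction strings and the `fold` helper are assumed for illustration) of the pattern based case: a producer–producer–operator–consumer bytecode group is collapsed into a single stack-free, three-address RISC like instruction.

```java
import java.util.List;

// Hypothetical sketch of pattern based folding (picoJava-style):
// a group such as {iload_1, iload_2, iadd, istore_3} is collapsed
// into one RISC like three-address add, removing all stack traffic.
public class PatternFolder {
    // Folds a 4-bytecode group into one RISC like string, or returns
    // null if the group does not match the load-load-add-store pattern.
    public static String fold(List<String> group) {
        if (group.size() == 4
                && group.get(0).startsWith("iload_")
                && group.get(1).startsWith("iload_")
                && group.get(2).equals("iadd")
                && group.get(3).startsWith("istore_")) {
            String src1 = group.get(0).substring("iload_".length());
            String src2 = group.get(1).substring("iload_".length());
            String dst  = group.get(3).substring("istore_".length());
            return "add r" + dst + ", r" + src1 + ", r" + src2;
        }
        return null; // no folding opportunity; fall back to stack execution
    }

    public static void main(String[] args) {
        List<String> group = List.of("iload_1", "iload_2", "iadd", "istore_3");
        System.out.println(fold(group)); // one stack-free RISC operation
    }
}
```

Bytecode sequences that do not fit a fixed pattern cannot be folded this way, which is exactly the limitation the recursive (nested) schemes address.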
Java, with its simple object model and its platform neutrality, has become popular in a wide range of applications, ranging from small embedded systems to high performance servers. However, the existence of a virtual machine between the software application and the host platform dramatically degrades execution speed. The most naive method to execute Java bytecodes is to interpret them. However, there are several techniques that are much more effective and raise performance substantially. In environments with general purpose processors, JIT and adaptive compilation techniques dominate. These techniques profile method execution and dynamically compile the most frequently executed methods. The overhead of the compilation is amortized over a short time. Although this technique is dominant in PC and server environments, it cannot be applied to embedded systems, because it demands a considerable amount of memory and processing speed. Java processors were adopted to accelerate Java applications on these devices. These processors execute Java bytecodes directly in hardware. However, they do not exploit instruction level parallelism, because they are mainly stack machines. The serialization imposed by the stack places significant limits on instruction level parallelism. picoJava tries to eliminate stack accesses by folding groups of instructions into a RISC like operation. More complex folding methods exist, such as OPEX, which performs nested folding, whereas picoJava only performs folding based on patterns. OPEX removes the stack completely by performing nested folding in the decode stage of the processor. The decoding unit is coupled with the execution unit: the OPEX processor folds instructions in the execution path. Its main drawback is that the folding rate it achieves is not sufficient to feed a superscalar backend core, which could make use of all the available instruction level parallelism.
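The difference between pattern based and nested folding can be sketched as follows. This is a minimal illustration in the spirit of OPEX, with all names assumed (not the paper's algorithm): the decoder symbolically executes the bytecodes, keeping register names on a shadow stack instead of values, so arbitrarily nested expressions fold into three-address RISC code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of nested (recursive) folding: intermediate results are given
// fresh temporary registers, so groups that feed one another fold too,
// which pure pattern matching cannot handle.
public class NestedFolder {
    public static List<String> fold(List<String> bytecodes) {
        Deque<String> shadow = new ArrayDeque<>(); // register names, not values
        List<String> risc = new ArrayList<>();
        int temp = 0; // counter for fresh temporaries
        for (String bc : bytecodes) {
            if (bc.startsWith("iload_")) {
                shadow.push("r" + bc.charAt(6)); // local variable -> register
            } else if (bc.equals("iadd") || bc.equals("imul")) {
                String rhs = shadow.pop(), lhs = shadow.pop();
                String dst = "t" + temp++;
                risc.add((bc.equals("iadd") ? "add " : "mul ")
                        + dst + ", " + lhs + ", " + rhs);
                shadow.push(dst); // the nested result feeds the next operator
            } else if (bc.startsWith("istore_")) {
                risc.add("mov r" + bc.charAt(7) + ", " + shadow.pop());
            }
        }
        return risc;
    }

    public static void main(String[] args) {
        // (v1 + v2) * v3 stored to v0: nested folding beyond fixed patterns
        System.out.println(fold(List.of(
            "iload_1", "iload_2", "iadd", "iload_3", "imul", "istore_0")));
    }
}
```

The shadow stack disappears once decoding is done; only stack-free RISC instructions reach the backend, which is what enables out of order execution.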
Another approach tries to extract traces of instructions that are independent and can be executed in parallel. The detection of the traces is performed by finding the clean stack points, namely the points where the stack contains zero elements. Obviously, the instructions that follow such a point do not depend on the previous ones, as they do not use any value from the stack. The main drawback of this method is that the marking of the traces is performed in software, during the verification stage. In our work, we propose a hardware folding technique that dynamically transforms groups of Java bytecodes into RISC instructions, storing them in a cache to enable reuse. The core folding algorithm is based on OPEX, modified appropriately to make reuse feasible. The method resembles a simple hardware JIT compilation method, but in a fine grained form. It compiles blocks of code between successive clean stack points and stores the RISC products in a cache for later use. The cache can feed the execution unit at a very high rate, preventing it from starving. Compared to normal JIT compilation, the technique has the benefit that we do not experience the latency of compiling a whole method. The fine grained translation is done in a few cycles, so the user does not experience the high start up time common in embedded JIT enabled virtual machines. Additionally, the memory requirements are fixed (only the predecoded cache is used), as opposed to JIT solutions, which usually result in memory bloat. Furthermore, the translation is transparent to the user: it is a dynamic optimization technique in hardware. The idea of caching the folding results has appeared before as an extension to the picoJava II processor. In combination with a stack disambiguation technique that differentiates operand stack accesses from local variable accesses, it has made some level of parallelism feasible. However, that folding is pattern based, and the architecture does not permit out of order execution.
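Clean stack point detection itself is simple to sketch. In this hypothetical illustration (the stack-effect table and class names are assumptions, not the paper's code), the bytecode stream is walked with each opcode's net stack effect, and every position where the depth returns to zero marks a block boundary; the block that follows reads nothing off the stack, so it is independent of the block before it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of clean stack point detection: track operand stack depth and
// record the indices after which the stack is empty.
public class CleanStackPoints {
    // Net stack effect of a few integer bytecodes (push = +1, pop = -1;
    // iadd/imul pop two and push one, for a net effect of -1).
    static final Map<String, Integer> EFFECT = Map.of(
        "iload", +1, "iconst", +1,
        "iadd", -1, "imul", -1,
        "istore", -1);

    public static List<Integer> find(List<String> bytecodes) {
        List<Integer> points = new ArrayList<>();
        int depth = 0;
        for (int i = 0; i < bytecodes.size(); i++) {
            String op = bytecodes.get(i).split("_")[0]; // strip the operand
            depth += EFFECT.getOrDefault(op, 0);
            if (depth == 0) points.add(i); // clean stack point after index i
        }
        return points;
    }

    public static void main(String[] args) {
        // Two independent statements: v3 = v1 + v2; then v5 = v4
        List<String> code = List.of(
            "iload_1", "iload_2", "iadd", "istore_3",
            "iload_4", "istore_5");
        System.out.println(find(code)); // block boundaries for folding/reuse
    }
}
```

These boundaries are exactly where the proposed technique cuts the stream into blocks that are folded and cached independently.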
Our architecture aims at full exploitation of the instruction level parallelism in Java programs. We do so by removing the stack completely, folding even nested folding groups. Due to the limited folding throughput, we store the folding results in a predecoded cache, so as to feed a superscalar backend core at an efficient rate. Our folding algorithm is based on OPEX, modified appropriately to enable reuse of the folding results. As a variation, we place the folding unit out of the critical path of the processor, so as to avoid large initial latencies. With the folding unit on the fill side of the processor, more sophisticated techniques could be applied. A straightforward modification is to combine many predecoded cache lines into a trace cache entry, at the expense of greater complexity. We have used the SPECjvm98 benchmarks for our evaluation and have investigated our technique for different cache configurations. Assuming a 4-way superscalar out of order core as the execution unit and microcode interpretation of the complex object oriented Java bytecodes, we have evaluated the performance of our architecture in terms of Java bytecodes per cycle and RISC instructions per cycle. With a predecoded cache of 2 × 2048 entries we have seen a performance of 4.42 Java bytecodes per cycle with a corresponding RISC throughput of 3.19 RISC instructions per cycle. The pure decoding throughput of this architecture reaches 6.39 Java bytecodes per cycle, which, compared to the corresponding rate of an architecture with pattern based folding (1.15), constitutes a 453% speedup. The architecture with the folding unit on the fill side of the processor performs slightly worse, but it has the real benefit of relaxing the hardware design of the folding unit. The incorporation of a trace cache is expected to enhance performance considerably.
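The reuse idea behind the predecoded cache can be sketched as follows. This is a hypothetical illustration only (the cache geometry, the stand-in folded output, and all names are assumptions): folded RISC groups are stored keyed by the bytecode PC of the block's start, so a revisit of the same block, as in a loop, skips the slow folding unit and feeds the backend straight from the cache.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of predecoded cache reuse: fold on a miss, replay from the
// cache on a hit, so hot blocks pay the folding latency only once.
public class PredecodedCache {
    private final Map<Integer, List<String>> lines = new HashMap<>();
    private int hits = 0, misses = 0;

    // Fetch the folded RISC group for the block starting at 'pc',
    // folding (and filling the cache) only on a miss.
    public List<String> fetch(int pc, List<String> bytecodes) {
        List<String> cached = lines.get(pc);
        if (cached != null) { hits++; return cached; }
        misses++;
        // Stand-in for the folding unit's output for this block.
        List<String> folded = List.of("add r3, r1, r2");
        lines.put(pc, folded);
        return folded;
    }

    public int hits()   { return hits; }
    public int misses() { return misses; }

    public static void main(String[] args) {
        PredecodedCache cache = new PredecodedCache();
        List<String> block = List.of("iload_1", "iload_2", "iadd", "istore_3");
        for (int i = 0; i < 100; i++) cache.fetch(42, block); // e.g. a hot loop
        System.out.println(cache.hits() + " hits, " + cache.misses() + " miss");
    }
}
```

A real predecoded cache is of course a fixed-geometry hardware structure (e.g. 2 × 2048 entries) with eviction, not an unbounded map; the sketch only shows why reuse sustains the backend's issue rate.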
Conclusion
This paper proposes a recursive stack folding technique which keeps the folding results in a cache to allow reuse and high decoding throughput. Two ways of incorporating this mechanism were explored, ARCH1 and ARCH2. We have explored different cache configurations and have seen that the proposed architectures perform very well on computational benchmarks like mpeg and compress. ARCH1 has been observed to perform from 4.8 to 18.3 times better than pattern based folding. ARCH2, which is slower than ARCH1, performs from 4.7 to 16 times better than pattern based folding in a 2 × 2k cache configuration, but its performance falls dramatically for smaller caches. However, it is more flexible for hardware realization, since it puts the folding unit on the fill side of the processor. A high number of auxiliary RISC instructions has been reported throughout our experiments. These are partially due to arbitrary splits. Increasing the cache line size reduces their percentage, since no arbitrary splits occur, but it demands very large caches, which will suffer from very low utilization. A possible future enhancement is the incorporation of a trace cache, which will unite predecoded cache lines to form execution paths. PUSH and POP instructions of two adjacent predecoded cache lines can be factored out, thus decreasing the total RISC instruction count and yielding performance gains in terms of JIPC.
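The PUSH/POP factoring mentioned above can be pictured with a small sketch (instruction names and the merge helper are assumptions, not the paper's design): when one predecoded line ends by pushing a value that the next line immediately pops, merging the two lines into a trace makes the pair redundant, and it can be replaced by a direct register move.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of factoring out a PUSH/POP pair across two adjacent
// predecoded cache lines merged into one trace cache entry.
public class TraceMerge {
    public static List<String> merge(List<String> lineA, List<String> lineB) {
        List<String> trace = new ArrayList<>(lineA);
        trace.addAll(lineB);
        int last = lineA.size() - 1;
        // A trailing PUSH in lineA answered by a leading POP in lineB
        // becomes a single register move in the merged trace.
        if (lineA.get(last).startsWith("push ") && lineB.get(0).startsWith("pop ")) {
            String src = lineA.get(last).substring("push ".length());
            String dst = lineB.get(0).substring("pop ".length());
            trace.set(last, "mov " + dst + ", " + src);
            trace.remove(last + 1); // drop the now-redundant pop
        }
        return trace;
    }

    public static void main(String[] args) {
        List<String> a = List.of("add t0, r1, r2", "push t0");
        List<String> b = List.of("pop r3", "mul r4, r3, r5");
        System.out.println(merge(a, b)); // one auxiliary instruction fewer
    }
}
```

Each factored pair removes one auxiliary RISC instruction from the trace, which is the JIPC gain the conclusion anticipates.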