Multi-port abstraction layer for FPGA intensive memory exploitation applications
Article code | Publication year | Length of English article |
---|---|---|
20370 | 2010 | 10 pages (PDF) |
Publisher: Elsevier - ScienceDirect
Journal : Journal of Systems Architecture, Volume 56, Issue 9, September 2010, Pages 442–451
English abstract
We describe an efficient, high-level abstraction, multi-port memory-control unit (MCU) capable of providing data at maximum throughput. This MCU has been developed to take full advantage of FPGA parallelism. Multiple parallel processing entities are possible in modern FPGA devices, but this parallelism is lost when they try to access external memories. To address the problem of multiple entities accessing shared data we propose an architecture with multiple abstract access ports (AAPs) to access one external memory. Bearing in mind that hardware designs in FPGA technology are generally slower than memory chips, it is feasible to build a memory access scheduler by using a suitable arbitration scheme based on a fast memory controller with AAPs running at slower frequencies. In this way, multiple processing units connected through the AAPs can make memory transactions at their slower frequencies and the memory access scheduler can serve all these transactions at the same time by taking full advantage of the memory bandwidth.
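The scheduling idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the MCU described in the paper: the function name `simulate_round_robin`, the round-robin policy, and the integer clock `ratio` between the memory clock and the port clocks are all hypothetical choices made for illustration. The abstract says only that "a suitable arbitration scheme" is used.

```python
from collections import deque


def simulate_round_robin(num_ports, ratio, fast_cycles):
    """Toy model: slow access ports served by a fast memory scheduler.

    Assumptions (not taken from the paper): every port posts one request
    on each slow-clock edge, the memory clock is `ratio` times faster
    than the port clock, and arbitration is plain round-robin.
    Returns the list of (fast_cycle, port) grants and the deepest
    request queue observed on any port.
    """
    queues = [deque() for _ in range(num_ports)]
    grants = []
    max_depth = 0
    next_port = 0  # round-robin pointer

    for cycle in range(fast_cycles):
        # A slow-clock edge occurs once every `ratio` fast cycles.
        if cycle % ratio == 0:
            for q in queues:
                q.append(cycle)
        max_depth = max(max_depth, max(len(q) for q in queues))

        # The scheduler grants at most one request per fast cycle.
        for offset in range(num_ports):
            port = (next_port + offset) % num_ports
            if queues[port]:
                queues[port].popleft()
                grants.append((cycle, port))
                next_port = (port + 1) % num_ports
                break

    return grants, max_depth


if __name__ == "__main__":
    grants, depth = simulate_round_robin(num_ports=4, ratio=4, fast_cycles=16)
    print(len(grants), depth)
```

In this model, as long as the number of ports does not exceed the clock ratio, each request is granted before the same port posts its next one, so no queue grows beyond one entry and the memory is busy on every fast cycle. With more ports than the ratio allows, requests start to back up.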
English introduction
In recent years FPGA technology has evolved from a validation framework into a computing platform. Given that the performance gap between FPGAs and ASICs has narrowed significantly [1], FPGAs have replaced ASICs in some electronic industries over the last decade; in the networking field, for instance, routers incorporate FPGAs into their circuitry to minimize time to market and related costs. FPGAs are currently used in system-on-chip (SoC) design [2] and [3] because they now offer sufficient resources, in some cases even on-chip hard-core processors. Modern FPGA devices allow massively parallel on-chip computing through deeply pipelined data-paths with large numbers of super-scalar processing units [4] and [5]. Furthermore, many processing tasks are executed in a fixed pattern over large amounts of data, and their implementation is thus well suited to exploiting the parallelism offered by FPGAs. Unfortunately, building a complex system on an FPGA requires more advanced hardware design skills than building the same system on a GPU-based platform.
A computing platform’s performance is quite sensitive to the behavior and limitations of its memory system. Processors have traditionally used memory hierarchy schemes, in which small memories with faster access times are located close to the processors whereas larger-capacity memories with slower access times are located farther away [6]. Data are moved from the larger memories to the smaller ones based on the principles of spatial and temporal data locality [7] and [8], which gives the processors faster memory access. Although these principles work well for most algorithms, if irregular data access is required, the system’s performance will probably be significantly degraded. In fact, code optimization techniques are highly dependent on data structure [9].
As an application example, in real-time video processing systems, access to all information and temporary results becomes a bottleneck; furthermore, adding a compression module does not increase system performance, since access to compressed data is usually data-dependent and requires irregular memory access. Multiple parallel processing entities are possible in current FPGA devices [10], but this parallelism is forfeited when they try to access external memories. From now on in this paper, the term “external memories” will refer to all the memory chips connected to the FPGA. The inherently sequential behavior of the external memories may limit the system’s performance. Therefore, this potential bottleneck must be dealt with efficiently in high-performance systems. When implementing massively parallel data-paths (with fine-pipelined processing structures), access to external memories must take place in specific time windows to avoid data collisions [11]. This task is critical, and it would be useful to abstract the memory access to facilitate the design of multiple parallel entities with intensive external memory access requirements.
We describe here a generic memory-control architecture designed specifically for reconfigurable hardware (FPGA devices) to be used in embedded systems. Nonetheless, because it is described at RTL, the memory-control architecture can easily be adapted to an ASIC. Nowadays, the use of high-level synthesis tools is becoming common in most academic and industrial environments. In particular, designs for FPGA devices are being produced with high-level synthesis tools to speed up the design process. Current FPGA circuits can be connected using visual block-based schemes such as System Generator [12], or the circuit connections may be described using C-like descriptions [13], [14] and [15], C++-based ones [16], etc. Over the last decade these languages and tools have been developed considerably and their use is becoming widespread.
They are presented as a way for inexperienced designers to design hardware for reconfigurable devices and also as a way of speeding up the design process. Nevertheless, although these languages accelerate the design process to an acceptable extent, access to the peripherals sometimes leaves much to be desired. In fact, memory controllers must be described using low-level synthesis tools in order to achieve better performance. The purpose of this paper is to describe a memory access controller that provides a degree of abstraction while still using the physical memory at full capacity. The main benefits of the memory controller are the abstraction of memory access through read and write ports, the maximization of the bandwidth at each port according to the number of open ports, and its suitability for most embedded systems because of its low power consumption.
The paper is organized as follows. Section 2 contains an overview of some other related memory controllers as well as the contexts in which they have been used. The hardware architecture of the MCU, together with the results of its implementation, is described in detail in Section 3. In Section 4 we assess the performance of the MCU by using it in a real system and finally, in Section 5, we offer a brief summary of the conclusions of our work.
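The introduction states that the bandwidth at each port is maximized according to the number of open ports. A back-of-the-envelope calculation can make that relationship concrete. The sketch below is an assumption for illustration only: it supposes that the scheduler grants one access per memory clock cycle, that the memory bandwidth is shared evenly among the open ports, and that a port can never exceed one word per cycle of its own clock. The function name and the clock and word-width figures are hypothetical and do not come from the paper.

```python
def per_port_bandwidth(mem_clock_hz, word_bits, open_ports, port_clock_hz):
    """Estimate the bandwidth (bits/s) available to each open port.

    Assumed model (not from the paper): one memory access per memory
    clock cycle, shared evenly among the open ports, and capped by what
    a single port can transfer at its own slower clock.
    """
    if open_ports < 1:
        raise ValueError("at least one port must be open")
    total = mem_clock_hz * word_bits         # peak memory bandwidth
    fair_share = total / open_ports          # even split among open ports
    port_limit = port_clock_hz * word_bits   # one word per port clock cycle
    return min(fair_share, port_limit)


if __name__ == "__main__":
    # Hypothetical figures: 200 MHz memory, 32-bit words, 50 MHz ports.
    for n in (1, 2, 4, 8):
        print(n, per_port_bandwidth(200_000_000, 32, n, 50_000_000))
```

Under these assumed figures, up to four ports each run at their own full rate, and the per-port share only starts to drop once more than four ports are open.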
English conclusion
We have demonstrated the versatility of using an MCU in systems with data-dependent memory access. The MCU uses simple interfaces, which is very useful in pipelined architectures. No sophisticated access scheduler is required to integrate the MCU into a system. We have designed an MCU capable of abstracting accesses to external memories from multiple entities working concurrently. We have also implemented an efficient memory access scheduler that can grant memory accesses on every clock cycle. The high working frequency of the MCU allows all the processing entities to have optimized access to the ZBT SSRAM memory. Segmented data-paths and super-scalar processing units can benefit greatly from the use of AAPs. The way in which the interfaces were designed allows them to fit well into segmented coarse architectures, thus adding high flexibility to a wide variety of systems. The image-processing system chosen for testing the MCU supports this claim of versatility. The MCU's ability to switch between reading and writing on each clock cycle makes it a particularly interesting solution for controlling external memories in systems-on-chip. We have also presented a high-level abstraction layer based on the Handel-C language as an example of how to integrate the MCU into C-like HDLs. Finally, we have illustrated with an image-processing example how this module can easily lead to significant performance improvements. We highlight that, apart from this improvement in performance, the use of the MCU significantly reduces design time in pipelined computing architectures because designers do not have to deal with memory access scheduling.