طراحی و تجزیه و تحلیل سیاست های مدیریت حافظه استاتیک برای دسترسی حافظه پنهان به حافظه غیر یکنواخت چند پردازنده منسجم
کد مقاله | سال انتشار | تعداد صفحات مقاله انگلیسی |
---|---|---|
15309 | 2002 | 22 صفحه PDF |
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Journal of Systems Architecture,, Volume 48, Issues 1–3, September 2002, Pages 59-80
چکیده انگلیسی
In this paper, we characterize the performance of three existing memory management techniques, namely, buddy, round-robin, and first-touch policies. With existing memory management schemes, we find several cases where requests from different processors arrive at the same memory simultaneously. To alleviate this problem, we present two improved memory management policies called skew-mapping and prime-mapping policies. By utilizing the properties of skewing and prime, the improved memory management designs considerably improve the application performance of cache coherent non-uniform memory access multiprocessors. We also re-evaluate the performance of a multistage interconnection network using these existing and improved memory management policies. Our results effectively present the performance benefits of different memory management techniques based on the sharing patterns of applications. Applications with a low degree of sharing benefit from the data locality provided by first-touch. However, several applications with significant sharing degrees as well as those with single processor initialization routines benefit highly from the intelligent distribution of data provided by skew-mapping and prime-mapping schemes. Improvements due to the new schemes are found to be as high as 35% in stall time.
مقدمه انگلیسی
Cache coherent non-uniform memory access (CC-NUMA) systems have become extremely popular since they are scalable and provide transparent access to data. With multiple levels of caches, they certainly provide cached data at low latencies. However, once the data access gets beyond the layers of cache, these machines pay a high penalty and their performance deteriorates. Cache misses are made up of local and remote memory accesses. Local memory access latencies are usually a magnitude higher than cache access latencies. The access time to a remote memory in a large system could be several orders of magnitude higher than the cache access time because of the time spent in the interconnection network. Even with low miss rates, the bottleneck in the performance of NUMA systems lies in the remote memory access latencies. Effectively, the cache miss latencies depend heavily on the ratio of local and remote accesses. The memory management policy governs the placement of data in shared memory. It specifies which memory accesses would be local and which would be remote. In this paper, our main aim is to present different memory management policies and study their impact on the application performance and interconnection network performance of a CC-NUMA multiprocessor system. Related work in this area can be divided into two categories. The first category has been the performance evaluation of memory management policies [1], [2], [3], [4], [5] and [6]. Most of these studies [1], [2], [3] and [4] focused on distributed shared memory systems without hardware cache coherence. The effect of different policies was studied for CC-NUMA systems in [5] and [6]. Verghese et al. [5] presented significant data on the OS/hardware support required for dynamic page migration and replication policies. However, their results also indicate that dynamic memory management policies improve the performance of parallel applications (the SPLASH workload) by only 4% over the static schemes. Bhuyan, et al. [6] presented the impact of existing memory management policies and switch design alternatives on the application performance. Hence we concentrate only on static memory management techniques. We propose two improved static schemes: (1) skew-mapping, (2) prime-mapping that significantly improve the performance of several applications over the existing schemes. The skew-mapping scheme is based on skewing pages which are allocated to memories using the round-robin policy. It can be specified as a function that maps logical pages onto memories such that the required pages could be accessed conflict-free. Several general classes of skewing memory data accesses have been investigated and characterized by previous researchers [7], [8], [9] and [10], but not yet been developed for CC-NUMA memory management. The prime-mapping scheme is based on the allocation of data pages to memories according to a prime number. The use of a prime number for effective distribution of data accesses has been studied in a few papers [11], [12] and [13]. Lawrie [11] described a memory system designed for parallel array access which is based on the use of a prime number of memories in SIMD computers. Yang [12] presented a prime-mapped cache in the vector processing environment. The memory access logic of improved allocation schemes is similar to those of the existing policies, resulting in no additional delay for memory accesses. Further, the prime-mapped cache for vector computers was evaluated using a set of applications in [13]. However, they have not yet been developed for CC-NUMA multiprocessors. The second category of related work is the performance evaluation of interconnection networks (IN's). Performance evaluation of IN's has been an active area of research for a long time [14], [15], [16], [17] and [18]. These studies were conducted with a synthetic workload that is more suitable for a message passing or networking environment. Workload in a CC-NUMA environment is characterized by unsymmetrical bulky messages due to cache coherence, synchronization and memory management. Furthermore, due to the IN advancements, such as the use of virtual channels (VC's), the improvements in system performance have to be judged from the dynamic changes during the execution of applications. The aim of this paper is to re-evaluate the performance of an IN with realistic application data accesses and memory management policies governing the placement of the data blocks. Execution-based evaluations for IN's have been reported [19], [20] and [21] to test the effectiveness of VC's. While these studies provide useful data, they do not explore different memory management techniques as we do. We consider a multistage interconnection network (MIN) in this paper, similar to the one employed in Butterfly and Cedar multiprocessors [22]. Unlike Cedar, we employ a NUMA organization with one network (like Butterfly) that is used for both forward and backward (reply) messages. We evaluate two different switch design alternatives (simple wormhole (SWH) and buffered virtual channel (BVC)) [20] for the MIN. To evaluate the different memory management techniques in conjunction with different switch architectures, we have significantly modified our CC-NUMA simulator based on Proteus [23]. We developed virtual memory support in the Proteus simulator to evaluate the memory management techniques. We have also incorporated detailed switch models and wormhole routing with VC's in the network to accurately interpret the effect of network latency on application stall time. We have modified the directory based cache coherence protocol [24] in the simulator to correctly account for the possible out-of-order messages due to the presence of VC's in the MIN. Our simulation results show that the performance of memory management techniques depends on the application sharing pattern. Applications with a low degree of sharing benefit from memory management techniques such as first-touch. However applications with initialization routines or moderate to high sharing degrees do not enjoy the same benefits. For several such applications, skew-mapping and prime-mapping schemes provide significant benefits. The paper makes the following contributions: • We study and evaluate a spectrum of current static memory management schemes for CC-NUMA multiprocessors. As a result, we present insights regarding the bottlenecks that limit application performance. • We develop two improved static memory management policies, called skew-mapping and prime-mapping, to improve the performance of CC-NUMA systems. • We characterize the effect of page placement on the temporal and spatial locality of memory access patterns for several scientific applications. • In order to investigate the performance impact of existing and improved memory management policies accurately and extensively, we employ an execution-driven simulation methodology that fully models the interconnection network design with several choices for crossbar switch design. • Finally, we analyze application performance for these memory management policies at the system level and the network level. Interference at different stages in the MIN is also captured and analyzed. The rest of the paper is organized as follows. Section 2 gives an overview of the existing memory management policies and the details of two improved memory management policies. Section 3 describes the experimental methodology. Section 4 presents not only the data-sharing patterns of each application, but also temporal and spatial locality characteristics of each application which are correlated with each memory management policy. Section 4 also presents a detailed study of memory access patterns for a single application, namely fast Fourier transform (FFT), to accurately depict the effect of the memory management policies on the execution flow of the application. Section 5 presents the impact of the policies and the effect of the switch architectures on the network latency and application stall time. Finally, Section 6 presents the conclusions.
نتیجه گیری انگلیسی
In this paper, we explored a spectrum of memory management policies for CC-NUMA multiprocessors. We found that existing memory management policies, namely buddy, round-robin and first-touch, each have their own limiting factors for providing good application performance. The problems mainly were due to bulky arrivals of requests to the same memory module. To alleviate these problems, we introduced two improved memory management policies, called skew-mapping and prime-mapping. The memory access characteristics of five memory management policies, namely buddy, round-robin, first-touch, skew-mapping, and prime-mapping, were analyzed in detail. Our performance metrics covered interarrival time distributions, application access patterns and application stall time. A detailed analysis of the effect of each memory management scheme on application access patterns showed that considerable improvement in performance can be attained by employing intelligent static memory management policies. The impact of memory management on the various components of application stall time was analyzed and shown to be significant. The improvements in stall time for skew-mapping and prime-mapping policies is mainly due to a better distribution of the data among the various memories. The data distribution induces a skewing effect on the temporal access pattern of multiple requests from different processors to the same memory. Improvements were found to be as high as 35% in stall time. For the FFT and FWA applications, the performance using first-touch is better than with the skew-mapping and prime-mapping memory management policies due to low sharing degrees (read or write). However, first-touch is not effective for applications with moderate to high sharing degrees. The impact of the memory management on network performance was also analyzed and shown to be significant. We used two different switch architectures (SWH and BVC) to represent the advancements in the current interconnect technology. Incorporating buffers and virtual channels in the switch reduces the average message latency tremendously, but we found that performance improvements are very much dependent on the memory management policy.