Simulative performance analysis of distributed switching fabrics for SCI-based systems
| Article code | Publication year | English paper length |
|---|---|---|
| 27424 | 2000 | 11 pages (PDF) |

Publisher: Elsevier - Science Direct
Journal: Microprocessors and Microsystems, Volume 24, Issue 1, 29 March 2000, Pages 1–11
English Abstract
This paper presents the results of a simulative performance study of 1D and 2D k-ary n-cube topologies as distributed switching fabrics for the Scalable Coherent Interface (SCI). Case studies are conducted on multiprocessor SCI networks composed of simple rings, counter-rotating rings, unidirectional and bidirectional tori, and tori with rings of uniform size. Based on a novel set of verified high-fidelity models, the results identify the fundamental performance characteristics of each of these SCI fabrics, and the tradeoffs between them, in terms of throughput and latency. Limits on the scalable performance of SCI as complexity and dimensionality increase are clarified, supporting design decisions for advanced multiprocessors.
English Introduction
The performance and scalability of high-speed computer networks have become critically important characteristics in the design and development of advanced distributed and parallel processing systems. Many applications require or benefit from the use of an interconnect capable of supporting shared memory in hardware, and chief among such interconnects for multiprocessors is the Scalable Coherent Interface (SCI). However, for SCI interconnects to scale to ever larger system sizes and support a host of embedded and general-purpose applications, a distributed switching fabric is required that scales with the number of nodes. One of the most promising families of topologies for distributed switching fabrics is the k-ary n-cube, a family originally investigated for high-end supercomputing and referenced widely in the literature as a target for algorithm mapping.

The SCI standard is targeted at increasing the bandwidth of backplane buses and became an IEEE standard in March 1992 [7]. It improves on the bandwidth of buses by using high-speed, ring-connected, point-to-point links. With a link speed of 1 GB/s (i.e. one gigabyte per second), addressing for up to 64K nodes, and a cache-coherence protocol for distributed shared-memory systems, the popularity of SCI for use in large multiprocessors has continued to increase. Sequent, Cray, and HP-Convex are among the parallel computer vendors that have developed proprietary implementations of SCI for their high-end systems. Sequent developed the IQ-link, implemented in their Numa-Q 2000 system to connect groups of four processors in a ring structure [8]. Cray developed the SCX channel, also known as the GigaRing, capable of sustained half-duplex bandwidths of 900 MB/s [11] and [12]. The HP-Convex Exemplar Series uses the SCI-based Coherent Toroidal Interconnect (CTI) to interface hypernodes consisting of eight processing units each.

SCI has also gained recognition in the workstation cluster market. To date, Dolphin Interconnect Solutions has emerged as the leading manufacturer of SCI adapter cards and switches for clusters. The Dolphin switch relies on a bus-based internal switch architecture called the B-link, which is capable of a bandwidth ranging from 200 to 400 MB/s depending on the operating clock speed. Sun has adopted the Dolphin implementation of SCI, dubbed CluStar, for their Enterprise Cluster systems. Recently, Dolphin introduced a dual-ported PCI/SCI adapter card from which to construct unidirectional 2D torus topologies for SCI. Data General, in collaboration with Dolphin, has developed a chipset for their AV20000 Enterprise server to interface SCI to Intel's Standard High Volume (SHV) server nodes [2]. In addition, Dolphin and Siemens jointly developed a PCI–SCI bridge to be used in the I/O subsystems of the Siemens RM600E Enterprise Server systems.

While Dolphin's B-link bus provides a cost-effective approach to internal switch architecture, it is limited in its scalability and support for multidimensional network topologies. In this paper, a crossbar-based SCI switch model is presented that does not suffer from these limitations. The switch uses routing tables that are automatically generated at startup and which guarantee the shortest path to each packet's final destination. The performance of k-ary n-cube systems is explored by conducting experiments with a fixed ring size over a variable number of total nodes.
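As a rough illustration of the startup-generated, shortest-path routing tables described above (this is not the authors' CAD-based crossbar switch model; the node numbering, port names, and breadth-first-search approach are assumptions made only for this sketch), the following Python fragment builds such a table for one node of a k-ary 2-cube:

```python
from collections import deque

def torus_neighbors(node, k, bidirectional=False):
    """Yield (port, neighbor) pairs for a node in a k-ary 2-cube (2D torus).

    Nodes are numbered 0..k*k-1; port labels are symbolic. With
    bidirectional=False only the +X and +Y rings exist (unidirectional torus).
    """
    x, y = node % k, node // k
    links = [("+X", ((x + 1) % k) + y * k),
             ("+Y", x + ((y + 1) % k) * k)]
    if bidirectional:
        links += [("-X", ((x - 1) % k) + y * k),
                  ("-Y", x + ((y - 1) % k) * k)]
    return links

def build_routing_table(source, k, bidirectional=False):
    """Breadth-first search from `source`: for every destination, record an
    output port at `source` that lies on a shortest path."""
    table = {}                    # destination -> output port at `source`
    first_port = {source: None}   # first hop used to reach each visited node
    queue = deque([source])
    while queue:
        cur = queue.popleft()
        for port, nxt in torus_neighbors(cur, k, bidirectional):
            if nxt not in first_port:
                first_port[nxt] = first_port[cur] or port
                table[nxt] = first_port[nxt]
                queue.append(nxt)
    return table

if __name__ == "__main__":
    # 4-ary 2-cube (16 nodes): compare routes with and without reverse rings.
    print(build_routing_table(0, k=4, bidirectional=False))
    print(build_routing_table(0, k=4, bidirectional=True))
```

Because breadth-first search visits nodes in order of hop count, the recorded first hop always lies on a minimal-length path, which is the property the switch model relies on when forwarding packets.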
The k-ary n-cube family consists of direct networks with n dimensions and k nodes per dimension, and its members include rings, meshes, tori, hypercubes, etc. These networks provide excellent scalability through a constant node degree (i.e. a fixed number of ports per node regardless of system size) and low latencies through small diameters (i.e. few hops when transferring packets from source to destination). Additionally, they provide topologies that have served as targets for many studies on the mapping of parallel algorithm graphs for multiprocessing and multicomputing. Work by Dally [3], Chung [1], and Reed and Grunwald [10] provides detailed analytical treatments of generic k-ary n-cube networks. By contrast, this paper takes a simulative, applied-research approach to determining the performance of k-ary n-cube networks constructed with SCI. Through simulation with high-fidelity models of SCI, more accurate results can be obtained to study the impact of the selected switching topologies on SCI multiprocessor networks.

The remainder of this paper is organized as follows. Section 2 introduces the Scalable Coherent Interface and its basic operation. Section 3 describes the SCI switch model, and Section 4 presents the performance simulation results for several k-ary n-cube topologies. Finally, conclusions and directions for future research are discussed in Section 5.
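To make the constant-node-degree and small-diameter properties of the k-ary n-cube family concrete, here is a minimal helper (an illustration only, not taken from the paper; degree is counted as output ports per node, and the diameter formulas are the standard ones for unidirectional and bidirectional rings):

```python
def kary_ncube_properties(k, n, bidirectional=True):
    """Basic structural properties of a k-ary n-cube.

    The per-node port count depends only on n, not on system size, while
    the diameter (worst-case hop count) is n*(k-1) with unidirectional
    rings and n*floor(k/2) when every ring is bidirectional.
    """
    nodes = k ** n
    degree = 2 * n if bidirectional else n
    diameter = n * (k // 2) if bidirectional else n * (k - 1)
    return {"nodes": nodes, "degree": degree, "diameter": diameter}

# A 6-ary 2-cube (6x6 torus, 36 nodes) keeps the port count per node fixed
# while the worst-case hop count stays small.
print(kary_ncube_properties(6, 2, bidirectional=False))  # {'nodes': 36, 'degree': 2, 'diameter': 10}
print(kary_ncube_properties(6, 2, bidirectional=True))   # {'nodes': 36, 'degree': 4, 'diameter': 6}
```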
English Conclusion
In this paper, the performance of several promising SCI topologies for distributed multiprocessor networks was examined through high-fidelity, CAD-based simulation with analytical verification. 1D SCI networks were the first cases studied, since they form the basis for all SCI networks. 2D tori were explored next, culminating in a case study on uniform-ring SCI networks for high-performance, fault-tolerant multiprocessor networks.

It was shown both analytically and through simulation that the aggregate throughput of single- and dual-ring SCI systems is independent of the network size. From a theoretical standpoint, the total effective throughput of a multiprocessor system based on a single SCI ring is bounded above by 1.39 GB/s, with a more practical limit of approximately 1.35 GB/s as determined through detailed simulations. By contrast, systems constructed from dual, counter-rotating SCI rings are bounded above by 4.57 GB/s, with a practical limit of approximately 3.6 GB/s. As such, the throughput of a dual-ring SCI network was found to be approximately 2.7 times higher than that of the single ring, and the latency was 25–47% lower for systems of up to 10 nodes, with the evidence suggesting a latency improvement of over 50% for larger systems.

SCI networks can be made scalable by increasing the number of rings in the topology and adding an additional dimension to the single- and dual-ring systems. The 2D topologies (i.e. tori) demonstrated a more scalable throughput that increases as a function of the system size. In the unidirectional tori, analytical results indicate that the throughput scales with the number of nodes N in the system, and simulation results found a practical limit approximately 10% lower. However, the latencies were actually 10–20% higher for the unidirectional SCI tori than for counter-rotating rings, due to the ring-to-ring switching delay required in a torus fabric. The throughputs of the bidirectional torus topologies were found to be between 2.5 and 3 times higher than those of their unidirectional counterparts. Ideally, the throughputs of these systems should be approximately four times those of the unidirectional ones; however, the increase in the number of busy-retry packets brought on by increased contention in the distributed switches, coupled with the longer distances traveled by both the echo and busy-retry packets, reduces the total effective throughput that can be obtained. The bidirectional tori were found to have 18–29% lower latency for systems of up to 36 nodes, with the gap widening for larger systems. The evidence suggests that these latencies will approach but not attain the 50% reduction indicated by the theoretical minimum.

Finally, through several simulation experiments, it was determined that despite an increase in ring-to-ring switching delays brought on by a higher average number of hops from source to destination, the uniform-ring networks nevertheless achieve throughputs comparable to their bidirectional torus counterparts. The evidence indicates that the additional aggregate bandwidth inherent to the uniform-ring topology balances the additional switching delays that are imposed. However, while the uniform-ring topology does provide more inherent capability for fault tolerance, the results indicate a latency increase of up to 25% for systems of up to 36 nodes, and even more for larger systems.
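The ring-topology comparison above follows directly from the reported figures; the short Python snippet below (the variable names and rounding are ours, while the GB/s values come from the results summarized above) reproduces the roughly 2.7x dual-ring advantage and shows how close each topology comes to its theoretical bound:

```python
# Throughput figures summarized above, in GB/s: theoretical upper bounds
# and the practical limits observed in the detailed simulations.
single_ring = {"theoretical": 1.39, "practical": 1.35}
dual_ring = {"theoretical": 4.57, "practical": 3.6}

# Practical dual-ring throughput relative to the single ring (~2.7x).
print(round(dual_ring["practical"] / single_ring["practical"], 2))  # 2.67

# Fraction of the theoretical bound each topology achieves in simulation.
for name, t in (("single ring", single_ring), ("dual ring", dual_ring)):
    print(name, f"{t['practical'] / t['theoretical']:.0%}")  # 97%, 79%
```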
The results presented in this paper represent the first high-fidelity simulations of SCI multiprocessor networks with k-ary n-cube topologies for investigating the performance scalability of SCI in terms of throughput and latency versus topology size and dimension. As summarized above, the results are significant in that they identify and clarify the performance envelopes and tradeoffs associated with the 1D and 2D topologies available for building SCI-based systems. These characteristics, and the novel set of verified CAD models that make them possible, are critical in providing insight for the design of next-generation multiprocessors and multicomputers with this promising interconnect technology. The models and simulations provided in this paper were intended to represent a distributed shared-memory multiprocessor constructed from computers connected via an SCI system-area network. However, the results could also be applied at a lower level, where the individual distributed switching nodes are each attached to a processor, memory module, or I/O controller.

Several activities are anticipated for future research in this area. For instance, the SCI networking results presented herein can be extended with application-oriented results to study the performance levels provided to particular distributed parallel algorithms and applications. These activities can be pursued in terms of trace-driven simulations, where access patterns are sampled from real applications and the traces drive or stimulate the models, or with execution-driven simulations, where real applications execute on virtual prototypes via simulation. In addition to a broader study from an application level, the study of 3D SCI fabrics and beyond is also anticipated. Finally, whereas this paper focuses on performance attributes, future work will include the study of dependability attributes with the development and analysis of fault-tolerance mechanisms in and for SCI through simulation.