Large matrix–vector products on distributed bus networks with communication delay using the divisible load paradigm: performance analysis and simulation
Article code | Year of publication | Pages (English article) |
---|---|---|
27582 | 2001 | 22 pages (PDF) |
Publisher: Elsevier - Science Direct
Journal: Mathematics and Computers in Simulation, Volume 58, Issue 1, 19 December 2001, Pages 71–92
English Abstract
We present a performance analysis and experimental simulation results on the problem of scheduling a divisible load on a bus network. In general, the computing requirement of a divisible load is CPU intensive and demands multiple processing nodes for efficient processing. We consider the problem of scheduling a very large matrix–vector product computation on a bus network consisting of a homogeneous set of processors. The experiment was conducted on a PC-based networking environment consisting of Pentium II machines arranged in a bus topology. We present a theoretical analysis and verify these findings on the experimental test-bed. We also developed a software support system with flexibility in terms of scalability of the network and the load size. We present a detailed discussion on the experimental results providing directions for possible future extensions of this work.
English Introduction
The interest in network-based computing has grown considerably in recent times. In this environment, several computers are linked through a communication network to form large loosely-coupled distributed networks. One of the major attributes of such a distributed system is the capability that it offers to the user of any single node to exploit the considerable power of the complete network or a subset of it by partitioning and transferring its own processing load to the other processors in the network. This capability is useful in handling large computational loads. This kind of application is different from that of scheduling and load balancing which is more relevant to a distributed computing system composed of mainframe computers, each handling a large number of jobs. In this paper, we are concerned with the problem of distributing a single large load that originates at one of the nodes in the network. The load is massive in size compared to the computing capability of the node. So, the processor partitions the load into many fractions and distributes them among the processor nodes in the network. When the processing at each node is complete, the partial solutions are gathered and consolidated at the load originating processor to construct the complete solution. An important objective is to achieve a balance in the load distribution between processors so that computation is completed in the shortest possible time. This kind of processing load has the additional property that there is minimal inter-processor communication during the actual parallel execution of the program. However, there is a considerable communication delay at the beginning or at the end of a computational phase. The generic names of such loads are divisible loads [1], [2] and [3] and the theory of scheduling such divisible loads is commonly referred to as divisible load theory (DLT) [1], in the literature. The theory adopts a linear model for the processor speed and communication link speed parameters. In this model, the communication time over a channel is assumed to be proportional to the amount of load that is transferred over that channel, and the computation time is assumed to be proportional to the amount of load assigned to that processor. Also, the processing load is assumed to be arbitrarily divisible in the sense that, each partitioned portion of the load can be independently processed on any processor on the network. The primary concern in DLT research is to determine the optimal fractions of the total load to be assigned to the processors in such a way that the total processing time of the entire load is a minimum. The importance of the divisible load paradigm arises from the fact that although there is a well-developed literature on parallel processing algorithms, these algorithms have not been developed for distributed computing systems or network-based computing scenarios where the issues of communication delays and variable computation capability at processing nodes play a crucial role in the computational performance. Divisible load theory attempts to adapt these algorithms for optimised performance under these factors. However, this task is not straightforward since the communication delays are considerably higher in a distributed network environment than in a parallel processing environment. This is especially true for the processing of divisible loads since there is very little interprocessor communication during the actual computation process. 
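As a concrete illustration of the linear model described above, the sketch below computes load fractions for a homogeneous bus network under the standard single-installment assumptions (the originator also computes, load is sent to the processors one after another, and all processors stop computing at the same instant). The symbols `w_tcp` (computation time per unit load) and `z_tcm` (communication time per unit load) and the closed form itself follow the textbook DLT formulation, not necessarily the exact notation used in this paper.

```python
def optimal_fractions(m, w_tcp, z_tcm):
    """Load fractions for a homogeneous bus network under the linear DLT model.

    m      -- number of child processors (excluding the load-originating node)
    w_tcp  -- time to compute one unit of load on any processor
    z_tcm  -- time to transmit one unit of load over the bus

    Assumes single-installment sequential distribution, an originator that also
    computes, and the classical optimality condition that all processors stop
    computing at the same time instant.
    """
    rho = w_tcp / (w_tcp + z_tcm)             # each extra hop shrinks the share by rho
    alpha_0 = (1 - rho) / (1 - rho ** (m + 1))
    alphas = [alpha_0 * rho ** i for i in range(m + 1)]   # fractions sum to 1
    finish_time = alpha_0 * w_tcp             # common finish time of all processors
    return alphas, finish_time

# Example: four child processors, communication ten times faster than computation.
alphas, T = optimal_fractions(m=4, w_tcp=1.0, z_tcm=0.1)
```

In this sketch, as `z_tcm` tends to zero the fractions approach an equal split and the ideal speed-up of m + 1 is recovered; as `z_tcm` grows, later processors receive geometrically smaller shares, which is why the number of usefully employable processors is bounded.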
So, an implementation in a parallel processor will not be affected much by communication overheads, unlike an implementation in a distributed computing environment where communication overheads have a very significant effect on the optimal load distribution strategy. Below, we present a brief survey of some of the significant contributions in the DLT area. In [4], a heterogeneous linear network of processors was considered and a computational algorithm was developed to obtain the optimal load fractions by assuming that all the processors participating in the computation stop computing at the same time instant. In fact, this has been shown to be a necessary and sufficient condition for obtaining optimal processing time in linear networks by using the concept of processor equivalence [5]. An analytic proof of this assumption in bus networks was presented in [6]. A more general proof of this assertion is also available in [1]. In [7], closed-form solutions for the optimal schedule for bus and tree networks were derived. The concepts of optimal sequencing and optimal network arrangement were introduced in [8]. Barlas [9] obtained conditions on the optimal sequencing of load distribution in a single-level tree network by including the results collection phase. To find the ultimate speed-up using DLT analysis, Li [10] conducted an asymptotic analysis for various network topologies. In [11], a finite-size buffer constraint was imposed on each processor and an efficient incremental balancing algorithm was proposed to find an optimal schedule that satisfies the buffer constraints. All these studies were concerned with designing strategies under the assumption that only one load is available for processing. This assumption was relaxed in [12] and an efficient algorithm was proposed for multiple divisible jobs under the assumption that all the loads originate at the scheduler in bus networks. In [13], scheduling of multiple divisible loads originating at various nodes of a bus network was considered and optimal amounts of load exchange among the sending and receiving processors were derived. Recently, the results of DLT were shown to be applicable to the domain of network-based multimedia video/movie-on-demand systems. Here, DLT played a role in recommending an optimal size of the partitions of a long-duration movie to be retrieved from a set of servers to meet the demand of a client. In this scenario, a pool of servers co-operate to serve a single request so that the access time of the movie is minimised [14]. All the papers discussed above assumed the load to have a generic arbitrary divisibility property without actually specifying the exact nature of the computations required. Consequently, the overall model turned out to be linear and the optimal schedule obtained was found to be independent of the size of the load. A notable exception is [15], where a realistic homogeneous bus-oriented network is used to partition and solve a large matrix–vector product problem. However, one lacuna that all these papers share is the lack of experimental results to illustrate the actual speed-up improvement of the scheduling scheme. In this paper, we consider a bus network architecture and present experimental results that support the divisible load paradigm and show the speed-up improvement in an actual bus network-based computing environment. We take the matrix–vector product computation on a bus network of processors as the basic platform to illustrate the experimental results.
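For the matrix–vector product itself, the load is naturally divisible by rows: each processor can be given a contiguous block of rows of the matrix together with the full vector, compute its slice of the result independently, and return it to the originator for consolidation. The sketch below simulates this partitioning locally; `partition_matvec` and the reuse of `alphas` from the previous sketch are illustrative assumptions, not the paper's actual software support system.

```python
import numpy as np

def partition_matvec(A, x, fractions):
    """Split y = A @ x into independent row-block sub-products.

    Each 'processor' would receive one row block of A plus the whole of x,
    compute its partial result with no inter-processor communication, and
    send the partial result back to the originator for consolidation.
    """
    n = A.shape[0]
    boundaries = np.cumsum([0.0] + list(fractions))        # fractions assumed to sum to 1
    rows = [slice(int(round(lo * n)), int(round(hi * n)))
            for lo, hi in zip(boundaries[:-1], boundaries[1:])]
    partial_results = [A[r, :] @ x for r in rows]           # independent sub-products
    return np.concatenate(partial_results)

A = np.random.rand(2000, 2000)
x = np.random.rand(2000)
y = partition_matvec(A, x, alphas)       # alphas from the DLT sketch above
assert np.allclose(y, A @ x)             # consolidation reproduces the full product
```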
The organisation of the paper is as follows: in Section 2, we introduce the bus architecture and the problem addressed in this paper. This section also presents all the required notations and terminology used throughout the paper. We then derive the load distribution equations and obtain a closed-form expression for the optimal processing time. Following this, we derive bounds on the maximal set of processors that can be used. We conduct an asymptotic analysis and obtain certain performance bounds of interest. In Section 3, details of the experimental set-up, the procedure followed to measure the speeds of the processors and links, and implementation of the strategy are given. Section 4 discusses the results obtained and compares the experimental and theoretical performance curves. Section 5 concludes the paper and discusses some useful extensions to this work.
English Conclusion
The problem of divisible load scheduling for a matrix–vector product computation was experimentally studied. In the area of divisible load theory research, this is the first effort at obtaining experimental results. A test-bed consisting of high-speed Pentium-series machines was used to create a bus network. The speed parameters were measured fairly accurately and the load distribution was carried out as prescribed by the theoretical results. All the theoretical findings were successfully verified. This contribution is expected to be significant, as it highlights interesting issues arising between the experimental and theoretical results. In fact, this study allows researchers to judge the veracity of the assumptions used to construct the theoretical models. For instance, neglecting the effect of solution back-propagation time to the originator turns out to be a practical and realistic assumption if the size of the solution data is small and the network is a high-speed one. With our experiments, we have conclusively shown the benefits of using the divisible load paradigm in a distributed computing network with high-speed machines and links. With the use of 100 Mbps Ethernet networks, the delays experienced due to data transmission become even smaller. In fact, in our experiments, since the number of elements to be communicated to each processor is large, we could measure this transmission time with sufficient accuracy. On the other hand, when the results of the product computation are small and need to be propagated back (this involves just 200 elements in all), the communication delays become insignificant. It is to be noted that, in general, the need to account for communication delay while distributing load in a distributed computing network is a valid premise in a network-based environment. However, the application's requirements dictate whether it is critical to take communication delays into account for optimal performance. This is essentially linked to one of the most fundamental aspects of parallelism, namely, the trade-off between the scalability of the problem and the size of the network (in terms of its available computing power). The conclusions drawn from this experimental investigation serve as a useful input to DLT research and provide a good beginning for assessing the applicability of DLT results. We strongly believe that the impact of DLT will be significant for applications that demand CPU-intensive computations, and that it is not restricted to the matrix–vector product problem alone. In fact, in recent work [19], the usefulness of the divisible load paradigm is clearly demonstrated for image processing applications. It would be interesting and useful to conduct experimental investigations similar to the one reported in this paper in the domain of image processing, where there is a large scope for exploiting data parallelism.
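As a rough check on the claim that returning the result is negligible, the following back-of-envelope estimate assumes 8-byte floating-point elements and ignores protocol overhead (neither of which is specified above):

```python
result_bytes = 200 * 8                    # 200 result elements, assumed 8 bytes each
link_bytes_per_s = 100e6 / 8              # 100 Mb/s Ethernet expressed in bytes per second
print(result_bytes / link_bytes_per_s)    # ~1.3e-4 s, i.e. well under a millisecond
```

Under these assumptions the back-propagation delay is far below the per-processor transmission times for the matrix blocks themselves, which supports treating it as insignificant.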