Job scheduling with adjusted runtime estimates on production supercomputers
|Article code|Publication year|English article|Word count|
|20360|2013|13 pages (PDF)|8800 words|
Publisher: Elsevier - Science Direct
Journal: Journal of Parallel and Distributed Computing, Volume 73, Issue 7, July 2013, Pages 926–938
The estimate of a parallel job’s running time (walltime) is an important attribute used by resource managers and job schedulers in various scenarios, such as backfilling and short-job-first scheduling. This value is provided by the user, however, and has been repeatedly shown to be inaccurate. We studied the workload characteristics based on a large amount of historical data (over 275,000 jobs in two and a half years) from a production leadership-class computer. Based on that study, we proposed a set of walltime adjustment schemes that produce more accurate estimates. To ensure the utility of these schemes on production systems, we analyzed their potential impact on scheduling and evaluated the schemes with an event-driven simulator. Our experimental results show that our method can achieve not only better overall estimation accuracy but also improved overall system performance. Specifically, the average estimation accuracy of the tested workload can be improved by up to 35%, and the system performance in terms of average waiting time and weighted average waiting time can be improved by up to 22% and 28%, respectively.
In supercomputing systems, the job runtime estimate, also called the requested walltime, is an important job attribute provided by users at job submission. Although this value was originally used by resource managers to kill a job at its expiration, it is also heavily used in job scheduling. Backfilling, for example, needs to know the expected runtime of both running and waiting jobs so that it can fill short jobs into backfilling windows, reducing fragmentation without delaying high-priority jobs. Some schedulers favor short jobs in order to improve average response time; they need to know the runtime estimates of the waiting jobs when sorting the queue. Moreover, job runtime estimates are essential to other resource management strategies, such as advance reservation, queuing time prediction, and walltime-aware job allocation that reduces fragmentation on torus-connected systems.
However, user estimates of job running time have been repeatedly demonstrated to be highly inaccurate. Indeed, a large number of jobs consume only a small portion of the walltime requested. A number of studies have investigated whether such inaccuracy affects job scheduling performance. Surprisingly, conflicting results have been reported. On one hand, some claimed that inaccuracy is helpful. For example, Mu’alem et al. reported that inaccurate runtime estimates have the potential to be beneficial because of backfilling; such results have led to the suggestion that estimates should be doubled or randomized to make them even less accurate. On the other hand, others suggested that accuracy is more favorable: studies have shown that using more accurate runtime estimates can improve system performance far more significantly than previously suggested.
In this paper, we present a set of walltime adjustment schemes that can be used by large-scale production systems directly.
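To make the role of runtime estimates in backfilling concrete, the following is a minimal sketch of an EASY-style backfilling check. It is an illustration under stated assumptions, not the scheduler studied in the paper: the class names `RunningJob` and `WaitingJob`, the functions `reservation_for_head` and `can_backfill`, and the single homogeneous pool of nodes are all hypothetical. The point it shows is that every decision depends on walltime estimates, both for running jobs (when will nodes be released?) and for the waiting candidate (will it finish inside the window?).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RunningJob:
    nodes: int
    est_end: float  # start time + walltime estimate (hypothetical field)


@dataclass
class WaitingJob:
    nodes: int
    walltime_est: float  # walltime estimate the scheduler uses


def reservation_for_head(running: List[RunningJob], free_nodes: int,
                         head_nodes: int, now: float) -> Tuple[float, int]:
    """Return (shadow_time, extra_nodes) for the highest-priority waiting job.

    shadow_time is the earliest time the head job is expected to start,
    judged only from the estimated end times of running jobs.
    extra_nodes is what remains free at that moment beyond the head job's needs.
    """
    if head_nodes <= free_nodes:
        return now, free_nodes - head_nodes
    available = free_nodes
    for job in sorted(running, key=lambda j: j.est_end):
        available += job.nodes
        if available >= head_nodes:
            return job.est_end, available - head_nodes
    raise ValueError("head job requests more nodes than the machine has")


def can_backfill(candidate: WaitingJob, running: List[RunningJob],
                 free_nodes: int, head: WaitingJob, now: float) -> bool:
    """True if starting the candidate now should not delay the head job."""
    if candidate.nodes > free_nodes:
        return False
    shadow_time, extra_nodes = reservation_for_head(
        running, free_nodes, head.nodes, now)
    # Either the candidate is expected to finish before the head job starts...
    finishes_in_window = now + candidate.walltime_est <= shadow_time
    # ...or it only uses nodes the head job will not need.
    fits_beside_head = candidate.nodes <= extra_nodes
    return finishes_in_window or fits_beside_head
```

Under these assumptions, an overestimated walltime for the candidate can make `finishes_in_window` false and block a backfill that would in fact have been harmless, while an overestimated `est_end` for a running job enlarges the apparent window.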
To develop these schemes, we first studied workload characteristics based on a large amount of historical data (275,000 jobs in 30 months) from a leadership-class computer. Next, we proposed a set of walltime adjustment schemes to produce more accurate estimates, and we discussed how to configure each scheme for real computer systems. We then evaluated the performance of our walltime adjustment schemes on production machines using simulations with real workloads. Our experimental results show that our method can achieve not only better overall estimation accuracy but also improved overall system performance. Specifically, the average and median estimation accuracy of the tested workload can be improved by up to 35% and 42%, respectively. Moreover, the system performance in terms of average waiting time and weighted average waiting time can be improved by up to 22% and 28%, respectively.
Several terms regarding job runtime are used repeatedly in this paper. We use job actual runtime ($t_{act}$) for the job execution time. We use user-requested walltime ($t_{req}$), or simply walltime, to represent the runtime estimate provided by the user at job submission; the resource manager kills a job when this time expires. We use $t_{sched}$ to represent the job walltime used by the scheduler for prioritizing and backfilling jobs. Usually, $t_{sched}$ equals $t_{req}$; in this work, however, $t_{sched}$ can take other, adjusted values. In this context, the term “walltime adjustment” refers to the effort of a system to adjust the user’s estimates to create possibly more accurate walltime estimates. The term “walltime estimate” refers to the runtime estimate either provided by the user or adjusted by the system.
The remainder of this paper is organized as follows. Section 2 discusses some related work. Section 3 presents our study of historical job traces. Section 4 presents our walltime adjustment schemes and analytical evaluation. Section 5 presents our analysis of the impact of imperfect prediction and an enhancement for utilizing walltime adjustment. Section 6 presents a performance evaluation of scheduling using enhanced walltime adjustment. Section 7 summarizes our conclusions.
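The three runtime terms defined in this introduction can be summarized in a small sketch. The field names mirror the symbols $t_{act}$, $t_{req}$, and $t_{sched}$; everything else is an assumption made for illustration. In particular, the accuracy ratio `t_act / estimate` is an assumed metric suggested by the remark that many jobs consume only a small portion of the requested walltime, since this excerpt does not define accuracy formally. The adjusted value of 1800 seconds in the usage note is an arbitrary placeholder, not the output of any scheme from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    t_req: float                     # user-requested walltime, in seconds
    t_act: float                     # actual runtime, in seconds
    t_sched: Optional[float] = None  # walltime the scheduler works with

    def __post_init__(self) -> None:
        # Usual case: the scheduler simply uses the user's request.
        if self.t_sched is None:
            self.t_sched = self.t_req


def observed_runtime(needed: float, t_req: float) -> float:
    """The resource manager kills a job once t_req expires, so the
    observed runtime can never exceed the requested walltime."""
    return min(needed, t_req)


def estimate_accuracy(t_act: float, estimate: float) -> float:
    """Assumed metric: fraction of the walltime estimate actually used."""
    if estimate <= 0:
        raise ValueError("walltime estimate must be positive")
    return t_act / estimate
```

For example, a job that requests 3600 seconds and runs for 900 seconds has `estimate_accuracy(job.t_act, job.t_req) == 0.25`. If the system hypothetically adjusted `t_sched` to 1800 seconds, the scheduler-side accuracy would rise to 0.5, while the kill limit enforced by the resource manager would stay at `t_req`.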