Article title

Performance analysis of “Groupby-After-Join” query processing in parallel database systems

Article code	Publication year	Pages (English article)
27831	2004	26 pages (PDF)

Source

Publisher : Elsevier - Science Direct

Journal : Information Sciences, Volume 168, Issues 1–4, 3 December 2004, Pages 25–50

Keywords

Groupby queries, Groupby-join queries, Parallel query processing, Parallel query optimization, Parallel databases, Performance analysis

Article preview

Abstract

Queries containing aggregate functions often combine multiple tables through join operations. Such queries are called “Groupby-Join” queries. In a special category of these queries, the group-by operation can only be performed after the join operation. These are known as “Groupby-After-Join” queries, and they are the focus of this paper. In parallel processing of such queries, it must be decided which attribute is used as the partitioning attribute: the join attribute or the group-by attribute. Based on the partitioning attribute, two parallel processing methods are discussed, namely the join partition method (JPM) and the aggregate partition method (APM). The behaviours of these parallelization methods are described in terms of cost models. Experiments are performed based on simulations. The simulation results show that the aggregate partition method performs better than the join partition method.
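
To make the query shape concrete, the following is a minimal sequential sketch of a Groupby-After-Join query. The tables `projects(pno, pname)` and `activities(pno, budget)`, their contents, and the function name are all invented for illustration and do not come from the paper. The sketch assumes `pno` is unique in `projects`. The join attribute is `pno`, while the group-by attribute is `pname`, so each budget row only acquires its group after the join.

```python
from collections import defaultdict

# Hypothetical tables, invented for illustration only.
# projects(pno, pname) and activities(pno, budget)
projects = [(1, "alpha"), (2, "beta"), (3, "alpha")]
activities = [(1, 100), (1, 50), (2, 70), (3, 30)]


def groupby_after_join(projects, activities):
    """Join on pno, then compute SUM(budget) grouped by pname."""
    # Hash index on the join attribute (assumes pno is unique in projects).
    name_by_pno = {pno: pname for pno, pname in projects}
    totals = defaultdict(int)
    for pno, budget in activities:
        pname = name_by_pno.get(pno)
        if pname is not None:        # inner join: unmatched rows are dropped
            totals[pname] += budget  # grouping is on pname, not on pno
    return dict(totals)


result = groupby_after_join(projects, activities)
```

Here projects 1 and 3 share the name `alpha`, so their budgets end up in the same group even though they have different join keys.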

Introduction

Queries involving aggregates are very common in database processing, especially in on-line analytical processing (OLAP) and data warehousing [1] and [3]. These queries are often used as a tool for strategic decision making. Queries containing aggregate functions summarize a large set of records based on the designated grouping. The input set of records may be derived from multiple tables using a join operation. Queries of this kind, which contain both aggregate functions and join operations, are called “Groupby-Join” queries. As the data repository for integrated decision making grows, aggregate queries need to be executed efficiently. Large historical tables need to be joined with each other and aggregated; consequently, effective optimization of aggregate functions has the potential to result in large performance gains. This paper focuses on the use of parallel query processing techniques in Groupby-Join queries where the group-by operation can only be performed after the join operation; we therefore call these “Groupby-After-Join” queries.

The work presented in this paper is part of a larger project on parallel aggregate query processing consisting of three parts: parallel group-by [14], parallel groupby-before-join [16], [17] and [18], and parallel groupby-after-join [15]. The first part of this project involved the parallelization of Group-By queries on a single table, with no join operation involved. The results have been reported in the Computer Systems: Science and Engineering International Journal [14]. The second part focused on the parallelization of Groupby-Join queries where the Join attribute is the same as the Group-by attribute, so that the group-by operation can be performed before the join for optimization purposes. The outcome of the second part was published in Springer LNCS [17]. In this paper, the focus is mainly on the third part, parallel groupby-after-join, also known as aggregate-join.
It concentrates on the parallelization of GroupBy-Join queries where the Group-By attributes are different from the Join attributes; consequently, the join operation must be carried out first, followed by the group-by operation. Previous work [15] identified two parallel processing methods for groupby-after-join queries, namely the join partition method (JPM) and the aggregate partition method (APM). The JPM and APM methods differ mainly in the selection of the partitioning attribute used to distribute the workload over the processors. The objective of this paper is not to propose new parallelization methods for Groupby-After-Join queries, but rather to evaluate the join partition method and the aggregate partition method. The main reason is that most existing work concentrates on identifying parallelization models for this type of query, and a complete analysis has yet to be made. In this paper, a thorough analysis of the two parallelization techniques proposed in our previous work [15] is presented. A comparison between these two parallelization methods is also made.

The rest of this paper is organized as follows. Section 2 explains the background of aggregate queries. Section 3 explains previous work on parallelization models for the join partition method and the aggregate partition method. Section 4 presents the cost models for both methods. Section 5 presents the comparison results. Lastly, Section 6 presents the conclusion and future work.
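
The difference in partitioning attribute can be illustrated with a small single-process simulation. The text here only states that JPM and APM partition on different attributes; the remaining steps in this sketch are assumptions made for illustration and should not be read as the algorithms of [15]. Specifically, it assumes that JPM hash-partitions both tables on the join attribute and then merges partial aggregates, and that APM partitions one table on the group-by attribute and replicates the other table to every processor. The tables, function names, and the list-of-lists model of "processors" are hypothetical.

```python
from collections import defaultdict

# Hypothetical tables: r(pno, pname) carries the group-by attribute,
# s(pno, budget) carries the aggregated value. pno is the join attribute.
r_table = [(1, "alpha"), (2, "beta"), (3, "alpha")]
s_table = [(1, 100), (1, 50), (2, 70), (3, 30)]


def partition(rows, key_index, n):
    """Hash-partition rows over n simulated processors on one attribute."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key_index]) % n].append(row)
    return parts


def local_join_aggregate(r_part, s_part):
    """On one processor: join on pno, then SUM(budget) grouped by pname."""
    names = defaultdict(list)
    for pno, pname in r_part:
        names[pno].append(pname)
    totals = defaultdict(int)
    for pno, budget in s_part:
        for pname in names.get(pno, []):
            totals[pname] += budget
    return dict(totals)


def jpm(r, s, n):
    """Assumed JPM: partition both tables on the join attribute."""
    r_parts = partition(r, 0, n)
    s_parts = partition(s, 0, n)
    partials = [local_join_aggregate(rp, sp) for rp, sp in zip(r_parts, s_parts)]
    # A group may be split across processors, so partial sums are merged.
    # (A real system would redistribute on the group-by attribute instead.)
    final = defaultdict(int)
    for partial in partials:
        for pname, total in partial.items():
            final[pname] += total
    return dict(final)


def apm(r, s, n):
    """Assumed APM: partition r on the group-by attribute, replicate s."""
    r_parts = partition(r, 1, n)
    results = {}
    for rp in r_parts:
        # Every processor sees all of s; each group lives on one processor,
        # so the local results are disjoint and need no further merging.
        results.update(local_join_aggregate(rp, s))
    return results


jpm_result = jpm(r_table, s_table, 3)
apm_result = apm(r_table, s_table, 3)
```

Under these assumptions both functions return the same totals; what differs is where the work goes. The JPM sketch needs a second merging phase, while the APM sketch pays for replicating `s` to every processor.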

Conclusion

In this paper, two parallel algorithms have been investigated for processing Groupby-After-Join queries in high performance parallel database systems. These two algorithms are the aggregate partition method and the join partition method. From this study it is concluded that the aggregate partition method is preferable when the number of joins produced is large, but not when the number of joins produced is relatively small. The join partition method, on the other hand, is preferred when the number of joins produced is small, but it suffers serious performance problems once the number of joins produced by the query increases. The performance evaluation results show that using a faster disk has the greatest potential to improve performance in all situations. In addition, increasing the number of processors, speeding up the CPU, and adding more memory are other techniques suggested for cases where the number of groups produced is large. Future work is planned to investigate high-dimensional Group-By operations, often identified as Cube operations [16], which are highly pertinent to data warehousing applications. Since applications of this type normally involve large amounts of data, parallelism is necessary in order to keep performance at an acceptable level.