An experimental investigation of the impact of aggregation on the performance of data mining with logistic regression
|Article code||Publication year||English article||Persian translation||Word count|
|22063||2005||13-page PDF||Available to order||4884 words|
Publisher: Elsevier - Science Direct
Journal: Information & Management, Volume 42, Issue 5, July 2005, Pages 695–707
We studied the impact of data aggregation on the performance of logistic regression in predicting the direction of the Dow Jones industrial average (DJIA) stock market index. Data aggregation is a common operation in business, science, engineering, medicine, etc.; it is performed for purposes such as statistical, financial, and sales and marketing analysis, particularly within the context of a data warehouse. We showed experimentally that, for this example, as long as aggregation does not shrink the sample size unduly, it does not significantly impair the performance of the logistic regression model for predicting the direction of the DJIA stock market index. We also observed that aggregation-based models are simpler (less over-parameterized) than detail-based models. We used receiver operating characteristic (ROC) analysis to evaluate the robustness of such predictive models. Specifically, we used the area under the ROC curve as a summary measure of the overall performance of a given model.
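The area under the ROC curve mentioned above can be read as the probability that a randomly chosen positive case (here, an "up" day) receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The following is a minimal sketch of that pairwise definition in plain Python. The function name `auc` and the sample labels and scores are illustrative assumptions, not taken from the paper.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a positive case, 0 for a negative case.
    scores: predicted scores (for example, logistic regression probabilities).
    Hypothetical helper for illustration; not the paper's code.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, not results from the study.
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # 0.75
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. The quadratic loop is kept for readability and would be replaced by a rank-based computation on large samples.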
Data aggregation here refers to any data roll-up process, such as averaging and summing, in which information is expressed in a summary form. It is a common practice in various disciplines, including business, science, engineering, and medicine. Aggregation is performed for purposes such as statistical, financial, and sales and marketing analysis. The impact of data aggregation on the performance of data mining algorithms is of particular relevance to business data within a data warehouse, which we define here as a repository of data that is clean, integrated, complete, and summarized and thus “sets the stage for effective data mining”. Thus, understanding the implications of data aggregation for data mining algorithms is of considerable importance for the proper utilization of such data assets.
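As an illustration of the roll-up described above, the sketch below collapses a sequence of daily values into non-overlapping blocks of `k` consecutive values by averaging or summing. The function name `aggregate`, the choice of non-overlapping blocks, and the decision to drop an incomplete trailing block are assumptions made for this example; the excerpt does not state how the study built its aggregates.

```python
def aggregate(values, k, how="mean"):
    """Roll up a sequence into non-overlapping blocks of k consecutive values.

    Assumption: an incomplete trailing block is dropped rather than padded.
    Hypothetical helper for illustration; not the paper's procedure.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if how not in ("mean", "sum"):
        raise ValueError("how must be 'mean' or 'sum'")
    out = []
    for start in range(0, len(values) - k + 1, k):
        block = values[start:start + k]
        total = sum(block)
        out.append(total / k if how == "mean" else total)
    return out


daily = [1, 2, 3, 4, 5, 6, 7]        # made-up values, not DJIA data
print(aggregate(daily, 3))           # [2.0, 5.0]
print(aggregate(daily, 3, "sum"))    # [6, 15]
```

Seven detail records become two aggregated records when `k` is 3, so the sample size shrinks roughly by a factor of `k`.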
Conclusion
Experimental evidence, admittedly on a single problem, has been presented that demonstrates that aggregation-based data mining with logistic regression is statistically as robust as detail-based data mining with logistic regression. The experiment tested the performance of 2-, 3-, and 4-day aggregation-based models against the performance of one based on the complete detail data. ROC analysis, a well-established technique in diagnostics, was used for model assessment. The study showed, for the test case used here, that the performance of the aggregation-based models was not statistically significantly different from the performance of the detail-based model. If the detail data can be aggregated, then aggregation can be performed within a data preparation phase for data mining, with the advantage of utilizing all of the data in the mining process. Aggregation in our experiment also led to more parsimonious models, as it smoothed out some of the noise in the detail data. Parameters included in a detail-based model to capture noise in the detail data will not be needed in aggregation-based models.
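The excerpt does not say which statistical test was used to compare the models. One common way to compare two areas under ROC curves is the Hanley and McNeil standard error combined with a z statistic. The sketch below assumes the two AUC estimates come from independent test samples, which may not match the study's design. The function names and all numeric inputs are hypothetical.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Hanley-McNeil standard error of an AUC estimate.

    n_pos and n_neg are the numbers of positive and negative test cases.
    """
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc * auc)
        + (n_neg - 1) * (q2 - auc * auc)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


def z_statistic(auc_a, se_a, auc_b, se_b):
    """z statistic for the difference of two AUCs.

    Assumption: the two estimates are independent (no shared test cases).
    """
    return (auc_a - auc_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Hypothetical AUCs and sample sizes, not figures from the paper.
se_detail = hanley_mcneil_se(0.75, 50, 50)
se_aggregated = hanley_mcneil_se(0.72, 25, 25)
z = z_statistic(0.75, se_detail, 0.72, se_aggregated)
print(abs(z) < 1.96)  # True would mean no significant difference at the 5% level
```

Because aggregation reduces the number of test cases, the aggregated model's standard error is larger, which is one reason the abstract cautions against shrinking the sample size unduly.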