Download English ISI article no. 22063
Article title

An experimental investigation of the impact of aggregation on the performance of data mining with logistic regression
Article code : 22063
Publication year : 2005
Length of English article : 13 pages (PDF)
Source

Publisher : Elsevier - Science Direct

Journal : Information & Management, Volume 42, Issue 5, July 2005, Pages 695–707

Keywords
Data mining, Aggregation, Logistic regression, Prediction, Predictive modeling, Area under the ROC curve, Model assessment, Model performance, Data warehouse
Article preview

Abstract

We studied the impact of data aggregation on the performance of logistic regression in predicting the direction of the Dow Jones Industrial Average (DJIA) stock market index. Data aggregation is a common operation in business, science, engineering, medicine, and other fields; it is performed for purposes such as statistical, financial, and sales and marketing analysis, particularly within the context of a data warehouse. We showed experimentally that, for this example, as long as aggregation does not shrink the sample size unduly, it does not significantly impair the performance of the logistic regression model in predicting the direction of the DJIA. We also observed that aggregation-based models are simpler (less over-parameterized) than detail-based models. We used receiver operating characteristic (ROC) analysis to evaluate the robustness of such predictive models. Specifically, we used the area under the ROC curve as a summary measure of the overall performance of a given model.
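
The area under the ROC curve mentioned in the abstract can be read as the probability that a randomly chosen positive case (for instance, a day on which the index went up) receives a higher score than a randomly chosen negative case. The following is a minimal sketch of that pairwise (Mann-Whitney) formulation, not the authors' code; the function name `auc` and the toy labels and scores are assumptions made for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: sequence of 0/1 outcomes (1 = positive class, e.g. index went up).
    scores: predicted probabilities or any ranking scores, same length.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    # Count positive/negative pairs that are ranked correctly; ties count half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy labels and scores, invented for illustration only.
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why a single AUC figure is convenient for comparing models built on differently aggregated data.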

Introduction

Data aggregation here refers to any data roll-up process, such as averaging and summing, in which information is expressed in a summary form. It is a common practice in various disciplines, including business, science, engineering, and medicine. Aggregation is performed for purposes such as statistical, financial, and sales and marketing analysis. The impact of data aggregation on the performance of data mining algorithms is of particular relevance to business data within a data warehouse, which we define here as a repository of data that is clean, integrated, complete, and summarized and thus “sets the stage for effective data mining” [16]. Thus, understanding the implications of data aggregation for data mining algorithms is of considerable importance for the proper utilization of such data assets.
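
A roll-up of the kind described above can be sketched in a few lines. The sketch assumes consecutive, non-overlapping blocks of `k` observations and drops an incomplete trailing block; both choices, the function name `roll_up`, and the sample figures are assumptions for illustration, since the excerpt does not say how the blocks were formed.

```python
from statistics import mean


def roll_up(values, k, reducer=mean):
    """Aggregate consecutive, non-overlapping blocks of k values.

    An incomplete trailing block is dropped (an assumption of this sketch).
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    n_blocks = len(values) // k
    return [reducer(values[i * k:(i + 1) * k]) for i in range(n_blocks)]


if __name__ == "__main__":
    daily = [10.0, 12.0, 11.0, 13.0, 14.0, 16.0]  # made-up daily figures
    print(roll_up(daily, 2))       # 2-day averages
    print(roll_up(daily, 3, sum))  # 3-day totals
    # The number of records shrinks roughly by a factor of k.
    print(len(daily), len(roll_up(daily, 4)))
```

The last line makes the trade-off visible: each level of aggregation leaves fewer records for a mining algorithm to learn from.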

Conclusion

Experimental evidence, admittedly on a single problem, has been presented that demonstrates that aggregation-based data mining with logistic regression is statistically as robust as detail-based data mining with logistic regression. The experiment tested the performance of 2-, 3-, and 4-day aggregation-based models against the performance of one based on the complete detail data. ROC analysis, a well-established technique in diagnostics, was used for model assessment. The study showed, for the test case used here, that the performance of the aggregation-based models was not statistically significantly different from the performance of the detail-based model. If the detail data can be aggregated, then aggregation can be performed within a data preparation phase for data mining, with the advantage of utilizing all of the data in the mining process. Aggregation in our experiment also led to more parsimonious models, as it smoothed out some of the noise in the detail data. Parameters included in a detail-based model to capture noise in the detail data will not be needed in aggregation-based models.
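
The parsimony point can be illustrated with a small sketch. It assumes that a detail-based model takes one feature per day while an aggregation-based model takes only the mean of those days; this feature layout, the helper names (`fit_logistic`, `predict`, `to_aggregate`), the gradient-descent fitting, and the toy data are all assumptions for illustration and are not taken from the paper.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, row):
    """Predicted probability of the positive class for one feature row."""
    return sigmoid(w[0] + sum(wj * x for wj, x in zip(w[1:], row)))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent.

    Returns [bias, w_1, ..., w_d]; X is a list of feature lists.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for row, target in zip(X, y):
            err = predict(w, row) - target
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def to_aggregate(X):
    """Replace each row of detail features by its single mean value."""
    return [[sum(row) / len(row)] for row in X]


if __name__ == "__main__":
    # Made-up detail rows: three daily values per case, label 1 = "up".
    detail = [[-1.0, -0.5, -0.8], [-0.2, -0.9, -0.4],
              [0.6, 0.9, 0.3], [0.8, 0.2, 0.7]]
    labels = [0, 0, 1, 1]
    w_detail = fit_logistic(detail, labels)
    w_agg = fit_logistic(to_aggregate(detail), labels)
    print(len(w_detail), len(w_agg))  # parameter counts: 4 vs 2
```

Under these assumptions the aggregated model has two parameters where the detail model has four, which is one way to picture why aggregation may yield simpler models; whether predictive performance is preserved has to be checked separately, for example with ROC analysis.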