In this article, the performance of data mining and statistical techniques was empirically compared while varying the number of independent variables, the types of independent variables, the number of classes of the independent variables, and the sample size. Our study employed 60 simulated examples, with artificial neural networks and decision trees as the data mining techniques, and linear regression as the statistical method. Using the RMSE as the performance metric, we obtained the following findings: (i) for continuous independent variables, a statistical technique (i.e., linear regression) was superior to data mining (i.e., decision tree and artificial neural network) regardless of the number of variables and the sample size; (ii) for continuous and categorical independent variables, linear regression was best when the number of categorical variables was one, while the artificial neural network was superior when the number of categorical variables was two or more; (iii) the performance of the artificial neural network improved faster than that of the other methods as the number of classes of the categorical variables increased.
The difficulties posed by prediction problems have given rise to a variety of problem-solving techniques. For example, data mining methods include artificial neural networks and decision trees, and statistical techniques include linear regression and stepwise polynomial regression. It is difficult, however, to compare the efficacy of the techniques and determine the best one because their performance is data-dependent.
A few studies have compared data mining and statistical approaches to solving prediction problems. Gorr, Nagin, and Szczypula (1994) compared linear regression, stepwise polynomial regression, and neural networks in the context of predicting student GPAs. Although they found that linear regression performed best overall, none of the methods performed significantly better than the ordering index used by the investigator. Shuhui, Wunsch, Hair, and Giesselmann (2001) reported that neural networks performed better than linear regression for wind farm data, while Hardgrave, Wilson, and Walstrom (1994) experimentally showed that neural networks did not significantly outperform statistical techniques in predicting the academic success of students entering the MBA program. Subbanarasimha, Arinze, and Anadarajan (2000) demonstrated that linear regression performed better than neural networks when the distribution of the dependent variable was skewed, and Kumar (2005) expanded on the result of Subbanarasimha et al. (2000), developing a hybrid method that improved the prediction accuracy.
These comparison studies have mainly considered a specific data set or the distribution of the dependent variable. However, other factors that have not yet been explored, such as the sample size and the characteristics of the independent variables, also affect the performance of these techniques. We empirically compared the performance of data mining and statistical techniques while varying the number of independent variables, the types of independent variables, the number of classes of the independent variables, and the sample size. Our study employed 60 simulated examples, with artificial neural networks and decision trees as the data mining techniques, and linear regression as the statistical method.
In addition to these general comparison results, using the RMSE as the performance metric, we determined the following: for continuous independent variables, a statistical technique (i.e., linear regression) was superior to data mining (i.e., decision tree and artificial neural network) regardless of the number of variables; for continuous and categorical independent variables, linear regression was best when the number of categorical variables was one, while the artificial neural network was superior when the number of categorical variables was two or more; and the performance of the artificial neural network improved faster than that of the other methods as the number of classes of the categorical variables increased.
The article is organized as follows. Section 2 illustrates the generation of the data sets and analysis methods for the empirical study. The experimental results are described in Section 3, and the conclusions and future research directions are presented in Section 4.
In this article, we present the results of an experimental comparison study of data mining and statistical techniques based on varying the number of independent variables, the types of independent variables, the number of classes of the independent variables, and the sample size. To evaluate the performance of the different techniques, we generated various simulated problems and used the RMSE metric.
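As a rough illustration of how RMSE can be used to score a prediction method on simulated data, the following minimal sketch fits a one-variable linear regression and compares it with a naive mean predictor on a hold-out sample. The data-generating process (`y = 2 + 3x` plus Gaussian noise), the sample sizes, and all function names are assumptions made for this example only; they are not the simulation design used in the study.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def fit_simple_linear_regression(x, y):
    """Ordinary least squares for one continuous predictor: y = b0 + b1 * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def simulate(n, rng):
    """Hypothetical generator (assumed, not from the article): y = 2 + 3x + noise."""
    x = [rng.uniform(0.0, 10.0) for _ in range(n)]
    y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
    return x, y


rng = random.Random(0)  # fixed seed so the run is repeatable
x_train, y_train = simulate(200, rng)
x_test, y_test = simulate(100, rng)

intercept, slope = fit_simple_linear_regression(x_train, y_train)
lr_predictions = [intercept + slope * xi for xi in x_test]

# Naive baseline: always predict the training mean of y.
train_mean = sum(y_train) / len(y_train)
baseline_predictions = [train_mean] * len(y_test)

lr_rmse = rmse(y_test, lr_predictions)
baseline_rmse = rmse(y_test, baseline_predictions)
print(f"linear regression RMSE: {lr_rmse:.3f}")
print(f"mean baseline RMSE:     {baseline_rmse:.3f}")
```

A lower RMSE on the hold-out sample indicates better predictive accuracy. Comparing other methods, such as a decision tree or a neural network, would follow the same pattern: fit each method on the same training sample and score it with `rmse` on the same hold-out sample.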
The main results include the following: when the independent variables are continuous, linear regression (LR) is superior to both the decision tree (DT) and the artificial neural network (ANN) regardless of the number of variables; when the independent variables are continuous and categorical, LR performs best when the number of categorical variables is small (i.e., CA = 1), while ANN is best when the number of categorical variables is two or more; and ANN performance improves faster than that of LR and DT as the number of classes of the categorical variables increases.
The above results were derived from simulated data and need further verification using a variety of actual data. However, the results are meaningful in that this study provides the first comparison between statistical and data mining techniques based on the characteristics of the independent variables. In addition, the results of this study provide insight for selecting the most appropriate prediction method for a problem based on the characteristics of the problem’s independent variables. A promising direction for future research is to apply this approach to comparing the performance of classification methods.