The generation of predictive models is a frequent task in data mining, with the objective of producing highly accurate and interpretable models. Data reduction is an interesting preprocessing approach that can help obtain predictive models with these characteristics from large data sets. In this paper, we analyze rule-based classification models built with decision trees over training sets selected via evolutionary stratified instance selection. This method addresses the scaling problem that appears when evaluating large data sets, as well as the interpretability-precision trade-off of the generated models.
A basic process in data mining is the generation of representative models from data [1]. Depending on their domain of application, the models can be descriptive or predictive. The classical objective of predictive models is the accuracy or precision of the model. On the other hand, interpretability is an important aspect from the expert's point of view, in order to understand the model's behaviour [2]. In the classical literature we can find different proposals for measuring the quality of predictive models beyond precision, such as simplicity, interpretability, etc. [3].
In this paper we focus our attention on predictive models based on classification rules for data sets of different sizes, with special interest in the interpretability-precision trade-off [2]. Our models are extracted from the data sets by means of the C4.5 algorithm [4].
A possible way to improve the behaviour of predictive models, in both precision and interpretability, is to extract them from suitably reduced/selected training sets [5]. Training set selection can be carried out using instance selection algorithms, which select representative instance subsets following a given strategy; they can improve the prediction capabilities of the nearest neighbour rule, which in some cases is used as the objective of the selection strategy [6] and [7]. In [5], Sebban et al. study the effect of training set size on decision tree performance. An important conclusion of this analysis is that applying instance selection algorithms (concretely, the PSRCG algorithm) can improve generalization accuracy, reduce decision tree size and tolerate the presence of noise, establishing a close link between instance selection and tree simplification.
Evolutionary algorithms (EAs) are adaptive methods based on natural evolution that can be applied to search and optimization problems [8], [9] and [10]. EAs offer interesting results when applied to instance selection [11] and [12]. In this study, we use the CHC algorithm as our EA [13], given its behaviour shown in [14]. The basic idea consists of combining both objectives, interpretability and precision, in the fitness function [14] and [15].
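The combination of both objectives in a single fitness value can be sketched as a weighted sum of classification accuracy and reduction rate, a common formulation in the evolutionary instance selection literature. The 1-NN evaluator, the `alpha` weight and all names below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def fitness(mask, X, y, alpha=0.5):
    """Fitness of a binary selection mask (one gene per instance):
    a weighted sum of 1-NN accuracy on the whole training set
    (precision) and the reduction rate (a proxy for interpretability,
    since smaller training sets tend to yield smaller trees)."""
    selected = np.flatnonzero(mask)
    if selected.size == 0:
        return 0.0  # an empty subset cannot classify anything
    correct = 0
    for i in range(len(X)):
        # classify instance i with its nearest selected neighbour
        dists = np.linalg.norm(X[selected] - X[i], axis=1)
        if y[selected[np.argmin(dists)]] == y[i]:
            correct += 1
    accuracy = correct / len(X)
    reduction = 1.0 - selected.size / len(X)
    return alpha * accuracy + (1.0 - alpha) * reduction
```

A CHC run would evaluate each binary chromosome with this function, so chromosomes that keep few instances while preserving 1-NN accuracy dominate the population.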
The evaluation of instance selection algorithms over large data sets makes them ineffective and inefficient. The effect that the data set size produces on these algorithms is called the scaling problem.
We focus our attention on evolutionary instance selection for large data sets, with the aim of extracting highly precise and interpretable rules. To tackle the scaling problem, we combine the stratification of the data sets with instance selection over the strata [15]. Stratification reduces the size of the sets the selection algorithm must evaluate by splitting the original data set into strata to which selection is then applied. We analyze the quality of the selected training sets through the models (decision trees) that C4.5 extracts from them, from the precision and interpretability perspectives. To compare the results we provide a statistical analysis using several statistical tests (ANOVA, Levene and Tamhane [16]).
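The stratification step described above can be sketched as follows: the data set is split into disjoint strata that preserve the class distribution, the selection routine runs independently on each stratum, and the union of the selected subsets forms the final training set. The round-robin splitting and the helper names are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def stratify(X, y, n_strata, rng=None):
    """Split (X, y) into n_strata disjoint strata, keeping the class
    proportions roughly equal in every stratum."""
    rng = np.random.default_rng(rng)
    strata = [[] for _ in range(n_strata)]
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        for k, i in enumerate(idx):
            strata[k % n_strata].append(i)  # round-robin preserves proportions
    return [np.sort(np.array(s)) for s in strata]

def stratified_selection(X, y, n_strata, select):
    """Apply an instance selection routine `select` (which returns the
    kept indices within a stratum) to each stratum separately, then
    join the per-stratum selections into one training set."""
    kept = []
    for s in stratify(X, y, n_strata):
        kept.extend(s[select(X[s], y[s])])
    return np.sort(np.array(kept))
```

The key point is that the selection algorithm (e.g. CHC) only ever evaluates one stratum at a time, so its cost depends on the stratum size rather than on the full data set size.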
The outline of the document is as follows. In Section 2, we analyze the predictive models and their extraction using C4.5, presenting the measures considered to assess their behaviour. Section 3 describes the training set selection process and the drawbacks that the evaluation of very large data sets introduces in instance selection algorithms. Section 4 presents the evolutionary stratified instance selection process applied to training set selection. Section 5 contains the experimental study, presenting the methodology followed, the results and their analysis. Finally, in Section 6 we point out some concluding remarks.
In this contribution we have analyzed the extraction of classification rule-based models by means of evolutionary stratified training set selection. The quality of the models has been evaluated considering their accuracy and interpretability.
The main conclusions reached are the following:
• The evolutionary stratified instance selection (CHC R-P) offers the best model size while maintaining acceptable accuracy. It produces the smallest rule sets, with the fewest rules and the smallest number of antecedents per rule.
• The stratified CHC I-P allows us to obtain models with high test accuracy rates, similar to C4.5, but with the advantage that the model size is considerably reduced.
Finally, we can conclude that predictive model extraction by means of evolutionary stratified training set selection (CHC R-P or I-P) presents a good trade-off between accuracy and interpretability. Our proposals show very good scaling-up behaviour, obtaining good results as the data set size grows.