In this paper, a dynamic over-sampling procedure is proposed to improve the classification of imbalanced datasets with more than two classes. This procedure is incorporated into a Hybrid algorithm (HA) that optimizes Multi Layer Perceptron Neural Networks (MLPs). To handle class imbalance, the training dataset is resampled in two stages. In the first stage, an over-sampling procedure is applied to the minority class to partially balance the size of the classes. In the second stage, the HA is run and the dataset is over-sampled in different generations of the evolution, generating new patterns in the minimum sensitivity class (the class with the worst accuracy for the best MLP of the population). To evaluate the efficiency of our technique, we pose a complex problem: the classification of 1617 real farms into three classes (efficient, intermediate and inefficient) according to the Relative Technical Efficiency (RTE) obtained by the Monte Carlo Data Envelopment Analysis (MC-DEA). The multi-classification model, named Dynamic Smote Hybrid Multi Layer Perceptron (DSHMLP), is compared to other standard classification methods with an over-sampling procedure in the preprocessing stage and to the threshold-moving method, where the output threshold is moved toward inexpensive classes. The results show that our proposal improves minimum sensitivity in the generalization set (35.00%) and obtains a high accuracy level (72.63%).
Classification problems based on imbalanced training datasets often occur in applications where the events of interest are rare; that is, the interesting minority groups usually make up a rather small proportion of the training dataset (Chawla et al., 2006; Zhao & Huang, 2007). Imbalanced training datasets often result in low classification accuracies for minority classes (He & Garcia, 2009; Sun et al., 2009; Torres et al., 2009).
Many techniques have been proposed to solve this kind of classification problem at either the data level (Kubat & Matwin, 1997) or the algorithmic level (Pazzani et al., 1994). In this paper, a dynamic over-sampling procedure (a hybrid approach between data-level and algorithmic solutions) is proposed to improve the classification of imbalanced datasets that have more than two classes. The base over-sampling procedure is the Synthetic Minority Over-sampling Technique (SMOTE) (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). This procedure has been applied in several research fields, for example in predictive microbiology (Fernández-Navarro et al., 2010; Fernández-Navarro, Hervás-Martínez, et al., 2011).
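As a minimal illustration of the interpolation idea behind SMOTE (the function below and its k and n_new parameters are our own sketch, not code from the paper), each synthetic pattern is placed on the segment joining a randomly chosen minority pattern and one of its k nearest minority-class neighbours:

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE-style sketch: generate n_new synthetic samples by
    interpolating a randomly chosen minority pattern with one of its k
    nearest minority-class neighbours (Chawla et al., 2002)."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # a pattern is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbours
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                      # pick a minority pattern at random
        j = rng.choice(neighbours[i])            # and one of its neighbours
        gap = rng.random()                       # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```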
This procedure is incorporated into a Hybrid algorithm (HA) (Moscato & Cotta, 2003) that optimizes Multi Layer Perceptron Neural Networks (MLPs). The HA combines an Evolutionary algorithm (EA) (Back, 1996), a clustering process, and a Local Search (LS) procedure. Motivated by the imbalanced class structure (Fernández et al., 2009; Sun et al., 2009), the main objective of this research is to evaluate dynamic over-sampling methods in which the class whose size is increased is the one with minimum sensitivity (MS) during the evolutionary process. The base algorithm was proposed in Fernández-Navarro, Hervás-Martínez, and Gutiérrez (2011).
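A rough Python sketch of this dynamic step is given below, assuming the smote sketch above; the helper names, the rate parameter and the once-every-so-many-generations schedule are illustrative assumptions rather than the paper's actual settings. Between generations, the per-class sensitivities of the best MLP are computed and SMOTE is applied to the class with the minimum sensitivity:

```python
import numpy as np

def minimum_sensitivity_class(y_true, y_pred, classes):
    """Class with the lowest per-class sensitivity (recall) for the
    predictions of the current best MLP, together with that sensitivity."""
    sens = {c: float(np.mean(y_pred[y_true == c] == c)) for c in classes}
    worst = min(sens, key=sens.get)
    return worst, sens[worst]

def dynamic_oversample(X, y, y_pred_best, rate=0.1, rng=None):
    """One dynamic over-sampling step: add SMOTE patterns (see the sketch
    above) to the minimum-sensitivity class of the best individual."""
    classes = np.unique(y)
    worst, _ = minimum_sensitivity_class(y, y_pred_best, classes)
    n_new = max(1, int(rate * np.sum(y == worst)))       # assumed growth rate
    X_new = smote(X[y == worst], n_new, rng=rng)
    return np.vstack([X, X_new]), np.concatenate([y, np.full(n_new, worst)])
```

In this sketch the HA would call dynamic_oversample every few generations, passing the training data and the best individual's predictions, and continue evolving on the enlarged dataset.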
In recent years, several research projects related to DEA models have been developed in the area of data mining, among which we highlight the papers by Toloo, Sohrabi, and Nalchigar (2009) and Yeh, Chi, and Hsu (2009). In the works of Wu (2009) and Tsai, Lin, Cheng, and Lin (2009), the combination of neural networks and DEA models has already been applied successfully.
The performance of the proposed methodology was evaluated on a real problem which consists of classifying 1617 farms into three classes (efficient, intermediate and inefficient) according to the Relative Technical Efficiency (RTE) obtained by applying the Monte Carlo Data Envelopment Analysis (MC-DEA) model to the 65 Agrarian Productive Strategies (APS), or typologies, identified in the original database. The classification problem is very complex due to the imbalanced class structure and to the way in which the class to which each farm belongs has been determined (see Section 3.1.1).
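Purely as an illustration of the labelling step (the actual cut points are those defined in Section 3.1.1; the thresholds below are hypothetical), the three classes could be obtained by discretizing the MC-DEA efficiency score:

```python
import numpy as np

def rte_to_class(rte, low=0.6, high=0.9):
    """Hypothetical discretization of RTE scores into the three classes;
    the real cut points come from the MC-DEA analysis in Section 3.1.1."""
    rte = np.asarray(rte, dtype=float)
    return np.where(rte >= high, "efficient",
                    np.where(rte >= low, "intermediate", "inefficient"))
```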
This paper is organized as follows: Section 2 describes the base classifier, the learning algorithm and the over-sampling approaches; Section 3 explains the experiments carried out and briefly analyses the database; Section 4 reports the results obtained with the proposed methods and with the methodologies used for comparative purposes; and, finally, Section 5 summarizes the conclusions of our work.
This paper combines three powerful techniques used in machine learning research: resampling procedures, evolutionary algorithms and neural networks. The approach brings these three elements together in a suitable way to resolve the problem of classifying real farms. The Relative Technical Efficiency (RTE) of each farm has been determined by the Monte Carlo Data Envelopment Analysis (MC-DEA) model. It is important to note that the classification problem considered falls within the scope of imbalanced multi-classification problems.
In general, the results obtained show that the proposed approaches, which are based on MLPs trained with HAs, are robust enough to tackle the multi-classification of RTE in real farms, and they obtain better results than the majority of the existing alternative methods.
There are two future research directions suggested by this study: (i) a multi-objective approach considering both the MS and C functions could be carried out; and (ii) since the (MS, C) measures are independent of the evolutionary algorithm and of the base classifier used, other types of base classifiers and evolutionary algorithms could be considered.