A hybrid evolutionary algorithm for attribute selection in data mining
|Article code||Publication year||English article||Word count|
|22156||2009||15-page PDF||9340 words|
Publisher: Elsevier - Science Direct
Journal: Expert Systems with Applications, Volume 36, Issue 4, May 2009, Pages 8616–8630
Real-life data sets are often interspersed with noise, which makes the subsequent data mining process difficult. The task of the classifier can be simplified by eliminating attributes that are deemed redundant for classification: retaining only the pertinent attributes reduces the size of the data set and allows a more comprehensible analysis of the extracted patterns or rules. In this article, a new hybrid approach comprising two conventional machine learning algorithms is proposed to carry out attribute selection. Genetic algorithms (GAs) and support vector machines (SVMs) are integrated using a wrapper approach. Specifically, the GA component searches for the best attribute set by applying the principles of an evolutionary process. The SVM then classifies the patterns in the reduced data sets, which correspond to the attribute subsets represented by the GA chromosomes. The proposed GA-SVM hybrid is validated using data sets obtained from the UCI machine learning repository. Simulation results demonstrate that the GA-SVM hybrid produces good classification accuracy and a high level of consistency, comparable to other established algorithms. In addition, the hybrid is improved by using a correlation measure between attributes as a fitness measure to replace the weaker members of the population with newly formed chromosomes. This injects greater diversity and increases the overall fitness of the population. The improved mechanism is validated on the same data sets used in the first stage. The results confirm the improvement in classification accuracy and demonstrate the hybrid's potential as a good classifier for future data mining purposes.
Data mining has developed into an important application because of the abundance of data and the need to extract useful information from raw data. Many useful patterns can be extracted from data, which helps predict the outcomes of previously unseen scenarios. The knowledge gained from data mining can be used in applications ranging from business management to medical diagnosis, so that decision makers can assess situations more accurately. Support vector machines (SVMs) have recently gained recognition as a powerful data mining technique for knowledge extraction (Burges, 1998). SVMs use kernel functions to map input features from a lower-dimensional space to a higher-dimensional one. Many practical applications exploit the efficiency and accuracy of SVMs, such as intrusion detection (Mukkamala, Janoski, & Sung, 2002) and bioinformatics, where the input features are of very high dimension.
Data mining is an essential step in the process of knowledge discovery in databases (KDD) (Fayyad, 1997). In addition to data mining, the major steps of KDD include data cleaning, integration, selection, transformation, pattern evaluation, and knowledge presentation. Because data are frequently interspersed with missing values and noise, which makes them incoherent, data pre-processing has become an important step before data mining: it improves the quality of the data and, in turn, the data mining results. Data pre-processing takes several forms, including data cleaning, data transformation, and data reduction. Data cleaning removes noise from the data. Data transformation normalizes the data. Data reduction reduces the amount of data by aggregating values or by removing and clustering redundant attributes.
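As an illustration of the data transformation step mentioned above, the following minimal sketch rescales each attribute to the range [0, 1] (min-max normalization). The passage does not say which normalization the authors used, so the choice of min-max scaling and the function name `min_max_normalize` are assumptions made only for this example.

```python
def min_max_normalize(rows):
    """Rescale every column of `rows` (a list of numeric lists) to [0, 1].

    Assumption: min-max scaling is used purely for illustration; the source
    text only says that data transformation normalizes the data.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        # A constant column carries no information, so map it to 0.0
        # instead of dividing by zero.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical toy records with two attributes; the second is constant.
print(min_max_normalize([[1, 10], [3, 10], [5, 10]]))
```

Scaling attributes to a common range matters for distance-based and kernel-based classifiers such as SVMs, because an attribute with a large numeric range would otherwise dominate the others.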
Removal of redundant attributes through selection of relevant attributes has become the focus of several recent research projects (Liu & Motoda, 1998). Several machine learning techniques have been applied to attribute selection, including evolutionary algorithms (EAs), neural networks, and Bayes’ theorem (Chang et al., 1999; Hruschka and Ebecken, 2003; Mangasarian, 2001; Tan et al., 2002; Wong et al., 2000). Hruschka and Ebecken (2003) used a Bayesian approach to carry out attribute selection, with the Markov blanket of the class variable as the selection criterion. Neural networks and fuzzy logic (Benitez, Castro, Mantas, & Rojas, 2001) have also been employed for the attribute selection task. The attributes were first ranked according to a relevance measure and then removed in increasing order of relevance until the generalization ability of the network reached unacceptable levels. The downside of neural networks is that they are not comprehensible to users. Furthermore, deciding the optimal number of neurons is a difficult task.
EAs appear promising for attribute selection because they perform a directed, stochastic, heuristic search. They are based on the process of natural selection and Darwin’s theory of “survival of the fittest”, which tends to drive an objective toward an optimum. Recently, EAs have been applied to attribute selection in several applications (Martin-Bautista and Vila, 1999; Shi et al., 1998). Pappa, Freitas, and Kaestner (2002) combined a genetic algorithm (GA) and C4.5 (Quinlan, 1992) in a multiobjective approach. A multiobjective genetic algorithm (MOGA) was used to select the best attribute set by minimizing both the error rate and the C4.5 tree size. The results showed that the majority of the solutions found by the MOGA dominated the baseline (the set of all attributes) and were distributed evenly along the Pareto front.
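To make the notions of dominance and Pareto front used above concrete, here is a minimal sketch for two objectives that are both minimized, such as error rate and tree size. The numeric values are invented for illustration and are not results from Pappa et al. or from this paper; the function names are likewise hypothetical.

```python
def dominates(a, b):
    """True if solution `a` dominates `b`: no worse in every objective and
    strictly better in at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]


# Made-up (error_rate, tree_size) pairs, one per candidate attribute subset.
candidates = [(0.10, 40), (0.12, 25), (0.15, 25), (0.08, 60), (0.20, 10)]
baseline = (0.16, 70)  # hypothetical "all attributes" solution

print(pareto_front(candidates))
# → [(0.1, 40), (0.12, 25), (0.08, 60), (0.2, 10)]
print([c for c in candidates if dominates(c, baseline)])
# → [(0.1, 40), (0.12, 25), (0.15, 25), (0.08, 60)]
```

In this toy example (0.15, 25) is dropped from the front because (0.12, 25) has a lower error rate at the same tree size, while (0.20, 10) stays on the front but does not dominate the baseline, since its error rate is higher.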
This confirms the ability of a GA to produce good, widely spread results owing to its randomness. It is thus worth investigating whether EAs and SVMs can be combined effectively into a good classifier empowered by attribute selection. Based on the past successes of EAs and SVMs, they are fused in a hybrid approach that carries out both attribute selection and data classification. The workflow of this hybrid model contains two main phases. In the first phase, the EA selects a set of attributes. In the second phase, these attributes are passed to the SVM classifier to obtain a fitness measure for each attribute set. The fitness values are then used by the GA to select the best set of attributes. This cyclic method is known as the wrapper approach. Moreover, improvements are made by replacing unfit members of an existing population in a bid to increase the average fitness of the population and obtain better results.
The remainder of the paper is organized as follows. Section 2 describes the attribute selection task in data mining and the approach used. Section 3 presents the proposed GA-SVM hybrid algorithm in the form of a flow chart and discusses its main characteristics, such as the chromosome structure, the population layout, and the improved correlation-based algorithm. Section 4 presents the case study, which includes the experimental data sets and simulation results. The results are tabulated and compared with several established algorithms; they demonstrate the viability and usefulness of the hybrid and its prospects for future data classification. Section 5 introduces the improvement of the proposed algorithm. Finally, Section 6 presents the concluding remarks.
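The wrapper cycle described above (a GA proposes attribute subsets and a classifier scores each one) can be sketched as follows. This is not the authors' implementation: to keep the example dependency-free, a leave-one-out nearest-centroid classifier stands in for the SVM, and the toy data, operators (one-point crossover, bit-flip mutation, elitism) and all parameter values are assumptions.

```python
import random


def make_toy_data(seed=1, n_samples=40):
    """Hypothetical data: attribute 0 tracks the class, attributes 1-3 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        X.append([label + rng.gauss(0, 0.1)] + [rng.gauss(0, 1) for _ in range(3)])
        y.append(label)
    return X, y


def subset_accuracy(X, y, mask):
    """Fitness of a chromosome: leave-one-out accuracy of a nearest-centroid
    classifier on the attributes whose bit is 1 (a stand-in for the SVM)."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty attribute set cannot classify anything
    correct = 0
    for i in range(len(X)):
        sums, counts = {}, {}
        for k, (row, label) in enumerate(zip(X, y)):
            if k == i:
                continue  # hold out sample i
            acc = sums.setdefault(label, [0.0] * len(cols))
            for c, j in enumerate(cols):
                acc[c] += row[j]
            counts[label] = counts.get(label, 0) + 1

        def dist(label):
            return sum((X[i][j] - sums[label][c] / counts[label]) ** 2
                       for c, j in enumerate(cols))

        correct += min(sums, key=dist) == y[i]
    return correct / len(X)


def ga_wrapper(X, y, pop_size=12, generations=15, mutation_rate=0.1, seed=0):
    """Evolve bit-mask chromosomes; each bit selects one attribute."""
    rng = random.Random(seed)
    n = len(X[0])

    def fitness(mask):
        return subset_accuracy(X, y, mask)

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]      # truncation selection
        children = [ranked[0][:]]              # elitism: keep the best as is
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n - 1)        # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([bit ^ 1 if rng.random() < mutation_rate else bit
                             for bit in child])
        population = children
    best = max(population, key=fitness)
    return best, fitness(best)


X, y = make_toy_data()
best_mask, best_fitness = ga_wrapper(X, y)
print(best_mask, best_fitness)
```

The key property of a wrapper is visible in `fitness`: the classifier itself is retrained and evaluated for every candidate subset, which is more expensive than a filter criterion but scores attributes by how they actually affect classification.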
Conclusion
This paper has proposed a hybrid evolutionary algorithm for attribute selection in data mining. The GA-SVM hybrid combines the stochastic nature of genetic algorithms with the capability of support vector machines in the search for an optimal set of attributes. Eliminating redundant attributes with the GA-SVM hybrid improves the quality of the data sets and enables better classification of future unseen data. The proposed GA-SVM hybrid was validated on five data sets obtained from the UCI machine learning repository. The results show that the proposed hybrid produces a high average classification accuracy that is comparable to or better than that of some of the established classifiers in the data mining community. The simulations also show the statistical consistency of the GA-SVM hybrid, which is evident from the histogram analysis and box plots. A secondary improvement to the hybrid was the use of a correlation measure to improve the average fitness of a chromosome population. The results verify that substituting weaker chromosomes on the basis of the correlation measure improved the hybrid’s classification ability, as seen in the higher classification accuracy attained on the same UCI data sets. The stability of the classifier was also enhanced, as indicated by the low variance of the results collected. The analysis has thus demonstrated the viability of the GA-SVM hybrid as a good classifier when the irrelevant attributes are removed.
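The text above does not spell out how the correlation measure is computed or how replacement chromosomes are formed, so the following is only one plausible reading, under stated assumptions: redundancy is taken as the mean absolute Pearson correlation between the selected attributes, and each of the weakest chromosomes is replaced by the least redundant of a few random candidates. The names `redundancy` and `replace_weakest` and the parameter `n_candidates` are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (v - mean_b) for x, v in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def redundancy(X, mask):
    """Mean absolute pairwise correlation among the attributes selected by `mask`."""
    cols = [j for j, bit in enumerate(mask) if bit]
    pairs = [(p, q) for i, p in enumerate(cols) for q in cols[i + 1:]]
    if not pairs:
        return 0.0
    columns = list(zip(*X))
    return sum(abs(pearson(columns[p], columns[q])) for p, q in pairs) / len(pairs)


def replace_weakest(population, fitnesses, X, n_replace, rng, n_candidates=5):
    """Return a copy of `population` in which the `n_replace` least fit
    chromosomes are replaced by low-redundancy random chromosomes."""
    n = len(population[0])

    def random_mask():
        while True:  # reject the empty attribute set
            mask = [rng.randint(0, 1) for _ in range(n)]
            if any(mask):
                return mask

    weakest = sorted(range(len(population)), key=lambda i: fitnesses[i])[:n_replace]
    new_population = [chrom[:] for chrom in population]
    for i in weakest:
        candidates = [random_mask() for _ in range(n_candidates)]
        new_population[i] = min(candidates, key=lambda m: redundancy(X, m))
    return new_population


# Hypothetical data: attribute 1 is exactly twice attribute 0, so the two
# are perfectly correlated and selecting both is redundant.
X = [[1, 2, 5], [2, 4, 1], [3, 6, 4], [4, 8, 2]]
population = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
fitnesses = [0.9, 0.5, 0.7]  # made-up classifier accuracies
print(replace_weakest(population, fitnesses, X, n_replace=1, rng=random.Random(0)))
```

Replacing only the weakest members keeps the fitter chromosomes intact while introducing new genetic material, which is consistent with the stated aim of raising the average fitness and diversity of the population.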