Genetic algorithms in feature and instance selection
|Paper code||Year of publication||English paper||Persian translation||Word count|
|8103||2013||8 pages (PDF)||Available to order||6060 words|
Publisher : Elsevier - Science Direct
Journal : Knowledge-Based Systems, Volume 39, February 2013, Pages 240–247
Feature selection and instance selection are two important data preprocessing steps in data mining: the former aims to remove irrelevant and/or redundant features from a given dataset, and the latter to discard faulty data. Genetic algorithms have been widely used for these tasks in related studies. However, the two preprocessing tasks are generally considered separately in the literature. It is unknown how classification performance differs when feature and instance selection are performed together rather than individually. Therefore, the aim of this study is to perform feature selection and instance selection based on genetic algorithms using different priorities, and to examine the classification performance over datasets from different domains. The experimental results obtained from four small and large scale datasets containing various numbers of features and data samples show that performing both feature and instance selection usually makes the classifiers (i.e., support vector machines and k-nearest neighbor) perform slightly worse than performing feature selection or instance selection individually. However, while there is no significant difference in classification accuracy between these data preprocessing methods, the combination of feature and instance selection largely reduces the computational effort of training the classifiers, compared with performing feature or instance selection individually. Considering both classification effectiveness and efficiency, we demonstrate that performing feature selection first and instance selection second is the optimal solution for data preprocessing in data mining. With this ordering, both SVM and k-NN classifiers provide classification accuracy similar to the baselines (i.e., those without data preprocessing). The decisions regarding which data preprocessing task to perform for different dataset scales are also discussed.
The process of knowledge discovery in databases (KDD), or data mining, generally involves a number of steps, such as dataset selection, data preprocessing, data analysis, and result interpretation and evaluation. Data preprocessing is one of the most important steps, with the aim of making the chosen dataset as ‘clean’ as possible for eventual analysis and evaluation. In other words, quality mining results cannot be obtained if the data quality is low. Feature selection (or dimensionality reduction) and instance selection (or record reduction) are two of the more active preprocessing problems in data mining. This is because the number of features and data samples is usually very large in most real-world data mining problems.
If too many instances are considered, the result can be large memory requirements, high disk access, slow execution speed, and a possible over-sensitivity to noise. In addition, data are often not all equally informative, and some data points will lie further away from the sample mean than is deemed reasonable. Record reduction is therefore aimed at discarding faulty data (or outliers), which can be regarded as noisy points lying outside a set of defined clusters and which could lead to significant performance degradation. Data mining tasks such as classification or prediction that are carried out without the instance selection step will very likely produce poorer results.
On the other hand, if too many features are used for data analysis, the curse of dimensionality can arise. Since not all of the pre-chosen features are informative, the objective of feature selection is to select more representative features that have more discriminative power over a given dataset. This is also called dimensionality reduction. In the literature, many related studies have shown promising results for feature selection and instance selection approaches.
However, up until now, the focus has been on either selecting more representative features or reducing faulty data, as it relates to effective classification or prediction. This leads to the important research question of which step (i.e., feature selection or instance selection) should be performed first when both steps are critical to improving the mining performance. For many relevant and large scale datasets, both data preprocessing steps need to be performed. This is because in many domain problems there is usually no exact, agreed-upon number of variables, and not all of the variables collected for a specific domain are necessarily informative. Furthermore, some data samples in a given large dataset may be regarded as noisy. Therefore, feature selection and instance selection should both be considered in order to develop a more effective model.
Genetic algorithms (GAs) are among the most widely used techniques for feature and instance selection, and can improve the performance of data mining algorithms. In particular, Cano et al. have shown that GAs can obtain better results than many traditional, non-evolutionary instance selection methods, in terms of both instance selection rates and classification accuracy. Moreover, GAs have been shown to be suitable for large-scale feature selection problems. However, very few studies consider feature selection and instance selection together using a GA over a given dataset. For example, given a dataset D containing m-dimensional features and i data samples, using feature selection and instance selection as the first and second preprocessing steps, respectively, will lead to D1 containing n-dimensional features and j data samples (where 0 < n < m and 0 < j < i). On the other hand, if the operations are performed in reverse order, different results can be obtained.
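The order dependence described above can be illustrated with a minimal sketch. It is not the authors' GA configuration: the helper names (`evolve_mask`, `loo_1nn_accuracy`, `select_features`, `select_instances`, `toy_data`), the GA parameters, the fitness function (leave-one-out 1-NN accuracy minus a small penalty on the fraction kept) and the synthetic data are all assumptions made for illustration. A boolean mask over columns plays the role of feature selection, and a mask over rows plays the role of instance selection.

```python
import random

def evolve_mask(length, fitness, pop_size=12, generations=15, p_mut=0.05, seed=0):
    """Tiny generational GA over boolean masks; returns the fittest mask found."""
    rng = random.Random(seed)

    def repair(mask):
        # Guarantee a non-empty selection so that n > 0 and j > 0.
        if not any(mask):
            mask[rng.randrange(length)] = True
        return mask

    pop = [repair([rng.random() < 0.5 for _ in range(length)]) for _ in range(pop_size)]
    for _ in range(generations):
        scored = [(fitness(mask), mask) for mask in pop]

        def pick():
            # Binary tournament: the fitter of two random individuals becomes a parent.
            a, b = rng.sample(scored, 2)
            return a[1] if a[0] >= b[0] else b[1]

        pop = [max(scored, key=lambda s: s[0])[1][:]]  # elitism: keep the current best
        while len(pop) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, length) if length > 1 else 0  # one-point crossover
            child = [(not g) if rng.random() < p_mut else g for g in p1[:cut] + p2[cut:]]
            pop.append(repair(child))
    return max(pop, key=fitness)

def loo_1nn_accuracy(X, y, rows, cols):
    """Leave-one-out 1-NN accuracy on all samples, using `rows` as neighbours and `cols` as features."""
    correct = 0
    for q in range(len(X)):
        best_d, best_label = float("inf"), None
        for r in rows:
            if r == q:
                continue  # a sample may not vote for itself
            d = sum((X[q][c] - X[r][c]) ** 2 for c in cols)
            if d < best_d:
                best_d, best_label = d, y[r]
        correct += best_label == y[q]
    return correct / len(X)

def select_features(X, y, seed=0):
    """Feature selection: evolve a column mask (assumed fitness, see lead-in)."""
    m, rows = len(X[0]), range(len(X))
    def fitness(mask):
        cols = [c for c in range(m) if mask[c]]
        return loo_1nn_accuracy(X, y, rows, cols) - 0.01 * len(cols) / m
    mask = evolve_mask(m, fitness, seed=seed)
    return [[row[c] for c in range(m) if mask[c]] for row in X], y

def select_instances(X, y, seed=0):
    """Instance selection: evolve a row mask (assumed fitness, see lead-in)."""
    i, cols = len(X), range(len(X[0]))
    def fitness(mask):
        rows = [r for r in range(i) if mask[r]]
        return loo_1nn_accuracy(X, y, rows, cols) - 0.01 * len(rows) / i
    mask = evolve_mask(i, fitness, seed=seed)
    return [X[r] for r in range(i) if mask[r]], [y[r] for r in range(i) if mask[r]]

def toy_data(seed=1, n=30):
    """Synthetic two-class data: two informative features followed by two noise features."""
    rng = random.Random(seed)
    X, y = [], []
    for k in range(n):
        label = k % 2
        X.append([3 * label + rng.gauss(0, 1), 3 * label + rng.gauss(0, 1),
                  rng.gauss(0, 3), rng.gauss(0, 3)])
        y.append(label)
    return X, y

X, y = toy_data()                                        # D: i = 30 samples, m = 4 features
fs_then_is = select_instances(*select_features(X, y))    # feature selection first
is_then_fs = select_features(*select_instances(X, y))    # instance selection first
for name, (Xr, yr) in (("FS then IS", fs_then_is), ("IS then FS", is_then_fs)):
    print(f"{name}: {len(Xr[0])} features, {len(Xr)} samples")
```

Because the second step searches over whatever the first step left behind, the two orderings need not arrive at the same reduced dataset, which is the point the paragraph above makes. Note that the masks here are only guaranteed to be non-empty; nothing forces the GA to drop at least one feature or sample.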
The aim of this study is to perform feature selection and instance selection based on genetic algorithms using different priorities and to examine the classification performance over datasets from different domains. In addition, the results are compared with those obtained from a dataset without either data preprocessing step, a dataset produced by feature selection only, and a dataset produced by instance selection only. The rest of this paper is organized as follows. Section 2 describes the concepts of feature selection and instance selection, and gives an overview of genetic algorithms in the context of data preprocessing. Section 3 presents the research design and experimental results. Finally, some conclusions are offered in Section 4.
Conclusion
Feature selection and instance selection are two important data preprocessing steps in the data mining process. The main goal of each of these two steps is to make a given dataset ‘cleaner’ and/or ‘more representative’ by filtering out irrelevant features and noisy data samples in order to obtain good quality mining results. This is the first attempt to assess the performance of using genetic algorithms to perform the feature and instance selection steps with different priorities over different domain problems. In particular, there are four different data preprocessing approaches: feature selection, instance selection, feature selection + instance selection, and instance selection + feature selection. The small-scale experimental results show that performing feature selection first and instance selection second can make the classifiers provide slightly better classification results than performing instance selection first and feature selection second. However, the classifiers utilizing a combination of feature and instance selection perform slightly worse than the ones using feature selection or instance selection individually. On the other hand, in the large-scale experiments, the classifiers based on a combination of feature and instance selection sometimes perform better than those based on feature or instance selection alone. However, it is hard to identify a clear winner among these four data preprocessing approaches, since the differences between them are small. Consequently, the computational cost of training classifiers becomes another important indicator for assessing these data preprocessing methods. The time complexity analysis shows that the combination of feature and instance selection greatly reduces the computational cost of training classifiers. As a result, the combination of feature and instance selection can be seen as a suitable solution for data preprocessing on large datasets.
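One rough way to see why the combination is cheaper is to count the work a brute-force k-NN classifier does per query, which grows with the number of stored samples times the number of features. The sketch below is an illustration under that assumption only: `knn_query_cost` is a hypothetical helper, the dataset sizes are invented rather than taken from the paper, and it does not reproduce the paper's time complexity analysis (SVM training cost, for instance, scales differently).

```python
def knn_query_cost(num_samples: int, num_features: int) -> int:
    """Coordinate differences a brute-force k-NN classifier computes to label one query."""
    return num_samples * num_features

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
i, m = 10_000, 100   # original dataset D: i samples, m features
j, n = 4_000, 30     # reduced dataset D1: j samples, n features

costs = {
    "baseline": knn_query_cost(i, m),
    "feature selection only": knn_query_cost(i, n),
    "instance selection only": knn_query_cost(j, m),
    "feature + instance selection": knn_query_cost(j, n),
}
for name, cost in costs.items():
    print(f"{name}: {cost / costs['baseline']:.2%} of baseline")
# baseline: 100.00% of baseline
# feature selection only: 30.00% of baseline
# instance selection only: 40.00% of baseline
# feature + instance selection: 12.00% of baseline
```

The two reductions multiply, so with these made-up sizes the combined approach costs less per query than either single step.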
More specifically, performing feature selection first and instance selection second allows the SVM and k-NN classifiers to provide classification accuracies similar to those obtained without data preprocessing. Our research findings provide guidelines for implementing suitable baselines for performing feature and instance selection. Several issues can be considered in the future. First, the baselines can be compared with existing algorithms that perform feature and instance selection simultaneously, in terms of classification accuracy and computational complexity. Second, it would be useful to examine the performance of combining different feature and instance selection algorithms as a hybrid approach. For example, the feature selection task could be based on a genetic algorithm, principal component analysis, or information gain, whereas the instance selection task could be approached by methods such as DROP3.