یک الگوریتم ژنتیکی افزایشی برای طبقه بندی و تجزیه و تحلیل حساسیت پارامترهای آن
کد مقاله | سال انتشار | تعداد صفحات مقاله انگلیسی |
---|---|---|
26458 | 2011 | 12 صفحه PDF |
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Expert Systems with Applications, Volume 38, Issue 3, March 2011, Pages 2609–2620
چکیده انگلیسی
Traditionally, data mining tasks such as classification and clustering are performed on data warehouses. Usually, updates are collected and applied to the data warehouse frequent time periods. For this reason, all patterns derived from the data warehouse have to be updated frequently as well. Due to the very large volumes of data, it is highly desirable to perform these updates incrementally. This study proposes a new incremental genetic algorithm for classification for efficiently handling new transactions. It presents the comparison results of traditional genetic algorithm and incremental genetic algorithm for classification. Experimental results show that our incremental genetic algorithm considerably decreases the time needed for training to construct a new classifier with the new dataset. This study also includes the sensitivity analysis of the incremental genetic algorithm parameters such as crossover probability, mutation probability, elitism and population size. In this analysis, many specific models were created using the same training dataset but with different parameter values, and then the performances of the models were compared.
مقدمه انگلیسی
Data mining is the process of extracting hidden patterns from large datasets. Data mining has been widely used in many areas, such as marketing, banking and finance, medicine and manufacturing. Main data mining tasks and methods include classification, clustering, association rule mining, sequential pattern mining and outlier detection. Classification is a procedure in which individual items are placed into groups based on quantitative information on some characteristics inherent in the items. In the classification process, a collection of labelled classes is provided and a training set is used to learn the descriptions of classes. Classification rules are discovered and then these rules are used to determine the most likely label of a new pattern. The most widely used classification techniques are neural networks, decision trees, k-nearest neighbours, support vector machines, and Naive Bayes. Although, genetic algorithm (GA) is not one of the most widely used classifiers, several studies (Kim and Ryu, 2005, Nath et al., 2005 and Zhu and Guan, 2004) show that this technique can also be successfully used for classification. These works include traditional genetic algorithm and don’t support the incremental usage of the models when new data is added to the existing dataset. In addition, they don’t evaluate the performance of the models which are constructed by different values of GA parameters. A data warehouse is typically not updated immediately when insertions on the operational databases occur. New records are collected over a period of time and then data warehouse is updated at the end of this period. After that, genetic algorithm should be re-run again on the whole dataset in order to discovery classification rules again. Due to the very large size of the datasets, it is highly desirable to run this algorithm incrementally. This study focuses on the problem of incrementally updating classification rules on changes of the database. It introduces a new incremental genetic algorithm for classification. The purpose of this study is to reduce the execution time of training process when new transactions are inserted. Based on the experimental results, it can be proven that the incremental GA yields significant speed-up factors over traditional GA even for large numbers of new transactions in a data warehouse. The performance evaluation of incremental GA on the “Nursery” dataset is presented, demonstrating the efficiency of the proposed algorithm. This study also presents the sensitivity analysis of the GA parameters such as crossover probability, mutation probability, with/without elitism and population size. The aim of this analysis is to evaluate the performances of the classification models which are constructed using the same training dataset with different GA parameter values. Each classification model (classifier) consists of input parameters (crossover and mutation probabilities, population size etc.), applied techniques (parent selection type, crossover type, different termination criteria etc.), and outputs (average fitness value, classification rules etc.) related to training process. The models are compared by applying n-fold cross validation method. In order to implement all these experiments, we developed the tool, named Generic Genetic Classifier Tool. The rest of this paper is organized as follows: Section 2 includes related works on traditional genetic algorithm and classification using genetic algorithm. Section 3 briefly introduces the principles and basic concepts of incremental GA-based classification. Section 4 presents the new algorithm which can be used to incrementally update the classification rules on insertions of the database Section 5 reports an extensive performance evaluation and experimental results. Section 6 concludes with a summary and future work.
نتیجه گیری انگلیسی
Classification is a data mining task that assigns items in a dataset to target classes. Although genetic algorithms are less commonly used for classification in commercial data mining systems, this technique shows its strength in certain applications. In this study, an incremental GA-based classification is proposed for efficiently handling new transactions added into the dataset. The purpose of the algorithm is to decrease time needed for training to construct a new classifier with new dataset. Numerous trials of incremental GA were run for the classification problem to determine the effects of various parameters on the performance of the algorithm. Because of the stochastic nature of the algorithm, several runs were made at each parameter setting to obtain an average. Parameters included crossover probability, mutation probability, with/without elitism, and population size. In addition, the performances of two different GA implementations (traditional and incremental) were compared according to the two different termination criteria (generation number and fitness value). All these experimental results shows the reduction in average training time and improvement in score over the population and best individual score within the population at each generation. Incremental GA gives much better performance by adding rules into initial population before training operation when new transactions are added into the dataset. Genetic algorithms are easily parallelizable when using for classification. In future studies, incremental GA can be run in parallel to make training for different classes at the same time. For example; if there are five classes in the dataset then five computers can create an initial population, make training operation and find classification rule(s) for one class at the same time. After that, all classification rules found by different computers can be gathered together to form complete classifier model.