Information granulation based data mining approach for classifying imbalanced data
|Article code||Publication year||English article||Persian translation||Word count|
|21424||2008||14-page PDF||Available on order||7654 words|
Publisher : Elsevier - Science Direct
Journal : Information Sciences, Volume 178, Issue 16, 15 August 2008, Pages 3214–3227
Recently, the class imbalance problem has attracted much attention from researchers in the field of data mining. When learning from imbalanced data, in which most examples are labeled as one class and only a few belong to another, traditional data mining approaches do not have a good ability to predict the crucial minority instances. Unfortunately, many real-world data sets, such as health examination, inspection, credit fraud detection, spam identification, and text mining, are all faced with this situation. In this study, we present a novel model called the “Information Granulation Based Data Mining Approach” to tackle this problem. The proposed methodology, which imitates the human ability to process information, acquires knowledge from Information Granules rather than from numerical data. The method also introduces a Latent Semantic Indexing based feature extraction tool, using Singular Value Decomposition to dramatically reduce the data dimensions. In addition, several data sets from the UCI Machine Learning Repository are employed to demonstrate the effectiveness of our method. Experimental results show that our method can significantly increase the ability to classify imbalanced data.
In recent years, we have seen an increase in research activity on the class imbalance problem. This interest resulted in two workshops, one held by AAAI (American Association for Artificial Intelligence) in 2000 and another by the International Conference on Machine Learning (ICML) in 2003; SIGKDD Explorations also published a special issue in 2004. The problem is caused by imbalanced data, in which one class is represented by a large number of examples while the other is represented by only a few. Imbalanced data create a significant bottleneck in the performance attainable by standard learning methods, which assume a balanced class distribution as shown in Fig. 1, and the problem is regarded as one of the most relevant topics for future machine learning research. When learning from imbalanced data, traditional data mining methods tend to produce high predictive accuracy for the majority class but poor predictive accuracy for the minority class. That is because traditional classifiers seek accurate performance over the full range of instances. They are not suitable for imbalanced learning tasks, since they tend to classify all data into the majority class, which is usually the less important one. Fig. 2 illustrates this situation. If data mining approaches cannot classify minority examples, such as medical diagnoses of an illness or abnormal products in inspection data, the extracted knowledge becomes meaningless and useless. Recently, this problem has been recognized in a large number of real-world domains, such as medical diagnosis, inspection of finished products, identifying the cause of power distribution faults, surveillance of nosocomial infections, prediction of the localization sites of proteins, speech recognition, credit assessment, and functional genomic applications. To address the class imbalance problem, two major groups of techniques have been proposed in the literature.
The first group involves five approaches: (1) under-sampling, in which the minority population is kept intact while the majority population is under-sampled; (2) over-sampling, in which the minority examples are over-sampled so that the desired class distribution is obtained in the training set; (3) cluster-based sampling, in which representative examples are randomly sampled from clusters; (4) moving the decision threshold, in which the researcher adapts the decision thresholds to impose a bias toward the minority class; and (5) adjusting cost matrices, in which prediction accuracy is improved by adjusting the cost (weight) assigned to each class. In addition, Liu et al. presented a weighted rough set method for this problem. However, all of these techniques have some disadvantages. For instance, with over-sampling the computational load increases and overtraining may occur because of the replicated samples. Under-sampling does not take all available training data into account, which corresponds to a loss of available information. Huang et al. indicated that these supervised methods lack a rigorous and systematic treatment of the imbalanced data. The second group is related to Granular Computing (GrC) models. These GrC models, which imitate the human instinct of information processing, can increase classification performance by improving the class imbalance situation. However, they use the concept of sub-attributes to describe Information Granules (IGs), which are collections of objects arranged together based on their similarity, functional adjacency, and indistinguishability. When handling continuous data, the drawback of sub-attributes is that computational loads increase dramatically because a huge number of sub-attributes are generated.
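To make approaches (1) and (2) concrete, the following is a minimal sketch of random under-sampling and random over-sampling on a hypothetical toy data set; the function names and class sizes are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced data set: 90 majority (class 0), 10 minority (class 1).
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

def random_undersample(X, y, majority=0, minority=1, seed=0):
    """Approach (1): keep the minority class intact and randomly drop
    majority examples until both classes are the same size."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority)
    min_idx = np.flatnonzero(y == minority)
    keep = rng.choice(maj_idx, size=min_idx.size, replace=False)
    idx = np.concatenate([keep, min_idx])
    return X[idx], y[idx]

def random_oversample(X, y, majority=0, minority=1, seed=0):
    """Approach (2): keep the majority class intact and replicate minority
    examples (sampling with replacement) until both classes are the same size."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority)
    min_idx = np.flatnonzero(y == minority)
    extra = rng.choice(min_idx, size=maj_idx.size, replace=True)
    idx = np.concatenate([maj_idx, extra])
    return X[idx], y[idx]

Xu, yu = random_undersample(X, y)
Xo, yo = random_oversample(X, y)
print((yu == 0).sum(), (yu == 1).sum())  # 10 10
print((yo == 0).sum(), (yo == 1).sum())  # 90 90
```

Note that over-sampling enlarges the training set (here to 180 examples), which is exactly the source of the extra computational load and the overtraining risk mentioned above, while under-sampling discards 80 of the 90 majority examples.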
Therefore, by introducing the Latent Semantic Indexing (LSI) based feature-extraction technique, this study proposes a novel GrC model called the “Information Granulation Based Data Mining Approach” to solve the class imbalance problem. In addition, for highly skewed data, we present a new IG construction strategy which builds IGs only from majority examples and keeps minority instances intact. Finally, the experimental results show the superiority of our method for classifying imbalanced data.
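The core of LSI-style dimensionality reduction is a truncated SVD: keep only the k largest singular values and project each object onto the corresponding right singular vectors. The sketch below illustrates this on a hypothetical random data matrix; the sizes and the choice k = 2 are assumptions for illustration only.

```python
import numpy as np

# Hypothetical data matrix: 8 objects x 5 attributes (rows = examples).
rng = np.random.default_rng(1)
A = rng.normal(size=(8, 5))

# Truncated SVD: A ~= U_k @ diag(S_k) @ Vt_k, keeping the k largest
# singular values. Projecting onto the top-k right singular vectors
# reduces each object from 5 attributes to k latent features.
k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_reduced = A @ Vt[:k].T          # shape (8, k): the compressed features

# Rank-k reconstruction of the original matrix (sanity check).
A_approx = U[:, :k] * S[:k] @ Vt[:k]
print(A_reduced.shape)  # (8, 2)
```

This is how the sub-attribute explosion described above is tamed: classifiers are trained on the k latent features instead of the full attribute (or sub-attribute) set.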
English Conclusion
Can our method always provide an optimal solution for the class imbalance problem? For which situations is it suitable? In fact, it was in order to answer these questions that we attempted to validate two ideas in Sections 4.3 and 4.4. The first idea, described in Section 4.3, was to improve an imbalanced situation by considering IGs instead of numerical data. Originally, we thought that the within-variance of each class might be the key factor: if the within-variance of the majority class is smaller than that of the minority class, then considering IGs (clusters) can indeed improve a skewed situation. Therefore, we consider the coefficient of variation (VC), a measure of the dispersion of a probability distribution, defined as the ratio of the sample standard deviation σ to the mean μ:

VC = σ/μ    (7)

If the coefficient of the minority class is larger than that of the majority class, the within-variance of the minority class is larger than that of the majority class. From Table 12, we find that the coefficients of the minority class of Credit Screening and Pima are larger than those of the majority class. Compared with methods that operate on numerical data, considering IGs, which are constructed by gathering similar objects together, can indeed improve an imbalanced situation; the results in Section 4.3 confirm this as well. However, we may encounter situations in which the coefficient of variation of the minority class is less than or equal to that of the majority. The second idea was therefore to propose a new IG construction strategy for this situation. As we know, the process of information granulation discards some detailed information, and this reduction affects both majority and minority instances.
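The test in Eq. (7) can be sketched as follows; the two feature vectors are hypothetical values chosen so that the minority class is more dispersed, not data from the paper.

```python
import numpy as np

def coefficient_of_variation(x):
    """VC = sigma / mu, using the sample standard deviation (ddof=1), as in Eq. (7)."""
    x = np.asarray(x, dtype=float)
    return x.std(ddof=1) / x.mean()

# Hypothetical one-dimensional feature values for each class.
majority = np.array([10.0, 10.5, 9.8, 10.2, 9.9, 10.1])
minority = np.array([8.0, 12.0, 15.0, 5.0])

vc_maj = coefficient_of_variation(majority)
vc_min = coefficient_of_variation(minority)

# When VC(minority) > VC(majority), the within-variance of the minority
# class dominates and granulating the data can improve the imbalance.
print(vc_min > vc_maj)  # True
```

Dividing by the mean makes the comparison scale-free, so the two classes can be compared even when their feature magnitudes differ.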
In order to preserve the information of the minority instances and improve the class imbalance situation, our proposed strategy, described in Section 4.4, was to build IGs merely from majority examples and keep the minority examples intact. This technique not only preserves the valuable information of the minority instances, but also improves the skewed class situation. The results of Pima I, Pima II, and BSWD confirmed the benefits of the proposed strategy. To sum up, in this study a novel granular computing model called the “Information Granulation Based Data Mining Approach” was proposed for classifying imbalanced data. Experimental results showed that extracting knowledge from IGs has benefits over building classifiers from numerical data. Without considering class distribution, the advantages of our method include slightly better overall accuracy and much faster execution time than the numerical computing models. The results also show that the proposed method is a possible solution to class imbalance problems: it has an impressive ability to improve classification performance and can dramatically increase the performance of classifying all instances, both majority and minority examples. In addition, this study indicates that introducing the LSI based feature extraction technique (SVD) into the information granulation based data mining model does indeed reduce the number of sub-attributes. This not only improves classification performance, but also saves much execution time and storage space.
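The Section 4.4 strategy (granulate only the majority class, keep the minority intact) can be sketched with clustering as the granulation step. The paper does not specify this exact procedure, so the use of a minimal k-means here, the granule count, and the toy data are all illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means (Lloyd's algorithm), used here as the granulation step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centre, then recompute centres.
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def granulate_majority(X, y, majority=0, n_granules=10, seed=0):
    """Replace majority examples with cluster-centre granules; keep every
    minority example intact (the strategy of Section 4.4)."""
    maj_X = X[y == majority]
    min_X = X[y != majority]
    granules = kmeans(maj_X, n_granules, seed=seed)
    X_new = np.vstack([granules, min_X])
    y_new = np.concatenate([np.full(n_granules, majority), y[y != majority]])
    return X_new, y_new

# Hypothetical skewed data: 90 majority examples, 10 minority examples.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(3, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)
Xg, yg = granulate_majority(X, y, n_granules=10)
print((yg == 0).sum(), (yg == 1).sum())  # 10 10
```

After granulation the training set is balanced (10 majority granules vs. 10 minority instances), yet no minority example has been discarded or duplicated, which is the point of the strategy.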