اکتشاف رده کمیاب
|کد مقاله||سال انتشار||مقاله انگلیسی||ترجمه فارسی||تعداد کلمات|
|20218||2014||14 صفحه PDF||سفارش دهید||13687 کلمه|
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Expert Systems with Applications, Volume 41, Issue 9, July 2014, Pages 4197–4210
Rare category discovery aims at identifying unlabeled data examples of rare categories in a given data set. The existing approaches to rare category discovery often need a certain number of labeled data examples as the training set, which are usually difficult and expensive to acquire in practice. To save the cost however, if these methods only use a small training set, their accuracy may not be satisfactory for real applications. In this paper, for the first time, we propose the concept of rare category exploration, aiming to discover all data examples of a rare category from a seed (which is a labeled data example of this rare category) instead of from a training set. To this end, we present an approach known as the FRANK algorithm which transforms rare category exploration to local community detection from a seed in a kNN (k-nearest neighbors) graph with an automatically selected k value. Extensive experimental results on real data sets verify the effectiveness and efficiency of our FRANK algorithm.
Rare categories (a.k.a. rare classes) are pervasive in real life. In medical diagnoses, although most of the diseases are well learned by current medicine, doctors still encounter rare diseases which may be new to the world. In the area of finances (Bay, Kumaraswamy, Anderle, Kumar, & Steier, 2006), the majority of transactions are legitimate, but there are still several fraudulent ones. In many applications, a data example of a rare category which cannot be classified into any of the known majority categories is usually discovered by an accident or via a detection approach such as rare category detection (He and Carbonell, 2007, He and Carbonell, 2009, He et al., 2008, Huang et al., 2011, Huang et al., 2013, Pelleg and Moore, 2004 and Vatturi and Wong, 2009) which aims at finding out at least one data example from each rare category in an unlabeled data set. When we happen to discover a data example of a rare category, an intuitive idea is how to find out the remaining data examples of the rare category based on this discovered one as a seed. In this paper, we refer to this issue as rare category exploration. Rare category exploration has various practical applications, especially when we are only interested in data examples from the same rare category of a given data example. Besides the above areas of medical discovery and financial security, rare category exploration can also discover a series of similar feature changes on the surface of the earth through mass remote sensing images based on a specified one (Porter, Hush, Harvey, & Theiler, 2010), distinguish near-duplicate copies of a labeled spam image from a large number of non-spam images (Wang, Josephson, Lv, Charikar, & Li, 2007), and identify all other malicious activities with the same pattern of a detected network intrusion in a huge-volume traffic data set (Wu, Xiong, Wu, & Chen, 2007). Among the existing research, although the imbalanced classification (e.g., Wu et al., 2007) and semi-supervised learning (e.g., Zhou, Weston, Gretton, Bousquet, & Schölkopf, 2003) can also be used to identify data examples of rare categories from those of majority categories, these two tasks are slightly different from rare category exploration from the following two aspects. (1) First, rare category exploration focuses on accurately discovering all other data examples of a rare category from only one seed which is a single labeled data example of this rare category, while imbalanced classification and semi-supervised learning often need some labeled data examples to achieve better classification performance in practice. (2) Second, rare category exploration only focuses on data examples with rare category characteristics such as compactness ( He and Carbonell, 2007, He and Carbonell, 2009, He et al., 2008, He et al., 2010, Huang et al., 2013, Huang et al., 2011 and Vatturi and Wong, 2009) and isolation ( Hospedales et al., 2013, Huang et al., 2013, Pelleg and Moore, 2004 and Vatturi and Wong, 2009). In contrast, rather than specifically focusing on rare category discovery, most of the approaches to imbalanced classification and semi-supervised learning mainly focus on constructing a classifier that optimizes a discriminative criterion for both majority and rare categories; therefore, they usually do not take full advantage of the characteristics of rare categories ( He et al., 2010). In this paper, we propose an approach known as FRANK (Fast Rare cAtegory exploratioN using a K-nearest neighbors graph) algorithm to rare category exploration. By our approach, we firstly construct a kNN graph from a given data set with an automatically selected k value, and transform rare category exploration to local community detection from a seed in the kNN graph. During the exploration process, the local community keeps absorbing external data examples that are similar to its internal data examples until there is no more improvement of the local community quality evaluated by our proposed Compactness-Isolation (CI) metric. Finally, data examples in the local community are output as the candidate data examples of the rare category. Our contributions can be summarized as follows. (1) To the best of our knowledge, it is the first time that we explicitly identify and solve the problem of rare category exploration. (2) We propose an efficient algorithm known as FRANK for rare category exploration which outperforms the existing approaches in terms of finding out the remaining data examples of the objective rare category from a seed. (3) We provide an automatic selection of the k value to construct a suitable kNN graph for FRANK, which can help further improve the performance of our FRANK algorithm. Note that since the automatic k selection is a distance-based approach which is also under a curse of dimensionality, our FRANK algorithm is not intended for high-dimensional data sets. Experimental results demonstrate that FRANK can handle data sets with up to 20 dimensions in satisfactory performance. The remaining sections are organized as follows. We review the related work in Section 2 and present our problem statement in Section 3. In Section 4, we first introduce our assumption about rare categories, based on which we then transform our rare category exploration to local community detection in a kNN graph constructed from a given data set and propose a local community metric. In Section 5, we introduce the k selection for the kNN graph construction, and present our FRANK algorithm in Section 6, following which we report the experimental results in Section 7 before concluding the paper in Section 8.
نتیجه گیری انگلیسی
In this paper, for the first time, we have identified and solved the problem of rare category exploration. Our key contribution is twofold. (1) We have proposed a fast algorithm known as FRANK for rare category exploration which has outperformed the existing related works in terms of finding out the remaining data examples of the objective rare category from a seed. (2) We have provided an automatic selection of the kk value to construct a suitable kkNN graph for FRANK, which can help further improve the performance of our FRANK algorithm . So far the experimental results have demonstrated that FRANK has been capable of handling data sets with up to about 20 dimensions. For the next stage of study, we will focus on rare category exploration in a high-dimensional feature space. To this end, we are going to (1) adopt some distance metrics specifically designed for high-dimensional data to help the algorithm avoid the curse of dimensionality, and (2) extend the processing framework of FRANK with co-clustering techniques to support a co-selection of data examples and relevant features for each objective rare category, such that these two steps benefit from each other.