Download English ISI Article No. 29509
Article Title

Some issues about outlier detection in rough set theory
Article Code: 29509
Publication Year: 2009
English Article Length: 8 pages (PDF)
Source

Publisher: Elsevier - Science Direct

Journal: Expert Systems with Applications, Volume 36, Issue 3, Part 1, April 2009, Pages 4680–4687

Keywords
Outlier detection, Rough sets, Distance metric, KDD

English Abstract

“One person’s noise is another person’s signal” (Knorr, E., & Ng, R. (1998). Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th VLDB conference, New York (pp. 392–403)). In recent years, much attention has been given to the problem of outlier detection, whose aim is to detect outliers – objects which behave in an unexpected way or have abnormal properties. Detecting such outliers is important for many applications, such as uncovering criminal activities in electronic commerce, computer intrusion attacks, terrorist threats, and agricultural pest infestations, and it is critically important in today’s information-based society. In this paper, we discuss some issues about outlier detection in rough set theory, which emerged about 20 years ago and is nowadays a rapidly developing branch of artificial intelligence and soft computing. First, we propose a novel definition of outliers in information systems of rough set theory – sequence-based outliers. An algorithm for finding such outliers is also given. The effectiveness of the sequence-based method for outlier detection is demonstrated on two publicly available databases. Second, we introduce traditional distance-based outlier detection to rough set theory and discuss definitions of distance metrics for distance-based outlier detection in rough set theory.

English Introduction

Knowledge discovery in databases (KDD), or data mining, is an important issue in the development of data- and knowledge-based systems. Usually, knowledge discovery tasks can be classified into four general categories: (a) dependency detection, (b) class identification, (c) class description, and (d) outlier/exception detection (Knorr & Ng, 1998). In contrast to most KDD tasks, such as clustering and classification, outlier detection aims to find small groups of data objects that are exceptional when compared with the remaining large amount of data, in terms of certain sets of properties. For many applications, such as fraud detection in E-commerce, it is more interesting to find the rare events than the common ones from a knowledge discovery standpoint. Studying the extraordinary behaviors of outliers can help us uncover the valuable information hidden behind them. Recently, researchers have begun focusing on outlier detection and have attempted to apply outlier-finding algorithms to tasks such as fraud detection (Bolton & Hand, 2002), identification of computer network intrusions (Eskin et al., 2002; Lane & Brodley, 1999), data cleaning (Rulequest Research), detection of employers with poor injury histories (Knorr, Ng, & Tucakov, 2000), and peculiarity-oriented mining (Zhong, Yao, Ohshima, & Ohsuga, 2001).

Outliers exist extensively in the real world and are generated from different sources, such as a heavily tailed distribution or errors in inputting the data. While there is no single, generally accepted, formal definition of an outlier, Hawkins’ definition captures the spirit: “an outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” (Hawkins, 1980; Knorr & Ng, 1998). With increasing awareness of outlier detection in the literature, more concrete meanings of outliers have been defined for solving problems in specific domains. Nonetheless, most of these definitions follow the spirit of Hawkins’ definition (Chiu & Fu, 2003).

Roughly speaking, current approaches to outlier detection can be classified into the following five categories (Kovács, Vass, & Vidács, 2004):

(1) The distribution-based approach is the classical method in statistics. It is based on standard distribution models (Normal, Poisson, etc.), and objects which deviate from the model are recognized as outliers (Rousseeuw & Leroy, 1987). Its greatest disadvantage is that the distribution of the measurement data is unknown in practice; often a large number of tests are required to decide which distribution model, if any, the data follow.

(2) The depth-based approach is based on computational geometry: it computes different layers of k-d convex hulls and flags objects in the outer layer as outliers (Johnson, Kwok, & Ng, 1998). However, the algorithms employed suffer from the curse of dimensionality and cannot cope with a large k.

(3) The clustering approach classifies the input data and detects outliers as by-products (Jain, Murty, & Flynn, 1999). However, since its main objective is clustering, it is not optimized for outlier detection.

(4) The distance-based approach was originally proposed by Knorr and Ng (Knorr & Ng, 1998; Knorr et al., 2000). An object o in a data set T is a distance-based outlier if at least a fraction p of the objects in T are further than distance D from o (a small code sketch of this definition follows the list). This definition is based on a single, global criterion determined by the parameters p and D, so problems may occur if the characteristics of the data vary greatly across different regions of the data set.

(5) The density-based approach was originally proposed by Breunig, Kriegel, Ng, and Sander (2000). A Local Outlier Factor (LOF) is assigned to each sample based on its local neighborhood density, and samples with a high LOF value are identified as outliers. The disadvantage of this solution is that it is very sensitive to the parameters defining the neighborhood.
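To make the distance-based definition in item (4) concrete, the following is a minimal sketch of a DB(p, D)-outlier test in Python; the Euclidean metric, the toy data, and the parameter values are illustrative assumptions and are not taken from the paper.

from math import dist  # Euclidean distance, available in Python 3.8+

def db_outliers(points, p, D):
    """Return the points for which at least a fraction p of the other
    points lie further away than distance D (a DB(p, D)-outlier test)."""
    outliers = []
    for o in points:
        far = sum(1 for x in points if x is not o and dist(o, x) > D)
        if far >= p * (len(points) - 1):
            outliers.append(o)
    return outliers

# Toy data: four clustered points and one isolated point.
data = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.0), (9.0, 9.0)]
print(db_outliers(data, p=0.9, D=3.0))  # only (9.0, 9.0) is reported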
Rough set theory, introduced by Zdzislaw Pawlak in the early 1980s (Pawlak, 1982; Pawlak, 1991; Pawlak et al., 1995), is for the study of intelligent systems characterized by insufficient and incomplete information, and is motivated by practical needs in classification and concept formation. The rough set philosophy is based on the assumption that with every object of the universe there is associated a certain amount of information (data, knowledge), expressed by means of some attributes; objects having the same description are indiscernible. In recent years, there has been fast-growing interest in rough set theory, and successful applications of the rough set model to a variety of problems have demonstrated its importance and versatility.

To the best of our knowledge, there is no existing work on outlier detection in the rough set community. The aim of this work is to combine rough set theory and outlier detection, showing how outlier detection can be done in rough set theory. We suggest two different ways to achieve this aim. First, we propose sequence-based outlier detection in information systems of rough set theory. Second, we introduce traditional distance-based outlier detection to rough set theory.

This paper is organized as follows. In the next section, we introduce some preliminaries of rough set theory and outlier detection. In Section 3, we give some definitions concerning sequence-based outliers in information systems of rough set theory. The basic idea is as follows. Given an information system IS = (U, A, V, f), where U is a non-empty finite set of objects, A is a set of attributes, V is the union of the attribute domains, and f is a function such that f(x, a) ∈ V_a for any x ∈ U and a ∈ A. Each attribute subset B ⊆ A determines an indiscernibility (equivalence) relation IND(B) on U, so for every object x ∈ U we can obtain its equivalence class under IND(B). If we gradually shrink the attribute subset B, the partition U/IND(B) becomes coarser and the equivalence class of every object x ∈ U becomes larger. So if there is an object in U whose equivalence class never varies, or grows only a little in comparison with those of the other objects in U, we may consider this object a sequence-based outlier in U with respect to IS. An algorithm to find sequence-based outliers is also given.
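As a rough illustration of this idea (not the paper's actual algorithm or outlier factor), the sketch below computes the equivalence class of each object under IND(B) for a shrinking sequence of attribute subsets and prints how the class sizes grow; the toy information system and the order in which attributes are dropped are assumptions made purely for the example.

def equivalence_class(U, f, B, x):
    """Objects of U indiscernible from x on every attribute in B, i.e. x's class under IND(B)."""
    return {y for y in U if all(f[y][a] == f[x][a] for a in B)}

def class_growth(U, f, subsets, x):
    """Sizes of x's equivalence class along a sequence of shrinking attribute subsets."""
    return [len(equivalence_class(U, f, B, x)) for B in subsets]

# Toy information system: f[object][attribute] -> nominal value.
U = ["o1", "o2", "o3", "o4", "o5"]
f = {
    "o1": {"a": "y", "b": "u", "c": "m"},
    "o2": {"a": "y", "b": "u", "c": "n"},
    "o3": {"a": "y", "b": "v", "c": "m"},
    "o4": {"a": "y", "b": "u", "c": "m"},
    "o5": {"a": "z", "b": "w", "c": "k"},  # described differently from the rest
}
subsets = [{"a", "b", "c"}, {"a", "b"}, {"a"}]  # B shrinks step by step
for x in U:
    print(x, class_growth(U, f, subsets, x))
# o5's class stays {o5} for every subset while the other classes keep growing,
# which is the kind of behavior the text above flags as a sequence-based outlier.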
In Section 4, we apply traditional distance-based outlier detection to rough set theory. Since classical rough set theory is better suited to nominal attributes, we propose revised definitions of two traditional distance metrics for distance-based outlier detection in rough set theory, the overlap metric and the value difference metric, both of which are designed specifically for nominal attributes. Experimental results are given in Section 5, and Section 6 discusses the advantages of our sequence-based approach by comparing it with other approaches to outlier detection. Section 7 concludes the paper.
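For reference, here is a minimal sketch of the two metrics named in the Section 4 outline above, in their standard forms for nominal attributes rather than the revised rough-set versions proposed in the paper; the toy objects, the decision labels, and the exponent q = 1 are illustrative assumptions.

from collections import Counter

def overlap_distance(x, y, attributes):
    """Overlap metric: number of attributes on which x and y take different values."""
    return sum(1 for a in attributes if x[a] != y[a])

def vdm_distance(x, y, attributes, objects, labels, q=1):
    """Value difference metric: sum over attributes of
    sum_c |P(c | a = x[a]) - P(c | a = y[a])|^q, where c ranges over decision classes."""
    classes = set(labels.values())
    total = 0.0
    for a in attributes:
        count_x = Counter(labels[o] for o, v in objects.items() if v[a] == x[a])
        count_y = Counter(labels[o] for o, v in objects.items() if v[a] == y[a])
        n_x, n_y = sum(count_x.values()), sum(count_y.values())
        total += sum(abs(count_x[c] / n_x - count_y[c] / n_y) ** q for c in classes)
    return total

# Toy information system with a decision (class) attribute.
objects = {
    "o1": {"color": "red",  "shape": "round"},
    "o2": {"color": "red",  "shape": "square"},
    "o3": {"color": "blue", "shape": "round"},
}
labels = {"o1": "pos", "o2": "pos", "o3": "neg"}
attrs = ["color", "shape"]
print(overlap_distance(objects["o1"], objects["o3"], attrs))               # 1
print(vdm_distance(objects["o1"], objects["o3"], attrs, objects, labels))  # 2.0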

English Conclusion

Outlier detection is becoming critically important in many areas. In this paper, we extended outlier detection to Pawlak’s rough set theory, which has become a popular method for KDD, largely due to its ability to handle uncertain and/or incomplete data. First, we proposed a new sequence-based outlier detection method in information systems of rough set theory. Experimental results on real data sets demonstrate the effectiveness of our method for outlier detection. Next, we introduced Knorr and Ng’s distance-based outlier detection to rough set theory. In order to calculate the distance between any two objects in an information system, we gave two revised definitions of the traditional distance metrics for nominal attributes in rough set theory. In future work, for the sequence-based outlier detection algorithm, we shall consider using reducts to work with a smaller number of attributes while preserving its performance. For distance-based outlier detection, since the proposed measures are all global ones, we shall consider using more local versions to obtain better performance for the distance-based method.