A new fuzzy c-means clustering model based on the data weighted approach
Article code | Publication year | English article length |
---|---|---|
15545 | 2011 | 20-page PDF |
Publisher : Elsevier - Science Direct
Journal : Data & Knowledge Engineering, Volume 69, Issue 9, September 2010, Pages 881–900
Abstract
This paper proposes a new data weighted fuzzy c-means clustering approach. Unlike most existing fuzzy clustering approaches, it considers the internal connectivity of all data points. A vector of exponent impact factors and an influence exponent are introduced into the new model, and together they influence the clustering process. Data weighted clustering simultaneously produces three categories of parameters: fuzzy membership degrees, exponent impact factors, and cluster prototypes. A new fuzzy algorithm, DWG-K, is developed by combining the data weighted approach with the Gustafson–Kessel (G-K) algorithm. Two groups of numerical experiments were carried out. Group 1 evaluates the clustering performance of the DWG-K against the G-K. The results show that the DWG-K obtains better clustering quality while keeping the same level of computational efficiency as the G-K. Group 2 examines the ability of the DWG-K to mine outliers, with the well-known LOF as the baseline. The results show that the DWG-K has a considerable advantage over the LOF in computational efficiency, and that the outliers mined by the DWG-K are global. The data weighted clustering approach has distinct advantages when mining the outliers of large scale data sets, when clustering a data set for better clustering results, and especially when these two tasks are done simultaneously.
Introduction
Artificial intelligence research and application involve a number of sub-areas. This paper discusses two of them: cluster-based pattern recognition and outlier mining. For a given data set, cluster-based pattern recognition refers to dividing the data set into several patterns using clustering methods. Outlier mining refers to finding the abnormal data points in the data set and mining the information that they contain [1]. On the one hand, cluster-based pattern recognition and outlier mining are closely connected: the outliers in the data set must be detected and appropriately processed, for example replaced by normal points or eliminated outright when necessary, in order to reduce their negative influence on the results of the cluster-based pattern recognition [2]. On the other hand, the two areas treat outliers very differently. In cluster-based pattern recognition, outliers are usually regarded as “harmful”; the main measures are to minimize or eliminate this harm, and the outliers are detected as by-products of the clustering. In outlier mining, the outlier itself becomes the focus, and the main task is to mine the outliers, not merely to detect them. In addition, the methods used in these two areas differ under most circumstances. For outlier mining, the main methods are based on statistics, density, distance, and feature deviation [1]. Cluster-based methods have been reported, but they are not very popular. In practice, there is a common demand that can be summarized as follows: for a given data set that contains a certain number of outliers, the analysis of the data set simultaneously involves three tasks. The first is to cluster the data set: the data set is clustered into several groups, and the degree to which each data point belongs to each prototype is then determined. The second is to establish the classifiers.
The classifiers are established from the cluster prototypes. The third is to mine the outliers, which includes detecting the outliers and discovering the information that they contain. Currently, these tasks are addressed in different areas, although real applications often expect all three to be solved simultaneously. Little research on solving these three tasks simultaneously has been reported. To address this problem, this paper proposes a new fuzzy clustering approach, called the data weighted fuzzy clustering approach. Its core idea is that each data point in the data set is “different” from the others, and that internal connectivity exists among all data points. The approach imposes a constraint that describes this internal connectivity, and introduces a set of exponent impact factors and an influence exponent into the new objective function. Together they influence the clustering process and achieve the goal of treating each data point differently. Because the data weighted clustering approach considers the internal connectivity of all data points, it has a strong ability to handle outliers. When it is used to cluster a data set, it not only obtains much better clustering quality than the existing clustering models, but also effectively detects the outliers and readily mines information related to them. In contrast, most existing fuzzy clustering models neglect the internal connectivity of the data points and treat them all equally during clustering. This paper gives the theoretical model of the data weighted fuzzy clustering approach, and presents numerical experiments that verify its performance in clustering and in mining outliers, particularly when the two are done simultaneously.
The paper is organized as follows. Section 2 reviews related work on existing fuzzy clustering and outlier mining. Section 3 describes the data weighted fuzzy clustering approach: first, the mathematical model and its update equations are derived; second, a conventional fuzzy clustering algorithm, Gustafson–Kessel (hereinafter G-K), is introduced and combined with the proposed data weighted approach, yielding the data weighted G-K algorithm, DWG-K for short. Section 4 tests and verifies the DWG-K through numerical experiments on two real data sets. The experiments are divided into two groups: group one verifies the clustering performance of the DWG-K against the G-K, and group two verifies the ability of the DWG-K to mine outliers against the LOF. Section 5 discusses the roles of the two new parameters: the exponent impact factors and the influence exponent. Section 6 concludes the paper.
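For readers unfamiliar with the terms used in this introduction, the following is a minimal sketch of the standard fuzzy c-means iteration, which alternates between updating fuzzy membership degrees and cluster prototypes. It is not the DWG-K, nor the G-K: the data weighted update equations are not reproduced in this excerpt, so the exponent impact factors and the influence exponent are deliberately left out. The function names, the toy data, and the initial prototypes are assumptions made for illustration only.

```python
import math

def update_memberships(points, prototypes, m=2.0):
    """Standard FCM rule: u[i][j] is the membership of point j in cluster i."""
    power = 2.0 / (m - 1.0)
    u = [[0.0] * len(points) for _ in prototypes]
    for j, x in enumerate(points):
        dists = [math.dist(x, v) for v in prototypes]
        hits = [i for i, d in enumerate(dists) if d == 0.0]
        if hits:
            # Point sits exactly on a prototype: share full membership among those.
            for i in hits:
                u[i][j] = 1.0 / len(hits)
            continue
        for i, d_i in enumerate(dists):
            u[i][j] = 1.0 / sum((d_i / d_k) ** power for d_k in dists)
    return u

def update_prototypes(points, u, m=2.0):
    """Each prototype is the mean of all points, weighted by membership ** m."""
    dim = len(points[0])
    prototypes = []
    for row in u:
        weights = [w ** m for w in row]
        total = sum(weights)
        prototypes.append(tuple(
            sum(w * x[d] for w, x in zip(weights, points)) / total
            for d in range(dim)
        ))
    return prototypes

def fcm(points, prototypes, m=2.0, iterations=50):
    """Alternate the two updates for a fixed number of iterations."""
    for _ in range(iterations):
        u = update_memberships(points, prototypes, m)
        prototypes = update_prototypes(points, u, m)
    return update_memberships(points, prototypes, m), prototypes

# Hypothetical toy data: two tight groups plus one far-away point (index 6).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 30)]
u, prototypes = fcm(points, prototypes=[(0.0, 0.0), (10.0, 10.0)])
print("prototypes:", prototypes)
print("memberships of the far-away point:", u[0][6], u[1][6])
```

In this plain version every point enters the prototype update through its membership alone, so the far-away point drags the second prototype away from its group. This is the sense in which existing models "treat them all equally", and it is the behavior the data weighted approach is meant to improve on.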
Conclusion
This paper reports a new data weighted fuzzy clustering approach. Its innovation lies in accounting for the internal connectivity among all data points in the data set during clustering. To this end, an exponent impact factor vector E, whose component EIF(j) (j = 1…n) corresponds to the data point xj (j = 1…n), and an influence exponent s are introduced into the objective function of the data weighted fuzzy clustering model. Existing fuzzy clustering methods mainly adjust the clustering process through a single channel, (U, m). Leaving this channel unchanged, the data weighted fuzzy clustering model introduces a second channel, (E, s). Through the two channels, the model realizes the idea that different data points should be handled differently. Additionally, the model imposes the constraint that “the product is equal to one” on the factors. If a “sum is one” constraint were applied instead, the model would most likely fail to run on very large data sets, because the individual factors in the sum would become too small. Apart from the influence exponent s and the fuzzy exponent m, which are given, the approach requires no other user-defined parameters. E and U are updated from their initial values, so applying the approach does not require additional knowledge or experience. Moreover, s is a key parameter: better clustering performance can be achieved by choosing a suitable s, and s can also be used to adjust the iteration speed. The latter is significant when mining outliers in large scale or dynamic data sets. As an application of the approach, the DWG-K is developed by integrating the data weighted approach with the existing G-K. The DWG-K simultaneously outputs three categories of parameters: the fuzzy membership degree matrix U, the exponent impact factor vector E, and the cluster prototype matrix V.
They are used, respectively, to cluster the data, mine the outliers, and establish the classifiers. The DWG-K shows its distinct advantage by producing these three outputs simultaneously. The outliers mined by the DWG-K are global: they express the relationship between the outliers and the whole data set. In addition, because the DWG-K is fundamentally a clustering method, the outliers it mines carry rich information, for example which clusters they belong to. The DWG-K is therefore well suited to mining outliers. Finally, when the DWG-K is used for clustering, it has a level of computational efficiency similar to that of most existing clustering models. When it is used for outlier mining, it holds a considerable advantage in computational efficiency over the well-known density-based LOF. More efficient and targeted algorithms could potentially be developed by integrating the data weighted clustering approach with existing clustering models. Future work includes studying more generalized forms of the data weighting functions. In particular, an experimental method for selecting the influence exponent s is presented, but the mathematical reasoning and verification behind it have not been given; this remains to be addressed in future research.
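The remark in the conclusion about the “product is equal to one” constraint versus a “sum is one” constraint can be illustrated numerically. The sketch below is an assumption-laden illustration, not the paper's update rule: the raw positive scores are random placeholders, and both normalization helpers are hypothetical names. It only shows how the scale of the factors behaves under each constraint as the number of data points n grows; how the factors enter the objective function is not shown in this excerpt.

```python
import math
import random

def normalize_sum_to_one(raw):
    """Rescale positive scores so that they add up to one."""
    total = sum(raw)
    return [r / total for r in raw]

def normalize_product_to_one(raw):
    """Rescale positive scores so that their product is one."""
    # Divide by the geometric mean, computed in log space for numerical safety.
    mean_log = sum(math.log(r) for r in raw) / len(raw)
    return [math.exp(math.log(r) - mean_log) for r in raw]

random.seed(0)  # hypothetical raw scores, fixed seed for repeatability
results = {}
for n in (10, 1_000, 100_000):
    raw = [random.uniform(0.5, 2.0) for _ in range(n)]
    by_sum = normalize_sum_to_one(raw)
    by_product = normalize_product_to_one(raw)
    results[n] = (by_sum, by_product)
    print(f"n={n}: largest sum-normalized factor = {max(by_sum):.2e}, "
          f"product-normalized range = [{min(by_product):.2f}, {max(by_product):.2f}]")
```

Under the sum constraint each factor is on the order of 1/n and keeps shrinking as the data set grows, whereas under the product constraint the factors stay centered around one regardless of n.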