Conversion of categorical variables into numerical variables via Bayesian network classifiers for binary classifications
Article code | Publication year | Length of English article |
---|---|---|
29012 | 2010 | 19 pages (PDF) |
Publisher : Elsevier - Science Direct
Journal : Computational Statistics & Data Analysis, Volume 54, Issue 5, 1 May 2010, Pages 1247–1265
English abstract
Many pattern classification algorithms, such as Support Vector Machines (SVMs), Multi-Layer Perceptrons (MLPs), and K-Nearest Neighbors (KNNs), require data to consist of purely numerical variables. However, many real-world data sets consist of both categorical and numerical variables. In this paper we suggest an effective method of converting mixed data of categorical and numerical variables into purely numerical data for binary classification. Because the suggested method is based on the theory of learning Bayesian Network Classifiers (BNCs), it is computationally efficient and robust to noise and data loss. The suggested method is also expected to extract sufficient information for estimating a minimum-error-rate (MER) classifier. Simulations on artificial and real-world data sets are conducted to demonstrate the competitiveness of the suggested method when the number of values in each categorical variable is large and BNCs accurately model the data.
English introduction
The primary goal of pattern classification is to estimate a classification function, i.e., a classifier, from labeled training patterns so that the estimated classifier correctly assigns class labels to novel test patterns. Some of the most widely used classification algorithms are Support Vector Machines (SVMs), Multi-Layer Perceptrons (MLPs), and K-Nearest Neighbors (KNNs). SVMs (Vapnik, 1995, Burges, 1998 and Cristianini and Shawe-Taylor, 2000) build a sparsely formulated hyperplane classifier by maximizing a margin criterion. MLPs (Bishop, 1995 and Haykin, 1999) construct networks with two or more layers of nonlinear computation units, and the synaptic weights of each unit are estimated by maximum likelihood estimation. KNNs (Duda et al., 2001) apply a rule that classifies a pattern by assigning it the label most common among its k nearest samples. Many classification algorithms, including the above examples, assume that a pattern is represented as a vector of numerical values. For example, the basic operations common to those algorithms are the computation of dot products and Euclidean distances between patterns and other vectors. However, in many real-world data sets a pattern is represented as a collection of discrete or structured objects. For example, a text is represented as a string of letters, a gene as a sequence of nucleotides, an image as a set of pixels, and so on. In this paper we concentrate on the case in which a pattern is represented as a collection of only two types of values: categorical values and numerical values. That is, each variable in a pattern is of either categorical or numerical type, and a categorical variable takes its values in some finite set of categories. In this case classification algorithms such as SVMs, MLPs, and KNNs are not directly applicable, and one has to either discard the categorical values or convert them into numerical values.
One typical conversion method is to use a single number to represent a categorical value. However, this method depends on an arbitrary ordering of the values in a categorical variable. Alternatively, Hsu et al. (2003) suggest using m binary numbers to represent an m-category variable. Hsu et al. (2003) remark that if a categorical variable does not have too many values, this method is more stable than using a single number to represent the variable.
On the other hand, there has been much research on designing kernel functions for various structured data (Gärtner, 2003 and Shawe-Taylor and Cristianini, 2004). A kernel function is a measure of meaningful similarity between a pair of patterns (Schölkopf and Smola, 2002, Ch.2), and appropriately selected kernel functions have led to improvements in classification performance (Vapnik, 1995, Joachims, 1998, Chapelle et al., 1999 and Pavlidis et al., 2002). The Fisher kernel (Jaakkola and Haussler, 1999) and the marginalized kernel (Tsuda et al., 2002) are typical kernels defined from probabilistic models such as Hidden Markov Models (HMMs), and both have achieved remarkable improvements in biological sequence classification. This implies that defining a kernel on a probabilistic model is a useful way of incorporating prior knowledge and manipulating structured data.
In this paper we propose a new method of converting mixed data of categorical and numerical values into data of numerical values by defining a kernel function from a probabilistic model. First, we define an ideal kernel function for binary classification problems based on the definition of a minimum-error-rate (MER) classifier. Second, we propose to use Bayesian Network Classifiers (BNCs) (Friedman et al., 1997) to accurately estimate the ideal kernel function. The estimation using BNCs allows an effective modeling of the categorical variables, and it is computationally efficient and robust to noise and data loss. Third, we show that the ideal kernel function decomposes into products of simpler kernel functions. This decomposition enables us to present the conversion of the original mixed data into numerical data explicitly. Since the suggested method uses a small number of real numbers to represent a categorical value, the dimensionality of the patterns increases only slightly regardless of the number of values in a categorical variable. Moreover, a simple linear classifier on the converted numerical values can approximate the MER classifier as long as the estimation by BNCs is accurate.
This paper is organized as follows. In Section 2 we describe the mixed data of categorical and numerical variables, and we introduce basic properties of a kernel function. In Section 3 we define the ideal kernel and the MER classifier. In Section 4 we explain the estimation of probabilities for the mixed data using BNCs. In Section 5 we describe the decomposition of the ideal kernel and propose the explicit conversion of the mixed data into numerical data. In Section 6 we present simulation results on artificial and real-world data sets, comparing the suggested method with other typical methods. We discuss the results and future research in Section 7.
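The two typical conversions described in this introduction (a single number per category, and m binary numbers for an m-category variable) can be sketched as follows. This is a minimal illustration, not code from the paper; the helper names `single_number_encode` and `binary_encode` and the toy `colors` data are assumptions made for the example.

```python
# Illustrative sketch only: helper names and data are hypothetical.

def single_number_encode(values, categories):
    """Map each category to its position in `categories` (an arbitrary order)."""
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values]


def binary_encode(values, categories):
    """Map each category to m binary indicators, m = len(categories)."""
    index = {c: i for i, c in enumerate(categories)}
    m = len(categories)
    return [[1 if index[v] == j else 0 for j in range(m)] for v in values]


colors = ["red", "green", "blue", "green"]

# Two equally valid orderings give different single-number codes,
# so distances between encoded patterns depend on an arbitrary choice.
codes_a = single_number_encode(colors, ["red", "green", "blue"])
codes_b = single_number_encode(colors, ["blue", "red", "green"])

# The binary encoding avoids imposing an order on the categories,
# but it adds one dimension per category.
onehot = binary_encode(colors, ["red", "green", "blue"])

print(codes_a)
print(codes_b)
print(onehot)
```

The sketch makes the trade-off visible: the single-number code changes with the chosen ordering, while the binary code grows with the number of categories, which is the situation the suggested method aims to avoid.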
English conclusion
In this paper we suggested effective methods of converting categorical variables into numerical variables for the classification of mixed data of categorical and numerical variables. We suggested using a kernel function based on a probabilistic model to model the mixed data, and we defined the ideal kernel with respect to the minimum-error-rate classifier. Since the ideal kernel function is defined in terms of the posterior probability, Bayesian network classifiers such as naive Bayesian classifiers and tree-augmented naive Bayesian classifiers were applied to model the posterior probability. Learning the Bayesian network classifiers is computationally efficient, and the smoothed parameter estimation method is robust to noise and data loss. Moreover, the suggested decomposition of the ideal kernel explicitly identifies the features extracted from the mixed data. Using the extracted numerical features we can easily reconstruct the estimated minimum-error-rate classifier, and the increase in the dimensionality of input patterns is small and controllable. From the simulations we conclude that the suggested methods can be good alternatives to the typical methods. In particular, the SNR and BNR methods perform worse when the number of values in the categorical variables is large and the order of the values is irrelevant to the class labels, or when the classification algorithms are not properly selected. On the other hand, suggested methods such as PNR.NB1 and PNR.TAN1 perform well when NBs and TANs model the data well and the classification algorithms are properly selected. Designing effective conversions for mixed data requires effective modeling of the data. Therefore, research on generative models more accurate than Bayesian network classifiers is an important challenge. Preprocessing and modeling other kinds of structured data, such as strings and images, are also important problems.
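To make the idea of a probability-based conversion concrete, the sketch below replaces each categorical value with one real number estimated from a naive Bayes model with additive (Laplace) smoothing. It is only an assumption-laden illustration: the exact PNR.NB1 and PNR.TAN1 constructions are not given in this summary, and the choice of the log-likelihood ratio as the feature, the function names `fit_log_ratio_table` and `convert`, the pseudo-count `alpha`, and the toy data are all hypothetical.

```python
import math
from collections import Counter

# Illustrative sketch only: not the paper's PNR construction.

def fit_log_ratio_table(values, labels, categories, alpha=1.0):
    """Estimate log[P(v | y=+1) / P(v | y=-1)] for every category v.

    Additive smoothing with pseudo-count `alpha` keeps the ratio finite
    for category/class pairs that never occur in the training data.
    """
    pos = Counter(v for v, y in zip(values, labels) if y == 1)
    neg = Counter(v for v, y in zip(values, labels) if y == -1)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    m = len(categories)
    table = {}
    for c in categories:
        p_pos = (pos[c] + alpha) / (n_pos + alpha * m)
        p_neg = (neg[c] + alpha) / (n_neg + alpha * m)
        table[c] = math.log(p_pos / p_neg)
    return table


def convert(pattern, tables):
    """Replace each categorical value by a single real number."""
    return [tables[j][v] for j, v in enumerate(pattern)]


# Toy data: two categorical variables, labels in {+1, -1}.
X = [("a", "x"), ("a", "y"), ("b", "x"), ("c", "y"), ("c", "y"), ("b", "y")]
y = [1, 1, 1, -1, -1, -1]
cats = [["a", "b", "c"], ["x", "y"]]

tables = [fit_log_ratio_table([row[j] for row in X], y, cats[j])
          for j in range(len(cats))]
features = [convert(row, tables) for row in X]

# Under the naive Bayes assumption, the log posterior odds are the log
# prior odds plus the sum of the converted features, i.e. a linear rule.
n_pos = sum(1 for t in y if t == 1)
n_neg = sum(1 for t in y if t == -1)
log_prior_odds = math.log(n_pos / n_neg)
scores = [log_prior_odds + sum(f) for f in features]
predictions = [1 if s > 0 else -1 for s in scores]
```

In this sketch each categorical variable contributes one dimension regardless of how many categories it has, and a linear rule on the converted features reproduces the naive Bayes decision, which mirrors the conclusion's remarks about controllable dimensionality and reconstructing the estimated classifier.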