Download English ISI Article No. 25272

English Title
Novel linear programming approach for building a piecewise nonlinear binary classifier with a priori accuracy
Article Code: 25272 | Year of Publication: 2012 | Length: 12 pages (PDF)
Source

Publisher: Elsevier - Science Direct

Journal: Decision Support Systems, Volume 52, Issue 3, February 2012, Pages 717–728

Keywords
Binary classification, Support vector machines, Artificial neural networks, Classification trees, Linear programming, Piecewise discriminator
Article Preview

English Abstract

This paper describes a novel approach to build a piecewise (non)linear surface that separates individuals from two classes with an a priori classification accuracy. In particular, total classification with a good generalization level can be obtained, provided no individual belongs to both classes. The method is iterative: at each iteration a new piece of the surface is found via the solution of a linear programming model. Theoretically, the larger the number of iterations, the better the classification accuracy on the training set; numerically, we also found that the generalization ability does not deteriorate on the cases tested. Nonetheless, we have included a procedure that computes a lower bound on the number of errors that will be generated in any given validation set. If needed, an early stopping criterion is provided. We also show that each piece of the discriminating surface is equivalent to a neuron of a feed-forward neural network (FFNN); as a byproduct we are providing a novel training scheme for FFNNs that avoids the minimization of non-convex functions which, in general, present many local minima. We compare this algorithm with a new linear SVM that needs no pre-tuning and has an excellent performance on standard and synthetic data. Highly encouraging numerical results are reported on synthetic examples, on the Japanese Bank dataset, and on medium and small datasets from the Irvine repository of machine learning databases.

English Introduction

Let us first mention minor peculiarities of our notation: $\mathbb{R}^n$ denotes the $n$-dimensional Euclidean space, lowercase Latin letters are vectors in $\mathbb{R}^n$, $x_{ik}$ is the $k$-th component of $x_i$, and uppercase cursive letters $\mathcal{P}, \mathcal{N}$, etc., denote sets in $\mathbb{R}^n$. Given the sets $\mathcal{D}$ and $\mathcal{H}$, the difference set $\mathcal{D} - \mathcal{H}$ consists of those members of $\mathcal{D}$ that do not belong to $\mathcal{H}$. Lowercase Greek letters denote scalars. The rest of the notation is rather standard.

This paper is concerned with the binary classification problem (BCP) of determining a discriminating function $h(\cdot) : \mathbb{R}^n \to \mathbb{R}$ that separates individuals belonging to two different classes, say, class $\mathcal{P}$ and class $\mathcal{N}$. This is a classical problem that still has vast applications in several areas [4]. This work solves the BCP by successive linear programming (LP) models, which generate the nodes of a decision tree; each node classifies a group of individuals of one class. We assume that an individual $x$ is characterized by its $n$ features $x_1, \ldots, x_n$ and its class indicator $c(x)$,

$c(x) = \begin{cases} 1, & x \in \mathcal{P}, \\ -1, & x \in \mathcal{N}. \end{cases}$

We extend this definition to the class indicator of a set $\mathcal{S}$,

$c(\mathcal{S}) = \begin{cases} 1, & \mathcal{S} \subseteq \mathcal{P}, \\ -1, & \mathcal{S} \subseteq \mathcal{N}, \\ 0, & \text{otherwise}. \end{cases}$

The discriminating function $h(\cdot) : \mathcal{D} \to \mathbb{R}$, also called discriminating surface, decision rule or learning rule, satisfies

$h(x) \begin{cases} > 0, & x \in \mathcal{P}, \\ < 0, & x \in \mathcal{N}. \end{cases}$

Although a discriminating function always exists provided $\mathcal{P} \cap \mathcal{N} = \emptyset$, there are practical issues that prevent practitioners from obtaining one. They rather search for the best separating function, i.e., one that minimizes a measure of the error set $\mathcal{E}$ defined as

(1)   $\mathcal{E} = \{x \in \mathcal{N} : h(x) \ge 0\} \cup \{x \in \mathcal{P} : h(x) \le 0\}.$

On top of this, practitioners must formulate a model simple enough to generate an accurate solution within a reasonable time and allocated budget. Instead of striving for a discriminating function, two less ambitious objectives are pursued. Given a class of functions $\mathcal{F}$ we want:

O1. To find the best separating function in $\mathcal{F}$ for a training set $\mathcal{T} \subseteq \mathcal{D}$.

O2. To accurately predict the class indicator of those individuals in $\mathcal{D} - \mathcal{T}$.

A variety of optimization models to minimize a measure of the error set $\mathcal{E}$ have been suggested. Linear programming optimization models (LPs) are mostly used, because they find a solution in polynomial time, are robust, and may deal with large problems. The simplest strategy is to obtain the best linear separating function (a minimal LP sketch of this strategy is given below). In the late sixties, [26] proposed the multisurface method (MSM), which finds a piecewise linear discriminating function for the training set $\mathcal{T}$. A clear description of MSM is given by [33], and a more updated version of MSM is given by [29]. Least-squares techniques and nonlinear constrained optimization models are used to train feed-forward neural networks (FFNNs), which can generate a piecewise linear discriminating surface for any two disjoint sets with a sufficiently large but unknown number of neurons in the hidden layer [17]. MSM may also be used to train an FFNN via LP models [27, Section 2]. Support vector machines (SVMs), developed at Bell Laboratories [41] and [42], solve quadratic programming (QP) optimization models, mainly with the intention of improving the prediction ability of the solution (objective O2). SVMs use kernels, which allow the search for the best linear separating function in a space whose dimension is bounded by the number of individuals. Primal and dual QP models have been tried for its solution.
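As a concrete illustration of the LP strategies just reviewed (and not of the algorithm proposed in this paper), the following is a minimal sketch of a "best linear separating function" LP in the robust-LP spirit: it minimizes the average violation of $w^\top x + b \ge 1$ on class P and of $w^\top x + b \le -1$ on class N. All function and variable names are ours; NumPy and SciPy's linprog are assumed available.

import numpy as np
from scipy.optimize import linprog

def best_linear_separator(P, N):
    # Find (w, b) minimizing the average slack of w.x + b >= 1 on class P
    # and of w.x + b <= -1 on class N.
    # P: (p, n) array of class-P individuals, N: (q, n) array of class-N individuals.
    p, n = P.shape
    q = N.shape[0]
    # Decision vector: [w (n components), b, y (p slacks for P), z (q slacks for N)]
    c = np.concatenate([np.zeros(n + 1), np.full(p, 1.0 / p), np.full(q, 1.0 / q)])
    # -(w.x_i + b) - y_i <= -1   for every x_i in P
    A_pos = np.hstack([-P, -np.ones((p, 1)), -np.eye(p), np.zeros((p, q))])
    #  (w.x_j + b) - z_j <= -1   for every x_j in N
    A_neg = np.hstack([N, np.ones((q, 1)), np.zeros((q, p)), -np.eye(q)])
    res = linprog(c,
                  A_ub=np.vstack([A_pos, A_neg]),
                  b_ub=-np.ones(p + q),
                  bounds=[(None, None)] * (n + 1) + [(0, None)] * (p + q),
                  method="highs")
    return res.x[:n], res.x[n]

# Toy usage on two Gaussian clouds in the plane.
rng = np.random.default_rng(0)
P = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(30, 2))
N = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(30, 2))
w, b = best_linear_separator(P, N)
print("training errors:", int(np.sum(P @ w + b <= 0) + np.sum(N @ w + b >= 0)))

For linearly separable classes the optimal slacks vanish and the single plane classifies the whole training set; otherwise the misclassified points form the error set (1) of this one-piece discriminant.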
An account of ongoing research on efficient implementations of the SVM QP models is given in [3, Section 5]. The SVM objective is structural risk minimization, and it does not attempt total separation, which can be obtained by our approach. Hence, our algorithm becomes important when this feature is indispensable for the user. Mixed integer linear programming (MILP) models whose solution gives rise to a piecewise separating surface have recently been proposed [4], [21] and [22], but there is no hint as to the number of pieces necessary to obtain a required classification accuracy. A good account of the different optimization models that have been proposed in the open literature to obtain the best separating function is given by Bennett and Parrado-Hernandez [3] and by Smaoui et al. [38].

We should keep in mind that the larger the set of functions $\mathcal{F}$, the better the separating function $h \in \mathcal{F}$ that minimizes the error set (1) on $\mathcal{T}$; but commonly the prediction ability on $\mathcal{D} - \mathcal{T}$ worsens. This phenomenon is called overfitting, and it causes a conflict between O1 and O2. A common approach is to impose a k-fold validation on any model that tries to satisfy the first objective: the given sample is split into k subsamples; one of them is taken as a testing set and the separating function is sought on the training set made up of the remainder of the sample. This is done k times, and the success rates obtained on the testing sets give an estimate of the prediction ability.

This paper presents an iterative algorithm that solves LP models and is able to reach any prescribed classification accuracy on a training set. It offers the following salient properties:

P1. The algorithm generates a piecewise (non)linear discriminating function.

P2. Within a finite number of iterations, a complete or an a priori classification accuracy can always be obtained on the whole training set, or on either of the classification sets.

P3. The optimization model is linear, but may include as many binary variables as the computational resources allow.

P4. The size of the optimization model decreases with the iteration number; it is then plausible to solve MILPs at the final stages of the method.

P5. The algorithm provides a lower bound on the number of errors that will occur in a given testing set.

P6. Large systems can be handled by way of parallelism and/or decomposition of the training set.

P7. An equivalent FFNN is easily constructed, which opens up the possibility of hardware implementations with the use of field programmable gate arrays (FPGAs). See [39] and [34, Chapters 1 and 10].

P8. Finally, only basic programming skills are needed for its implementation.

To summarize, our approach enjoys properties P2, P3 and P4, which circumvent some of the deficiencies encountered in the FFNNs, SVMs and MILPs mentioned earlier. It also exhibits other properties that enhance its usefulness.

Before proceeding any further, let us illustrate our approach with an easy example. Fig. 1 depicts a dotted curve discriminating the sets $\mathcal{P}$ and $\mathcal{N}$. Generally this function is unknown to us, and it is in general a difficult task to find a nonlinear function that discriminates two classes (see [20] and references therein). Let us graphically exemplify how our algorithm obtains a piecewise linear discriminating function. The first iteration of the algorithm finds a (hyper)plane that forces all individuals in $\mathcal{N}$ to be on one side; all members of $\mathcal{P}$ located on the other side are considered well classified (Fig. 2a).
The second iteration works on the reduced set of not-yet-classified individuals (Fig. 2b) and finds a new (hyper)plane; at this iteration, the hyperplane forces all individuals in $\mathcal{P}$ to be on the same side (Fig. 2c). In this example total classification was obtained; otherwise, the iterations proceed until $\mathcal{P}$ or $\mathcal{N}$ is exhausted (a minimal code sketch of this iterative scheme is given at the end of this section). A test is included that may stop the algorithm as soon as a possibility of overfitting is detected.

Fig. 1. The unknown dotted curve strictly discriminates sets $\mathcal{P}$ and $\mathcal{N}$.

Fig. 2. Discriminating function made up of two planes. The second plane discriminates the reduced set.

We now formally state the BCP. Let $\mathcal{D} \subset \mathbb{R}^n$ be a bounded set, and let $\mathcal{P}, \mathcal{N} \subset \mathcal{D}$. We assume that there exists a discriminating function $f(\cdot) : \mathcal{D} \to \mathbb{R}$ such that $f(x) > 0$ for all $x \in \mathcal{P}$ and $f(x) < 0$ for all $x \in \mathcal{N}$. We also assume that $\operatorname{sign}(f(x))$ is either at hand or easily computable for any $x \in \mathcal{T} \subseteq \mathcal{D}$. In other words, we can easily deduce the class indicator $c(x)$ on some subset of $\mathcal{D}$. Let us define the hard-margin $\rho(f)$ as

(2)   $\rho(f) = \min \|x - z\|, \quad x \in \mathcal{P} \cup \mathcal{N}, \; z \in \mathcal{Z} = \{x \in \mathcal{D} : f(x) = 0\},$

where $\|\cdot\|$ is any norm in $\mathbb{R}^n$. Geometrically, we might assume there is a gray zone $\mathcal{Z}_\epsilon = \{x \in \mathcal{D} : |f(x)| \le \epsilon\}$, where $\operatorname{sign}(f)$ is uncertain, either because of noise or by probabilistic estimation. In any event, the bigger the hard-margin, the better the predictive accuracy of an algorithm [33] and [37]. Our ultimate objective is to elaborate on linear programming optimization models that, with the sign information $c(x) = \operatorname{sign}(f(x))$ for $x \in \mathcal{T}$, will find a discriminating function $h : \mathcal{D} \to \mathbb{R}$ with a high $\rho(h)$. In other words, we want the sign functions of $h(\cdot)$ and $f(\cdot)$ to coincide on $\mathcal{T} \subseteq \mathcal{D}$; at the same time we want to preserve an acceptable generalization capability. To simplify the notation we denote $c_k \stackrel{\mathrm{def}}{=} c(x^k)$, and in order to have an implementable algorithm we assume that $\mathcal{T}$ is a finite set, which is a common strategy to deal with the binary classification problem:

Binary classification problem (BCP): Given a sample of $p$ points $(x^1, c_1), \ldots, (x^p, c_p)$ with

(3)   $\mathcal{T} = \left\{ x^k \in \mathcal{D}, \; c_k = \begin{cases} 1 & \text{when } x^k \in \mathcal{P}, \\ -1 & \text{when } x^k \in \mathcal{N}, \end{cases} \; k = 1, \ldots, p \right\},$

find a function $h(x)$ with a high hard-margin $\rho(h)$ and such that $c(x)\,h(x) > 0$ for $x \in \mathcal{T}$.

We remind the reader that we address the BCP within the objectives O1 and O2. The remainder of the paper is organized as follows: the next section describes the shortcomings and advantages of LP optimization models that have been used for solving the BCP and may be viewed as predecessors of our algorithm. Section 3 describes the algorithm and also highlights an equivalent FFNN, which opens up the possibility of a hardware realization of the algorithm. Section 4 addresses some implementation issues that may enhance the algorithm's performance and sketches a possible way to adapt the algorithm to a multiprocessing environment. In Section 5, synthetic and standard datasets are run with the algorithm suggested in this paper, and competitive results are reported. Finally, Section 6 concludes the paper with additional comments and final remarks.
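To make the graphical example of Figs. 1 and 2 concrete, here is a minimal sketch of the iterative scheme as we have paraphrased it above; it is our own simplified reading, not the exact LP model developed in Section 3. At each iteration an LP pushes every remaining individual of one class to one side of a hyperplane, the members of the other class that land strictly on the opposite side are declared classified and removed, and the planes accumulate into a decision list. All names are ours; NumPy and SciPy's linprog are assumed available.

import numpy as np
from scipy.optimize import linprog

def one_sided_plane(keep, push):
    # One iteration: find (w, b) with w.x + b <= -1 for every row of `push`
    # (hard constraints), while minimizing the average slack of w.x + b >= 1
    # on `keep`.  Rows of `keep` ending up with w.x + b >= 1 are captured.
    p, n = keep.shape
    q = push.shape[0]
    # Variables: [w (n components), b, y (p slacks on `keep`)]
    c = np.concatenate([np.zeros(n + 1), np.full(p, 1.0 / p)])
    A_keep = np.hstack([-keep, -np.ones((p, 1)), -np.eye(p)])      # -(w.x + b) - y <= -1
    A_push = np.hstack([push, np.ones((q, 1)), np.zeros((q, p))])  #  (w.x + b)     <= -1
    res = linprog(c,
                  A_ub=np.vstack([A_keep, A_push]),
                  b_ub=-np.ones(p + q),
                  bounds=[(None, None)] * (n + 1) + [(0, None)] * p,
                  method="highs")
    return res.x[:n], res.x[n]

def piecewise_classifier(P, N, max_pieces=20):
    # Greedy sketch of the iterative scheme: alternately push the whole of one
    # class to one side of a plane, remove the captured members of the other
    # class, and store every plane as one piece of the discriminating surface.
    pieces, remaining, side = [], {+1: P.copy(), -1: N.copy()}, +1
    for _ in range(max_pieces):
        keep, push = remaining[side], remaining[-side]
        if len(keep) == 0 or len(push) == 0:
            break                                  # one class exhausted: total classification
        w, b = one_sided_plane(keep, push)
        captured = keep @ w + b >= 1.0
        if np.any(captured):
            pieces.append((w, b, side))
            remaining[side] = keep[~captured]
        side = -side                               # alternate classes, as in Figs. 2a-2c
    return pieces

def classify(x, pieces, default=-1):
    # Evaluate the pieces as a decision list: the first plane that places x on
    # its "captured" side assigns the label of the class it was built for.
    for w, b, label in pieces:
        if x @ w + b >= 0.0:
            return label
    return default

This sketch reproduces only the geometry of Figs. 1 and 2; the stopping test, the a priori accuracy control and the MILP refinements of the actual algorithm (properties P2-P5) are not modeled here.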
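Similarly, for a single linear piece $h(x) = w^\top x + b$ the hard-margin (2) with the Euclidean norm is just the smallest point-to-hyperplane distance over the training points; the small helper below (again our own, purely illustrative) makes definition (2) operational.

import numpy as np

def hard_margin_linear(w, b, X):
    # Hard-margin (2) of h(x) = w.x + b over the finite set X, Euclidean norm:
    # the distance from a point x to the hyperplane {z : w.z + b = 0}
    # equals |w.x + b| / ||w||.
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

# Example: the plane x1 + x2 = 0 and three points; the closest point is (1, 1),
# at distance sqrt(2) from the plane.
pts = np.array([[1.0, 1.0], [2.0, 0.5], [-1.5, -1.0]])
print(hard_margin_linear(np.array([1.0, 1.0]), 0.0, pts))   # ~1.4142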