Linear Programming Support Vector Machines
Article code | Publication year | English paper length |
---|---|---|
25068 | 2002 | 10-page PDF |
Publisher: Elsevier - Science Direct
Journal: Pattern Recognition, Volume 35, Issue 12, December 2002, Pages 2927–2936
English Abstract
Based on an analysis of the conclusions of statistical learning theory, especially the VC dimension of linear functions, linear programming support vector machines (SVMs) are presented, including both linear and nonlinear linear programming SVMs. In linear programming SVMs, in order to improve the training speed, the bound on the VC dimension is loosened appropriately. Simulation results for both artificial and real data show that the generalization performance of our method is a good approximation of that of standard SVMs, while the computational complexity is greatly reduced.
English Introduction
Since the 1970s, Vapnik et al. have devoted themselves to the study of statistical learning theory [1], [2] and [3]. In the early 1990s, a new kind of learning machine, the support vector machine (SVM), was presented based on those theories [2], [4] and [5]. The main subject of statistical learning theory is the model of learning from examples, which can be described as follows: there are l random independent identically distributed examples (x_1, y_1), (x_2, y_2), …, (x_l, y_l) drawn according to an unknown probability distribution F(x, y). Given a set of functions f(x, α), α ∈ Λ (where Λ is a parameter set), the goal of learning from examples is to select the function f(x, α_0) that expresses the relationship between x and y in the best possible way. In general, in order to obtain f(x, α_0), one has to minimize the expected risk functional

R(α) = ∫ L(y, f(x, α)) dF(x, y),   (1.1)

where L(y, f(x, α)) measures the loss between the response y to a given input x and the response f(x, α) provided by the learning machine. Learning problems such as pattern recognition, regression estimation and density estimation may be taken as the same learning model with different loss functions [2]. In this paper, we only deal with the problem of pattern recognition. Consider the following loss function:

L(y, f(x, α)) = 0 if y = f(x, α),  1 if y ≠ f(x, α).   (1.2)

Because the probability distribution F(x, y) in Eq. (1.1) is unknown, the expected risk functional is replaced by the empirical risk functional

Remp(α) = (1/l) Σ_{i=1}^{l} L(y_i, f(x_i, α)).   (1.3)

To assess how well the empirical risk Remp(α) approximates the expected risk, Vapnik presented the following bound theorem [2]: with probability at least 1−η (0 ⩽ η ⩽ 1), the inequality

R(α) ⩽ Remp(α) + (ε/2)(1 + √(1 + 4Remp(α)/ε))   (1.4)

holds true, where ε = 4(h(ln(2l/h) + 1) − ln(η/4))/l and h is the VC dimension of the set of functions f(x, α), α ∈ Λ. From Eq. (1.4), we can see that minimizing the expected risk R(α) amounts to minimizing the two terms on the right-hand side of Eq. (1.4) at the same time. The first term on the right of Eq. (1.4), Remp(α), is minimized by the learning process. The second term varies with the VC dimension h and the number of examples l: the smaller the VC dimension h and the larger the number of examples l, the smaller the value of the second term. In practice, the number of examples is finite, so for a small example set the minimization of the expected risk is implemented by minimizing both the empirical risk and the VC dimension. Generally speaking, a complex target function set, i.e. a large hypothesis space, is required for minimizing the empirical risk, whereas a small hypothesis space is required for minimizing the VC dimension of the target function set. The minimization problem is therefore caught in a dilemma, and the best solution is a compromise between the two. Now restrict the target functions to linear functions. Similar to the set of Δ-margin separating hyperplanes defined in Ref. [3], we define the set of mΔ-margin separating hyperplanes. Let us denote the target function set by f(x, w, b) = (w·x) + b. If these functions classify an example (x_i, y_i) as follows:

y_i = 1 if (w·x_i) + b ⩾ Δ,  y_i = −1 if (w·x_i) + b ⩽ −Δ,   (1.5)

then the set is called the set of mΔ-margin separating hyperplanes, whose margin is

mΔ = 2Δ/||w||2,   (1.6)

where ||·||2 denotes the l2-norm, namely the Euclidean distance. There is a conclusion about the VC dimension of the set of mΔ-margin separating hyperplanes: let the vectors x belong to a sphere of radius R; then the set of mΔ-margin separating hyperplanes has VC dimension h bounded by the inequality

h ⩽ min(⌈4R²/mΔ²⌉, n) + 1,   (1.7)

where n is the dimension of the input space. From Eq. (1.7), we can see that if the VC dimension h of the target function set is smaller than n, then h varies inversely with the squared margin mΔ².
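To make the bound (1.4) concrete, the short sketch below evaluates its right-hand side for hypothetical values of Remp, h, l and η (chosen only for illustration, not taken from the paper) and shows how the confidence term shrinks as the number of examples l grows and grows with the VC dimension h.

```python
import numpy as np

def vc_bound(r_emp, h, l, eta=0.05):
    # Right-hand side of (1.4): Remp + (eps/2)(1 + sqrt(1 + 4*Remp/eps)),
    # with eps = 4*(h*(ln(2l/h) + 1) - ln(eta/4)) / l.
    eps = 4.0 * (h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l
    return r_emp + 0.5 * eps * (1.0 + np.sqrt(1.0 + 4.0 * r_emp / eps))

# Hypothetical empirical risk and VC dimension; the bound tightens as l grows.
for l in (100, 1000, 10000, 100000):
    print(f"l = {l:6d}  bound = {vc_bound(r_emp=0.05, h=10, l=l):.4f}")
```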
In this way, the minimum of the expected risk can be approached by minimizing the empirical risk functional and maximizing the separating margin mΔ, which leads to the structural risk in SVMs introduced by Vapnik:

1/mΔ² + C·Remp(α),   (1.8)

where the constant C > 0 is a parameter chosen by the user and α is a parameter of the target function set. For l random independent identically distributed examples (x_1, y_1), …, (x_l, y_l), the linear SVM for pattern recognition leads to the following optimization problem [4]:

min_{w,b,ξ}  (1/2)(w, w) + C Σ_{i=1}^{l} ξ_i
s.t.  y_i((w, x_i) + b) ⩾ 1 − ξ_i,  ξ_i ⩾ 0,  i = 1, …, l,   (1.9)

where (·,·) denotes the inner product. Minimizing the first term in Eq. (1.9) plays the role of controlling the capacity of the learning machine and avoiding overfitting, while minimizing the second term minimizes the empirical risk. The Wolfe dual programming of Eq. (1.9) is [4]

max_α  Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j y_i y_j (x_i, x_j)   (1.10)
s.t.  Σ_{i=1}^{l} α_i y_i = 0,   (1.11)
      0 ⩽ α_i ⩽ C,  i = 1, …, l.   (1.12)

The decision function has the following form:

f(x) = sgn(Σ_{i=1}^{l} α_i y_i (x_i, x) + b)   (1.13)

and

b = y_j − Σ_{i=1}^{l} α_i y_i (x_i, x_j)  for any j with 0 < α_j < C.   (1.14)

Kernel functions K(x_i, x_j) are introduced into linear SVMs, which leads to nonlinear SVMs [4]:

max_α  Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j y_i y_j K(x_i, x_j)   (1.16)
s.t.  Σ_{i=1}^{l} α_i y_i = 0,   (1.17)
      0 ⩽ α_i ⩽ C,  i = 1, …, l.   (1.18)

The decision function can then be written as

f(x) = sgn(Σ_{i=1}^{l} α_i y_i K(x_i, x) + b).   (1.19)

In a nutshell, the theoretical foundation of SVMs (statistical learning theory) is quite sound, but training an SVM requires the solution of a quadratic programming (QP) optimization problem that is not easy to implement, in particular for large-scale problems. Based on statistical learning theory, we propose linear programming SVMs, which are extremely simple and avoid explicitly solving QP problems.
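As a small illustration of the dual problem (1.16)–(1.18) and the decision function (1.19), the sketch below solves the Wolfe dual for a tiny two-class toy set with a Gaussian kernel using a general-purpose constrained optimizer. The data, the kernel width gamma and the penalty C are placeholder choices, not values from the paper, and a production SVM would use a dedicated QP solver instead.

```python
import numpy as np
from scipy.optimize import minimize

# Toy two-class data (placeholder values, linearly well separated).
X = np.array([[2.0, 2.0], [2.5, 1.5], [1.5, 2.5],
              [-2.0, -2.0], [-1.5, -2.5], [-2.5, -1.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
l, C, gamma = len(y), 10.0, 0.5

def kernel(a, b):
    # Gaussian (RBF) kernel K(a, b) = exp(-gamma * ||a - b||^2).
    return np.exp(-gamma * np.sum((a - b) ** 2))

K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
Q = (y[:, None] * y[None, :]) * K  # Q_ij = y_i y_j K(x_i, x_j)

# Wolfe dual (1.16)-(1.18): maximize sum(alpha) - 0.5 * alpha' Q alpha
# subject to sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C (minimize the negative).
res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
               x0=np.zeros(l),
               method="SLSQP",
               bounds=[(0.0, C)] * l,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

# Bias term as in Eq. (1.14), from a support vector with 0 < alpha_j < C.
j = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
b = y[j] - np.sum(alpha * y * K[:, j])

# Decision function (1.19): f(x) = sgn(sum_i alpha_i y_i K(x_i, x) + b).
def decide(x):
    return np.sign(np.sum(alpha * y * np.array([kernel(xi, x) for xi in X])) + b)

print(decide(np.array([2.0, 1.8])), decide(np.array([-2.0, -1.8])))
```

On a handful of points a generic solver is fine; the cost of forming the kernel matrix and solving the QP as the number of examples grows is exactly what the linear programming variant discussed in this paper aims to reduce.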
English Conclusion
Linear programming SVMs are proposed based on statistical learning theory, in particular the theory of capacity control, the VC dimension and the structural risk. The bound on the VC dimension in LPSVMs is loosened to some extent, which leads to a tolerable loss in generalization performance. Training LPSVMs is simpler than training QPSVMs, especially for large-scale problems. LPSVMs achieve at least an order-of-magnitude improvement in training speed, so losing some generalization performance is worthwhile. We note that the VC dimension of LPSVMs is larger than that of QPSVMs in Section 3.4, yet the generalization error of LPSVMs is smaller than that of QPSVMs in the linear case, which is not strange. As we know, SVMs define a structural risk in terms of the empirical risk and the VC dimension and give a bound on the expected risk. Although the SVM algorithm can guarantee that the structural risk converges to the expected risk as the number of examples increases, the structural risk is not exactly the expected risk. In particular, the bound on the expected risk is not very accurate for a finite number of examples. It is possible that other learning machines achieve better generalization performance than SVMs; for example, the Euclidean distance classifier is optimal for the Gaussian model [15]. We broadly agree with the viewpoint in Ref. [17]: SVMs are a general and very efficient statistical learning model, but not the optimal one. Prior information is helpful in selecting a proper learning machine; in fact, how to introduce prior information into SVMs to obtain good generalization performance is a problem of concern to statistical learning theory researchers. At present, many algorithms for large-scale QPSVM problems have been developed, such as the chunking algorithm [18], Osuna's algorithm [19] and the SMO algorithm [20]. Further studies will aim to find algorithms for large-scale problems in LPSVMs. In addition, the properties (e.g., the geometry) of LPSVMs deserve further research.
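The excerpt above does not reproduce the paper's LP formulation itself, so as a rough sketch of the general idea only, the following trains a generic 1-norm soft-margin SVM with a single call to an off-the-shelf LP solver. The toy data, the penalty C and the 1-norm formulation are assumptions for illustration and may differ from the LPSVM defined in the paper; the point is simply that the whole training step reduces to one linear program.

```python
import numpy as np
from scipy.optimize import linprog

# Toy two-class data (placeholder values).
X = np.array([[2.0, 2.0], [2.5, 1.5], [1.5, 2.5],
              [-2.0, -2.0], [-1.5, -2.5], [-2.5, -1.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
l, n = X.shape
C = 10.0

# Variables z = [w (n), b (1), s (n), xi (l)] with s_j >= |w_j|.
# Objective: minimize sum(s) + C * sum(xi) -- a 1-norm soft-margin SVM as one LP.
c = np.concatenate([np.zeros(n), [0.0], np.ones(n), C * np.ones(l)])

# Margin constraints y_i((w, x_i) + b) >= 1 - xi_i, rewritten as A_ub z <= b_ub.
A_margin = np.hstack([-(y[:, None] * X), -y[:, None], np.zeros((l, n)), -np.eye(l)])
# Absolute-value constraints: w_j - s_j <= 0 and -w_j - s_j <= 0.
A_abs_pos = np.hstack([np.eye(n), np.zeros((n, 1)), -np.eye(n), np.zeros((n, l))])
A_abs_neg = np.hstack([-np.eye(n), np.zeros((n, 1)), -np.eye(n), np.zeros((n, l))])

A_ub = np.vstack([A_margin, A_abs_pos, A_abs_neg])
b_ub = np.concatenate([-np.ones(l), np.zeros(2 * n)])
bounds = [(None, None)] * (n + 1) + [(0, None)] * (n + l)  # w, b free; s, xi >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
w, b = res.x[:n], res.x[n]
print("w =", w, "b =", b, "predictions:", np.sign(X @ w + b))
```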