Improving the efficiency of a robust mixed integer linear programming method for the multi-class classification problem
Article code | Publication year | English PDF pages
---|---|---
25440 | 2013 | 5
Publisher : Elsevier - Science Direct
Journal : Computers & Industrial Engineering, Volume 66, Issue 2, October 2013, Pages 383–388
Abstract
Data classification is one of the fundamental issues in data mining and machine learning. A great deal of effort has been devoted to reducing the time required to learn a classification model. In this research, a new model and algorithm are proposed to improve the work of Xu and Papageorgiou (2009). Computational comparisons on real and simulated patterns with different characteristics (including dimension, high overlap, or heterogeneity in the attributes) confirm that the improved method considerably reduces the training time compared to the primary model while generally maintaining its accuracy. This speed-up is particularly significant in the case of high overlap. In addition, as the set size or the number of overlapping samples increases, the training time of the proposed model grows much more slowly than that of the primary model.
Introduction
Data classification is generally known as a method for constructing a set of classification rules (a classifier), based on data attributes, in order to predict the class membership of unknown data. This task is accomplished in two phases. In the training phase, a set of data with known membership (the training dataset) is used to estimate the parameters of the boundaries between classes. Once these boundaries are determined, in the second phase new data (test data) not used for training are assigned to classes according to the boundaries obtained, in order to evaluate the accuracy of the model. The latter phase is called the testing procedure. For instance, a dataset can be a set of vectors representing chemical attributes of patients' blood, and each class can be a disease. Extensive applications of data classification in different fields of engineering, medicine, industry, and finance have resulted in the development of a variety of methods. Decision tree induction (Safavian & Landgrebe, 1991), neural networks (Hertz, Palmer, & Krogh, 1991), Bayesian networks (Russell & Norvig, 2010), and genetic algorithms are some examples. A simple classifier constructed by a linear programming (LP) formulation can take the form of a hyper-plane separating two sets (Bennett & Mangasarian, 1992a; Freed & Glover, 1986; Glover, 1990; Glover et al., 1988; Lam et al., 1996; Lam & Moy, 2002). Erenguc and Koehler (1990) studied different types of LP models according to their objective functions. In addition, Koehler (1990) studied the problems that may appear in formulating an LP model, such as choosing an objective function, unbounded or unacceptable solutions, side constraints, and data translation and transformation. Surveys of mathematical programming models can be found in Pai (2009) and Lee and Wu (2009). Since hyper-planes are not sufficiently accurate in many real cases, piecewise linear classifiers (as approximations of nonlinear boundaries) have been studied extensively.
The Multi Surface Method (MSM) (Mangasarian, 1968) is a piecewise LP-based classifier that forms a hyper-plane at each stage until all data points are completely separated. Ryoo (2006) proposed a Piecewise Linear and Convex (PLC) discriminant function and used it to develop a mixed integer linear programming (MILP) model, comparing the PLC method with two common methods including MSM. Results show that MSM constructs boundaries more complex than the ideal shape when used on datasets with high overlap. On the other hand, PLC is appropriate for data with convex boundaries and may fail to perform accurately in other cases. Astorino and Gaudioso (2002) introduced a piecewise linear function that constructs multiple hyper-planes forming a convex polyhedron; compared with other linear methods, this model is computationally expensive. All the above models address the two-class classification problem. There are fewer mathematical programming models for the multi-class case. Bennett and Mangasarian (1992b) proposed a piecewise linear classifier that extends their earlier two-class model. Bal and Orkcu (2011) proposed an LP model that combines three LP models due to Gehrlein (1986), Gochet et al. (1995), and Sueyoshi (2006). Freed and Glover (1981) presented a linear programming model that assigns intervals to classes; in this model, the data are classified into intervals according to their discriminant scores. A drawback of this method is that an appropriate order for the classes must be determined in advance; otherwise, it yields low accuracy. Gehrlein (1986) suggested an MILP method that overcomes the sorting problem by introducing binary variables for each pair of classes.
If the intervals in Gehrlein's model are defined for each direction (attribute) separately, and data coordinates in R^m are used instead of discriminant scores in R, then hyper-boxes are produced instead of a single hyper-plane. Uney and Turkay (2006) presented a new data classification method based on the use of hyper-boxes for determining the class boundaries. They developed an MILP model by converting the relationships among discrete variables into equivalent integer constraints using Boolean algebra, and evaluated the performance of their method on the Iris dataset (Fisher, 1936). Xu and Papageorgiou (2009) formulated an MILP model based on a similar concept, enclosing the training data in hyper-boxes, and suggested an algorithm that applies the MILP model iteratively in order to improve the accuracy. Kone and Karwan (2011) extended this model for predicting cost-to-serve (CTS) values of new customers in the industrial gas business. Although hyper-planes are efficient in classifying data into two sets, they can be inaccurate and inefficient when applied to multi-class problems (Uney & Turkay, 2006). Fig. 1 shows a schematic representation of classifying three datasets using hyper-planes and hyper-boxes. As can be seen in Fig. 1, hyper-boxes are more flexible than rigid hyper-planes for estimating the ideal class boundaries of a pattern, particularly when classifying multiple classes. For instance, if the square or circle data points are omitted from the pattern in Fig. 1, the other two sets can be completely separated by a single hyper-plane, but points are misclassified when hyper-planes are used to separate all three sets. In addition, the boundaries constructed using hyper-boxes are closer to the actual class boundaries than the borders obtained on the same dataset using hyper-planes. However, integer linear programming (ILP) problems are in general NP-complete (Schrijver, 1998).
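The contrast between the two classifier families can be made concrete with a toy membership test. The following sketch (all bounds and coefficients are made-up illustrative values, not from the paper) shows that a hyper-box is an axis-aligned interval test on every attribute, while a hyper-plane is a single signed linear score:

```python
# Illustrative sketch: axis-aligned hyper-box membership versus the
# side of a hyper-plane. All numeric values below are arbitrary.

def in_hyper_box(x, lower, upper):
    """True if point x lies inside the axis-aligned box [lower, upper]."""
    return all(lo <= xj <= up for xj, lo, up in zip(x, lower, upper))

def hyper_plane_side(x, w, b):
    """Sign of w.x + b: which side of the hyper-plane x falls on."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

point = [2.0, 3.0]
print(in_hyper_box(point, lower=[1.0, 2.5], upper=[3.0, 4.0]))  # True
print(hyper_plane_side(point, w=[1.0, -1.0], b=0.0))            # -1
```

A single hyper-plane can only split the space into two half-spaces, whereas several disjoint hyper-boxes can carve out one bounded region per class, which is what makes the box representation attractive for multi-class patterns.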
Consequently, large-scale ILP problems are impractical for real-world applications, as they are extremely time-consuming.

Fig. 1. Linear classifiers: hyper-boxes have more flexibility for estimating the ideal class boundaries of a pattern in comparison to rigid hyper-planes.

This research focuses on Xu and Papageorgiou's MILP-based method, which can be inefficient when the size of the training set or the overlapping area increases. Modifications are suggested to improve the efficiency of their model and algorithm. In the next section, the MILP model and the related iterative algorithm suggested by Xu and Papageorgiou (2009) are briefly stated. The new approach is proposed in Section 3. Numerical computations and comparisons of the two methods are presented in Section 4. Section 5 concludes the discussion.

Notation is as follows. Consider a classification problem with G classes and n samples (i = 1, ..., n); the class membership of the samples is assumed to be known. Each sample is a vector in R^m, where m is the number of attributes. The parameter a_ij represents the value of sample i on attribute j (j = 1, ..., m).

2. Xu and Papageorgiou's MILP model

In this section, the model of Xu and Papageorgiou (2009) for the multi-class classification problem is summarized. The training process is performed in two stages. In the first stage, an MILP model produces boundaries that form one hyper-box for each class. A hyper-box r is characterized by its central coordinate B_rj and length LE_rj on each attribute j (j = 1, ..., m) and is assigned to one of the classes 1, ..., G. The index pair ri exists if sample i is assigned to hyper-box r. The objective is to minimize the total number of misclassifications by counting the non-zero variables E_i, where the binary variable E_i equals 1 if sample i is included in its corresponding hyper-box and 0 otherwise.
In addition, the binary variable Y_rsj is introduced to prevent boxes with different class labels from overlapping. Y_rsj is 0 if boxes r and s do not overlap on attribute j; otherwise it equals 1. U and ε are, respectively, large and small positive constants with arbitrary values, and the parameter A is the initial number of hyper-boxes. The complete MILP model for the Multi-Class data classification Problem (MCP) is formulated as follows:

$$\min \sum_{i=1}^{n} (1 - E_i)$$

subject to:

$$a_{ij} \ge B_{rj} - \frac{LE_{rj}}{2} - U(1 - E_i) \quad \forall i, ri, j \qquad (1)$$

$$a_{ij} \le B_{rj} + \frac{LE_{rj}}{2} + U(1 - E_i) \quad \forall i, ri, j \qquad (2)$$

$$B_{rj} - B_{sj} + U \cdot Y_{rsj} \ge \frac{LE_{rj} + LE_{sj}}{2} + \varepsilon \quad \forall j,\ \forall r, s = 1, \ldots, A;\ s \ne r \qquad (3)$$

$$\sum_{j=1}^{m} (Y_{rsj} + Y_{srj}) \le 2m - 1 \quad \forall r = 1, \ldots, A - 1;\ s = r + 1, \ldots, A \qquad (4)$$

$$E_i, Y_{rsj} \in \{0, 1\}; \quad LE_{rj} \ge 0; \quad B_{rj}\ \text{unrestricted in sign}$$

Note that constraints (1) and (2) are generated for hyper-box r and sample i only if the corresponding ri exists. A single B_rj and LE_rj is created on each attribute j for all the samples assigned to hyper-box r. Constraints (3) and (4) (the non-overlapping constraints) prevent each pair of boxes with different class labels from sharing the same region in all dimensions simultaneously. The model is interpreted in detail by Xu and Papageorgiou (2009). In order to improve the accuracy of the MCP model, an iterative algorithm is introduced in the second stage of the training process: multiple boxes are constructed for each class, and the number and position of these boxes are optimized during this stage. The steps of their proposed algorithm are as follows:
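The role of the big-M constant U in the enclosure constraints (1)-(2) can be checked numerically. In this sketch (toy values, not the paper's data or solver), a sample a_ij satisfies both constraints with E_i = 1 exactly when it lies inside the box [B - LE/2, B + LE/2]; setting E_i = 0 relaxes both sides by U so the constraints hold for any value:

```python
# Illustrative check of constraints (1)-(2) for a single sample/attribute
# pair. Toy values for a, B, LE; U is an arbitrary large constant.

def enclosure_ok(a, B, LE, E, U=1000.0):
    """Constraints (1)-(2): B - LE/2 - U(1-E) <= a <= B + LE/2 + U(1-E)."""
    lower = B - LE / 2 - U * (1 - E)
    upper = B + LE / 2 + U * (1 - E)
    return lower <= a <= upper

print(enclosure_ok(a=3.0, B=2.0, LE=3.0, E=1))  # True: 0.5 <= 3 <= 3.5
print(enclosure_ok(a=9.0, B=2.0, LE=3.0, E=1))  # False: outside the box
print(enclosure_ok(a=9.0, B=2.0, LE=3.0, E=0))  # True: big-M relaxation
```

This is why minimizing the number of samples with E_i = 0 in the objective is equivalent to minimizing the number of samples left outside their class box, i.e. the misclassifications.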
Conclusion
In this research, a new model and algorithm are proposed to improve the work of Xu and Papageorgiou (2009), in which an MILP-based model was introduced for solving the multi-class data classification problem. In Xu and Papageorgiou's model, the training process is performed in two stages. In the first stage, the MILP model (named MCP) produces boundaries that form one hyper-box for each class; the objective is to minimize the total number of misclassifications. In order to improve the accuracy of the model, an iterative algorithm is introduced in the second stage of the training process, which allows multiple boxes to be produced for each class and optimizes the number and position of these boxes. The iterative procedure continues until no better solution can be achieved by adding another hyper-box. In each iteration, a new MILP model is solved, which can result in high computational time, and therefore low efficiency, when the algorithm is applied to large datasets. The improved model proposed in this paper considerably reduces the training time of the original model. The simple idea behind the suggested algorithm is to use the boundaries obtained from previous iterations and eliminate the correctly classified points from the training set in each iteration, instead of using the whole dataset for calculating new boundaries. Therefore, the enclosing constraints corresponding to these points are not produced, and the number of binary variables decreases step by step. In addition, it can be shown that the number of non-overlapping constraints (and thus the corresponding binary variables) in MCP is always greater than in the suggested new model (named MCPm): this number grows quadratically in MCP but only linearly in MCPm with the maximum number of same-label boxes.
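The point-elimination idea can be sketched schematically. In the real algorithm each iteration solves an MILP; here a hypothetical placeholder fit_box() stands in for that solve (it simply boxes the cluster around the first remaining point, with an arbitrary toy radius), so the sketch only demonstrates the bookkeeping: every point enclosed by the new box is dropped before the next iteration, shrinking the subsequent models:

```python
# Schematic sketch of iterative point elimination (assumed names; the
# paper's method solves an MILP where fit_box is called here).

def fit_box(points, radius=1.5):
    """Placeholder for one model solve: tightest box around the cluster
    of points near the first remaining point (radius is a toy choice)."""
    seed = points[0]
    cluster = [p for p in points
               if max(abs(a - b) for a, b in zip(p, seed)) <= radius]
    m = len(seed)
    return ([min(p[j] for p in cluster) for j in range(m)],
            [max(p[j] for p in cluster) for j in range(m)])

def inside(x, box):
    lower, upper = box
    return all(lo <= xj <= up for xj, lo, up in zip(x, lower, upper))

def train(points):
    boxes, remaining = [], list(points)
    while remaining:
        box = fit_box(remaining)
        boxes.append(box)
        # Eliminate enclosed points: their enclosing constraints (and
        # binary variables) are never generated in later iterations.
        remaining = [p for p in remaining if not inside(p, box)]
    return boxes

print(len(train([[0, 0], [1, 1], [5, 5]])))  # 2: two clusters, two boxes
```

Each pass works on a strictly smaller point set, which is the source of the training-time reduction reported for MCPm.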
The performance of the suggested algorithm is compared with the MCP algorithm on five real and two artificial datasets used by Xu and Papageorgiou (2009). Results show that the suggested algorithm is always faster than the original, while its accuracy is generally maintained. According to Tables 2 and 3, the training time of the MCPm algorithm increases much more slowly than that of MCP as the size of the training set or the number of overlapping samples rises; in particular, the speed-up of MCPm can be substantial on overlapping regions. In future work, the author would like to compare the training time versus accuracy of the MCPm method with other standard classification functions on both multi-class and binary problems, particularly on real datasets. Extending the MCPm algorithm to handle data samples with disjoint class regions is also an interesting direction for future study. More generally, the presence of binary variables makes ILP-based models time-consuming. As mentioned earlier, the hyper-box enclosure method provides the flexibility needed to approximate boundaries accurately; hence, designing hyper-box algorithms based on optimization techniques other than ILP also opens a new area for research.