انتخاب ویژگی برای ماشین آلات بردار پشتیبانی از طریق مخلوط برنامه ریزی انتگرال خطی
کد مقاله | سال انتشار | تعداد صفحات مقاله انگلیسی |
---|---|---|
25507 | 2014 | 13 صفحه PDF |
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Information Sciences, Volume 279, 20 September 2014, Pages 163–175
چکیده انگلیسی
The performance of classification methods, such as Support Vector Machines, depends heavily on the proper choice of the feature set used to construct the classifier. Feature selection is an NP-hard problem that has been studied extensively in the literature. Most strategies propose the elimination of features independently of classifier construction by exploiting statistical properties of each of the variables, or via greedy search. All such strategies are heuristic by nature. In this work we propose two different Mixed Integer Linear Programming formulations based on extensions of Support Vector Machines to overcome these shortcomings. The proposed approaches perform variable selection simultaneously with classifier construction using optimization models. We ran experiments on real-world benchmark datasets, comparing our approaches with well-known feature selection techniques and obtained better predictions with consistently fewer relevant features.
مقدمه انگلیسی
Support Vector Machines (SVMs) has been shown to be a very powerful machine learning method. For classification tasks and based on the structural risk minimization principle [18], this method attempts to find the separating hyperplane which has the largest distance from the nearest training data points of any class. SVM provides several advantages such as adequate generalization to new objects, a flexible non-linear decision boundary, absence of local minima, and a representation that depends on only a few parameters [18] and [21]. Feature selection is one of the most important steps within classification. An appropriate selection of the most relevant features reduces the risk of overfitting, thus improving model generalization by decreasing the model’s complexity [7]. This is particularly important in small-sized high-dimensional datasets, where the curse of dimensionality is present and a significant gain in terms of performance can be achieved with a small subset of features [9] and [13]. Additionally, low-dimensional representation allows better interpretation of the resulting classifier. This is particularly important in applications in fields such as business analytics, since many machine learning approaches are considered to be black boxes by practitioners who therefore tend to be hesitant to use these techniques [6]. A better understanding of the process that generates the data by identifying the most relevant features is also of crucial importance in life sciences, where we want to identify, for example, those genes that best explain the presence of a particular type of cancer, and therefore could improve cancer incidence prediction. Since the selection of the best feature subset is considered to be an NP-hard combinatorial problem, many heuristic approaches for feature selection have been presented to date [7]. With the two Mixed Integer Linear Programs for simultaneous feature selection and classification that we introduce in this paper we show that integer programming has become a competitive approach using state-of-the-art hardware and solvers. In particular, we propose two novel SVM-based formulations for embedded feature selection, which simultaneously select relevant features during classifier construction by introducing indicator variables and constraining their selection via a budget constraint. The first approach studies an adaptation of the l1l1-SVM formulation [4], while the second one extends the LP-SVM method presented in [22]. Our experiments show that the proposed methods are capable of selecting a few relevant features in all datasets used, leading to highly accurate classifiers within reasonable computational time. Section 2 of this paper introduces Support Vector Machines for binary classification, including recent developments for feature selection using SVMs. The proposed feature selection approaches are presented in Section 3. Section 4 provides experimental results using real-world datasets. A summary of this paper can be found in Section 5, where we provide its main conclusions and address future developments.
نتیجه گیری انگلیسی
In this work we presented two embedded approaches for simultaneous feature selection and classification based on Mixed Integer Linear Programming and Support Vector Machines. The main idea is to perform attribute selection by introducing binary variables, obtaining a low-dimensional SVM classifier. Two different SVM-based linear programming formulations, namely l1l1-SVM and LP-SVM, were adapted to Mixed-Integer Programming formulations. A comparison with other feature selection approaches for SVM in low- and high-dimensional datasets showed the advantages of the proposed methods: • They allow the construction of a classifier for a desired number of attributes without the need of two-step methodologies that perform feature selection and classification independently. • They provide better predictive performance compared to feature ranking techniques, based on their ability to discard irrelevant attributes by limiting the number of features in the model via a budget constraint. • Our approaches find the optimal subset of features in one run, given a predefined number of attributes (the budget parameter), while sequential approaches require multiple runs and successive training steps to reach a desired number of features. • They determine an optimal solution for the feature selection problem in reasonable running times given a predefined number of features. From the experimental section of this work, several conclusions can be drawn. Predictive performance (in terms of AUC) can be improved with fewer variables, demonstrating the relevance of feature selection. In our experiments, in all seven datasets a gain in terms of performance was achieved using feature selection, or at least performance was maintained. By contrast, for microarray data, and in particular for the Colorectal Microarray, our approaches led to a significant improvement in terms of performance (a gain of almost 7% in terms of AUC using MILP1). In general, our models performed consistently better than alternative feature selection approaches. Additionally, the results of the proposed models proved to be robust and stable for different values of the parameters used for calibration. Finally, the algorithms’ running times are adequate for most machine learning tasks, such as classification of microarray data. There are several opportunities for future work. First, the extension of the proposed methods to kernel approaches may lead to better performance thanks to the ability of constructing non-linear classifiers, while selecting the relevant attributes in the original space. The main challenge is to incorporate binary variables associated with the weight vector into a kernel-based formulation. Secondly, although in this work all attributes are treated equally, the proposed approach has the potential of incorporating different costs of different features in the budget constraint. Credit scoring, fraud detection, and churn prediction are some interesting application areas where the acquisition costs of each attribute may differ, and those costs can be estimated in order to construct a classifier that constrains or minimizes acquisition costs while classifying adequately.