A decision rule-based method for feature selection in predictive data mining
|Article code||Publication year||English article||Persian translation||Word count|
|22175||2010||8-page PDF||Available to order||5410 words|
Publisher: Elsevier - Science Direct
Journal: Expert Systems with Applications, Volume 37, Issue 1, January 2010, Pages 602–609
Algorithms for feature selection in predictive data mining for classification problems attempt to select those features that are relevant, and not redundant, for the classification task. A relevant feature is defined as one which is highly correlated with the target function, and a redundant feature as one which is highly correlated with other features. One problem with these definitions is that there is no universally accepted definition of what it means for a feature to be ‘highly correlated with the target function or highly correlated with the other features’. A new feature selection algorithm which incorporates domain-specific definitions of high, medium and low correlations is proposed in this paper. The proposed algorithm conducts a heuristic search for the features most relevant to the prediction task.
Algorithms for feature selection in predictive data mining for classification problems attempt to select those features that are relevant, and not redundant, for the classification task. A relevant feature is defined as one which is highly correlated with the target function (Blum & Langley, 1997; Hall, 1999). A redundant feature is defined as one which is highly correlated with other features (Hall, 1999; Ooi et al., 2007). One problem with the definition of feature relevance is that there is no universally accepted definition of what it means for a feature to be ‘highly correlated with the target function or highly correlated with the other features’. Different fields of enquiry use different thresholds for correlation values to distinguish between high and low correlations (Cohen, 1988). The correlation-based feature selection algorithms that have been reported in the literature, such as correlation-based feature selection (CFS) (Hall, 1999) and differential prioritisation (DP) (Ooi et al., 2007), employ heuristic search procedures which use mathematical functions to compute measures of merit that assign high values to relevant feature subsets and low values to non-relevant ones. The problem with these algorithms is that the mathematical functions they use lack flexibility and precision. Since these algorithms do not use precise definitions of high and low correlation, they cannot perform feature selection based on domain meanings of high and low correlations. Secondly, as will be demonstrated in this paper, they can make bad decisions about which features are most relevant. It is desirable to use a feature subset selection algorithm which, first of all, selects features based on precise definitions of high and low correlation and, secondly, never selects pure noise or prefers pure noise over features which have a high or medium correlation to the target function.
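The idea of domain-specific correlation thresholds can be made concrete with a short sketch. The function below classifies the absolute Pearson correlation between a feature and the target as high, medium or low against user-supplied thresholds; the default values follow Cohen's (1988) conventions and are purely illustrative, since the paper leaves the actual thresholds to the domain expert.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) *
               sum((y - my) ** 2 for y in ys))
    return num / den

def correlation_level(feature, target, high=0.5, medium=0.3):
    """Classify |r| as 'high', 'medium' or 'low' using user-supplied
    thresholds. Defaults follow Cohen (1988) and are illustrative;
    the paper's algorithm lets the user choose domain-specific values."""
    r = abs(pearson(feature, target))
    if r >= high:
        return "high"
    if r >= medium:
        return "medium"
    return "low"
```

With thresholds of 0.5 and 0.3, a feature that perfectly tracks the target is classified as 'high', while one with |r| = 0.1 falls into 'low'.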
The required level of precision can be achieved by determining the merit of a feature subset using logic that is implemented as a programmed function. To provide flexibility, the function is parameterised so that the user can supply threshold values that distinguish between high, medium and low correlations. In this paper, an algorithm which does precisely this is proposed. The algorithm incorporates user-specified thresholds as well as decision rules in order to select feature subsets. Experimental results are presented to demonstrate the weaknesses of the CFS and DP feature selection procedures, and to demonstrate how the proposed algorithm eliminates these problems. The rest of the paper is organised as follows. Section 2 provides a review of some of the literature on feature selection. Section 3 describes the proposed feature selection algorithm, based on decision rules. Section 4 describes the experiments that were conducted, as well as the experimental results. Section 5 concludes the paper.
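A minimal sketch of such a decision-rule search is shown below. It is not the paper's exact rule set, only an illustration of the idea under assumed rules and thresholds: candidates are examined in decreasing order of relevance to the target, a feature whose correlation with the target is 'low' is never selected (so pure noise cannot be preferred over relevant features), and a candidate is rejected if it is highly correlated with any feature already in the subset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) *
               sum((y - my) ** 2 for y in ys))
    return num / den

def rule_based_select(features, target, high=0.5, medium=0.3,
                      redundancy=0.5):
    """Greedy decision-rule feature subset search (illustrative sketch,
    not the paper's exact algorithm). `features` maps feature names to
    value lists; thresholds are user-supplied, defaults are assumptions."""
    def level(r):
        return "high" if r >= high else "medium" if r >= medium else "low"

    # Rule: examine candidates in decreasing order of relevance.
    ranked = sorted(features,
                    key=lambda f: abs(pearson(features[f], target)),
                    reverse=True)
    selected = []
    for name in ranked:
        r = abs(pearson(features[name], target))
        if level(r) == "low":
            continue  # rule: never select pure noise
        if any(abs(pearson(features[name], features[s])) >= redundancy
               for s in selected):
            continue  # rule: reject features redundant with the subset
        selected.append(name)
    return selected
```

Given three features, a perfect predictor, an exact copy of it, and a near-noise feature, the sketch keeps only the first: the copy is rejected by the redundancy rule and the noise by the relevance rule.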
English conclusion
This paper has demonstrated the weaknesses of heuristic feature subset selection based on a heuristic measure implemented as a mathematical function. It has further been demonstrated that the use of a decision-rule-based heuristic search can eliminate all irrelevant and redundant features, based on domain-specific definitions of high, medium and low correlation. The predictive classification results have shown that when classifiers are constructed using the features selected by the decision-rule-based search procedure, the predictive performance obtained is comparable to that obtained when all features are used and when pure ranking is used. The main conclusion of this paper is that decision-rule-based feature selection enables one to incorporate domain-specific definitions of feature relevance into the feature selection process. When a feature subset search algorithm uses decision rules to guide the search, it makes better decisions than when mathematical functions are used, leading to the selection of features that provide a high level of predictive classification performance.