Incorporating domain knowledge into data mining classifiers: An application in indirect lending
Article code | Publication year | Length of English article |
---|---|---|
22111 | 2008 | 13 pages (PDF) |
Publisher: Elsevier - ScienceDirect
Journal: Decision Support Systems, Volume 46, Issue 1, December 2008, Pages 287–299
Abstract
Data mining techniques have been applied to solve classification problems for a variety of applications such as credit scoring, bankruptcy prediction, insurance underwriting, and management fraud detection. In many of those application domains, there exist human experts whose knowledge could have a bearing on the effectiveness of the classification decision. The lack of research in combining data mining techniques with domain knowledge has prompted researchers to identify the fusion of data mining and knowledge-based expert systems as an important future direction. In this paper, we compare the performance of seven data mining classification methods—naive Bayes, logistic regression, decision tree, decision table, neural network, k-nearest neighbor, and support vector machine—with and without incorporating domain knowledge. The application we focus on is in the domain of indirect bank lending. An expert system capturing a lending expert's knowledge of rating a borrower's credit is used in combination with data mining to study if the incorporation of domain knowledge improves classification performance. We use two performance measures: misclassification cost and AUC (area under the curve). A 2 × 7 factorial, repeated-measures ANOVA, with the two factors being domain knowledge (present or absent) and data mining method (seven methods), as well as a special statistical test for comparing AUCs, is used for analyzing the results. Analysis of the results reveals that incorporation of domain knowledge significantly improves classification performance with respect to both misclassification cost and AUC. There is interaction between classification method and domain knowledge. Incorporation of domain knowledge has a higher influence on performance for some methods than for others. Both measures—misclassification cost and AUC—yield similar results, indicating that the findings of the study are robust.
Introduction
Data mining techniques have been applied to solve classification problems for a variety of applications, including credit scoring, bankruptcy prediction, insurance underwriting, and management fraud detection. These techniques automatically induce prediction models, called classifiers, based on historical data about previously solved problem cases. The classifiers can then be applied to recommend solutions to new problem cases.

In many of the application domains that have been studied by data mining researchers, there exist human experts who have developed their expertise through years of experience in solving problems in those domains. An expert's knowledge tends to be heuristic in nature. Because experts often find it difficult to articulate the heuristics or rules of thumb that they use to efficiently solve a problem, acquiring their expertise is usually a difficult and challenging task. This phenomenon is commonly referred to as the knowledge acquisition bottleneck [20]. A major benefit of using a data mining technique is that it bypasses the knowledge acquisition bottleneck. By unearthing the patterns or knowledge from the data itself, data mining methods obviate the need for eliciting knowledge from a human expert. Clearly, data mining lends itself naturally to domains in which there is a dearth of human expertise or in which domain knowledge cannot be easily formalized.

However, there are domains that have large bodies of domain knowledge encapsulated in the form of human expertise. Also, in some of those domains, there exist large volumes of data. But very little research has been conducted to examine if domain knowledge can be incorporated into data mining for better performance. As Dybowski et al. [10] (p. 293) stated: “At present, knowledge engineering and machine learning remain largely separate disciplines, yet, in many fields of endeavor, substantial human expertise exists alongside data archives. When both data and domain knowledge are available, how can these two resources effectively be combined to construct decision support systems?”

In this paper, we address the question by examining if such a fusion of domain knowledge and data could improve classifier performance in the domain of indirect bank lending. We compare the performance of seven data mining classification methods (naive Bayes, logistic regression, decision tree, decision table, neural network, k-nearest neighbor, and support vector machine) with and without incorporating domain knowledge. An expert system capturing a lending expert's knowledge of rating a borrower's credit is used in combination with data mining to study whether the incorporation of domain knowledge improves classification performance. We use two measures, misclassification cost and AUC (area under the curve), for evaluating classifier performance.

Our study makes an important contribution to existing research in data mining by empirically investigating whether domain expertise improves performance of classifiers built using different methods. Also, instead of using classification accuracy or error rate as the sole performance measure, as has been the norm in fusion research, we evaluate the classifiers with respect to misclassification cost under a range of cost ratios, and with respect to AUC, an aggregate measure.

The paper is organized as follows. Section 2 reviews the related work in the area. Section 3 describes the domain knowledge and Section 4 describes the seven classification methods used in this study. Section 5 presents the theoretical framework and research questions. Section 6 describes the research design and methodology. Section 7 presents the results and Section 8 provides a discussion of the results and their implications. Section 9 summarizes the contributions of this study and outlines directions for future research.
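
To make the two performance measures named in the introduction concrete, the following Python sketch computes a rank-based AUC and an average misclassification cost swept over several cost ratios. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of label 1 as the costlier class to miss, the 0.5 decision threshold, the toy scores, and the cost ratios are all hypothetical, since this excerpt does not give the paper's cost definitions.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: the fraction of (positive, negative) pairs in which
    the positive case gets the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def misclassification_cost(predicted: Sequence[int], labels: Sequence[int],
                           cost_fn: float, cost_fp: float = 1.0) -> float:
    """Average cost per case. Assumption: label 1 is the class that is
    costlier to miss, so cost_fn / cost_fp plays the role of the cost ratio."""
    fn = sum(1 for p, y in zip(predicted, labels) if y == 1 and p == 0)
    fp = sum(1 for p, y in zip(predicted, labels) if y == 0 and p == 1)
    return (cost_fn * fn + cost_fp * fp) / len(labels)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    labels = [1, 1, 0, 0, 1, 0]
    scores = [0.9, 0.4, 0.35, 0.1, 0.8, 0.5]
    predicted = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical threshold

    print("AUC:", round(auc(scores, labels), 3))
    for ratio in (1, 2, 5, 10):  # hypothetical cost ratios
        cost = misclassification_cost(predicted, labels, cost_fn=ratio)
        print(f"cost ratio {ratio}:1 -> average cost {cost:.3f}")
```

Unlike the misclassification cost, the AUC does not depend on a threshold or a cost ratio, which is why it can serve as an aggregate measure alongside the cost sweep.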
Conclusion
9. Conclusion and future directions

The primary objective of the study was to examine the role of domain knowledge in classifier learning. To that end, we incorporated knowledge of credit rating into the learning process and investigated whether that resulted in better classifier performance. The results of this study underscore the synergy between domain knowledge and data mining, especially in situations where knowledge and data are both limited.

Apportioning responsibilities between knowledge and data mining is an important issue for effective fusion to take place. When knowledge such as credit rating is readily available from experts or other sources, it makes sense to incorporate that knowledge into the decision process. On the other hand, it is relatively more difficult to capture the knowledge of the loan approval decision, which involves a certain amount of subjectivity. For example, if the debt to income ratio is much lower than the cutoff (say, 36%), a loan officer could use that to compensate for other attributes whose values are not that promising. Acquiring the knowledge of all such nuances and shortcuts is an arduous and time-consuming task, lending itself well to a data mining solution. But if an expert system for loan approval is available, a future direction would be to compare its performance with that of the fusion method presented in this paper. Another possibility is to examine if cascading the data mining methods could lead to comparable performance results.

As pointed out earlier in the paper, the knowledge engineering and data mining fields have remained largely independent, with little effort devoted to the issue of fusion of the two. Our research builds upon some of the prior work in that area. It is more comprehensive than the earlier studies by including seven of the most popular data mining methods and two performance measures. While many of those studies have used accuracy or error rate as the sole performance criterion (e.g., [2] and [19]), we assess performance using misclassification cost, which is a more appropriate measure to use in cost-sensitive problems such as the one in this study. We also used the AUC measure for assessing classifier performance across the spectrum of possible costs.

Note that Pazzani et al. [35] found that adding domain knowledge resulted in significant cost reduction only when larger data sets were used. They used a relational learning algorithm in which the explanation-based part uses the knowledge base of an expert system and the inductive part adds, deletes, and revises existing rules in that knowledge base. In contrast, we did not change any of the internal parameters of the expert system, but used its output as an additional input to the classifiers. Also, note that we evaluated the effects of domain knowledge on the performance of seven widely used data mining methods.

The main contribution of our research is in demonstrating that domain expertise captured in the form of a partial knowledge base can significantly improve the performance of a wide variety of classifiers on relatively small data sets. Instead of focusing on one or two learning methods, as has been the norm, we evaluated seven of the most widely used, commercially available methods, and found that the results hold across almost the entire spectrum. Another important contribution of this study is the finding that the incorporation of domain knowledge affects different classifiers to different degrees, something that has not been empirically examined in prior research.

Our study opens up several avenues for future research. More studies are needed in different domains to understand what other types of knowledge could influence performance. For instance, future studies could explore the effects of theory-based feature construction and subsequent refinement of that knowledge through data mining. We focused on a binary classification task in this study, but future studies could examine if the results extend to regression or value prediction tasks such as sales forecasting.

Bayesian belief networks have been successfully applied in several problems. Because the knowledge of the dependency structure of such networks is not usually available beforehand, prior research has primarily focused on deriving the structure from available data. However, if knowledgeable experts exist in the domain, their knowledge could be used to define the structure. For example, the domain knowledge in our study could be employed to develop a partial structure for a Bayesian network (e.g., [32]), which could then be refined based on the available data. Comparing the performance of such a network with a network that is developed fully using data is an interesting future direction.
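
The fusion approach described in the conclusion, in which the expert system is left unchanged and its output is supplied to the classifiers as an additional input, can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the rating rules, the attribute names (only the debt to income ratio is mentioned in the text), the 0 to 2 rating scale, and the use of a small k-nearest-neighbor classifier as the learner. The excerpt does not describe the actual expert system or the lending data.

```python
from math import dist
from typing import Dict, List, Sequence


def expert_credit_rating(applicant: Dict[str, float]) -> int:
    """Hypothetical stand-in for the expert system: returns a rating of
    0 (poor), 1 (fair), or 2 (good). The rules are invented for this sketch."""
    points = 0
    if applicant["years_of_credit_history"] >= 5:
        points += 1
    if applicant["late_payments"] == 0:
        points += 1
    return points


def to_features(applicant: Dict[str, float], use_knowledge: bool) -> List[float]:
    """Build the classifier input. With use_knowledge=True the expert
    system's rating is appended as one extra feature; the expert system
    itself is not modified."""
    features = [applicant["debt_to_income"], applicant["loan_to_value"]]
    if use_knowledge:
        features.append(float(expert_credit_rating(applicant)))
    return features


def knn_predict(train_x: Sequence[Sequence[float]], train_y: Sequence[int],
                x: Sequence[float], k: int = 3) -> int:
    """Plain k-nearest-neighbor majority vote over 0/1 labels.
    Features are used unscaled here, which a real study would revisit."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


if __name__ == "__main__":
    applicant = {"debt_to_income": 0.30, "loan_to_value": 0.80,
                 "years_of_credit_history": 7, "late_payments": 0}
    print("without knowledge:", to_features(applicant, use_knowledge=False))
    print("with knowledge:   ", to_features(applicant, use_knowledge=True))
```

Because the knowledge enters only as an extra input column, the same augmented feature vector could be passed to any of the seven classification methods, which is what makes a with-versus-without comparison across methods possible.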