کشف مفهوم بر اساس ILP در داده کاوی چند رابطه ای
کد مقاله | سال انتشار | تعداد صفحات مقاله انگلیسی |
---|---|---|
22171 | 2009 | 11 صفحه PDF |
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Expert Systems with Applications, Volume 36, Issue 9, November 2009, Pages 11418–11428
چکیده انگلیسی
Multi-relational data mining has become popular due to the limitations of propositional problem definition in structured domains and the tendency of storing data in relational databases. Several relational knowledge discovery systems have been developed employing various search strategies, heuristics, language pattern limitations and hypothesis evaluation criteria, in order to cope with intractably large search space and to be able to generate high-quality patterns. In this work, an ILP-based concept discovery method, namely Confidence-based Concept Discovery (C2D), is described in which strong declarative biases and user-defined specifications are relaxed. Moreover, this new method directly works on relational databases. In addition to this, a new confidence-based pruning is used in this technique. We also describe how to define and use aggregate predicates as background knowledge in the proposed method. In order to use aggregate predicates, we show how to handle numerical attributes by using comparison operators on them. Finally, we analyze the effect of incorporating unrelated facts for generating transitive rules on the proposed method. A set of experiments are conducted on real-world problems to test the performance of the proposed method.
مقدمه انگلیسی
Due to the impracticality of single-table data representation, relational databases are needed to store complex data for real life applications. This has led to development of multi-relational learning systems that are directly applied to relational data (Domingos, 2003 and Džeroski, 2003). Relational upgrades of data mining and concept learning systems generally employ first-order predicate logic as representation language for background knowledge and data structures. The learning systems, which induce logical patterns valid for given background knowledge, have been investigated under a research area, called Inductive Logic Programming (ILP) (Muggleton, 1999). Using logic in data mining is a common technique in the literature (Assche et al., 2006, Dehaspe and Raedt, 1997, Frank et al., 2007, Knobbe et al., 2002, Lee et al., 2008, Leiva, 2002, Muggleton, 1995, Neville et al., 2003, Perlich and Provost, 2003, Quinlan, 1990, Srinivasan, 1999, Toroslu and Yetisgen-Yildiz, 2005 and Yin et al., 2004). In this work, we propose a predictive2 concept learning ILP system, namely Confidence-based Concept Discovery (C2D), which employs relational association rule mining concepts and techniques. We aimed to overcome some of the limitations of previous systems such as mode declaration, requirement of negative data and discarding aggregate predicates. C2D utilizes absorption operator of inverse resolution for generalization of concept instances in the presence of background knowledge and refines these general patterns into frequent and strong concept definitions with an Apriori-based specialization operator based on confidence. A challenging problem of relational concept discovery is dealing with intractably large search space. Several relational knowledge discovery systems have been developed employing various search strategies, heuristics, language pattern limitations and hypothesis evaluation criteria, in order to prune the search space. However, there is a trade-off between pruning the search space and generating high-quality patterns. Therefore, the idea is to balance this trade-off with effective pruning mechanisms. To do this, C2D utilizes four new pruning mechanisms. Aggregate functions provide a rich mechanism for expressing the characteristics of the relations having one-to-many relationships among them. Such relationships are common in databases. In concept discovery, conditions on aggregation such as count < 10 or sum > 100 may define the basic characteristic of a given concept better. For this reason, in this work, we extend the background knowledge with aggregate predicates in order to characterize the structural information that is stored in tables and associations between them. Aggregate predicates have numeric attributes by their nature. Therefore, in order to add aggregate predicates into the system numeric attribute types should also be handled. Since it is not useful and feasible to define concepts on specific numeric values, in this work, only comparison operators containing numeric attributes are considered in concept discovery. When the target concept has common attribute types with only some of the background predicates, the rest of the predicates (which are called unrelated relations) can never take part in hypothesis. This prevents the generation of transitive rules through such predicates, which is an important drawback when transitive rules are the only way to describe the target concept. To solve this problem, an optional built-in function is implemented in C2D which generates transitive rules. Major contributions of this work can be listed as follows: 1. The main difficulty in relational ILP systems is searching in intractably large hypothesis spaces. In order to cope with this problem, relational ILP systems put strong declarative biases on the semantics of hypotheses. In this work, we aimed to relax the declarative biases in such a way that body clauses may have variables which do not exist in the head predicate. On the other hand, in order to reduce the search space, a confidence-based pruning mechanism is used. 2. Many multi-relational rule induction systems require the user to determine the input–output modes of predicate arguments. Since mode declarations require a high level Prolog and domain knowledge, it is not meaningful to expect such a declaration from an ordinary user. Instead of this, we use the information about relationships between entities in the database if given. Therefore, in this work, the novel user knowledge about domain is not required. 3. Muggleton shows that (Muggleton, 1996), the expected error of an hypothesis according to positive versus all (positive and negative) examples do not have much difference if the number of examples is large enough. In other words, logic programs are learnable with arbitrarily low expected error from only positive examples. As relational databases contain only positive information, a pure multi-relational data mining system based on logic programming could be developed which relies on only positive instances stored as relations. Therefore, the proposed system directly works on relational database, without any requirement of negative instances. If negative instances exist in the database, C2D can also handle them. 4. The definition of confidence is modified to apply Closed World Assumption (CWA) in relational databases. We introduce type relations to the body of the clauses in order to express CWA. 5. The choice of hypothesis evaluation criteria is an important factor on the quality of the generated patterns. In this work, we used an improved confidence-based hypothesis evaluation criterion, namely f-metric, which will be described in the following sections. 6. Only some of the ILP-based classification systems define aggregate predicates in their algorithms. However, better rules (higher coverage and accuracy) can be discovered by using aggregate predicates in the background knowledge. To do this, aggregate predicates are defined in first-order logic and used in C2D. 7. Numerical attributes are handled in a more efficient way. The clauses having comparison operators on numerical attributes are defined and used in the main algorithm. 8. When the target concept has common attribute types with only some of the background predicates, the rest of the predicates (which are called unrelated relations) can never take part in hypothesis. This prevents the generation of transitive rules through such predicates. In order to remove this drawback, we extended the generalization mechanism in such a way that the indirectly related facts of the target concept instance are added to Apriori lattice to allow transitive rules in the hypothesis. This paper is organized as follows: Section 2 presents the related work. Section 3 gives an overview of the algorithm of C2D. Section 4 introduces aggregate predicates and presents aggregation in C2D. Section 5 gives the basic definitions for transitive rule construction and presents generating transitive rules in C2D. Finally, Section 6 includes concluding remarks.
نتیجه گیری انگلیسی
This work presents a concept discovery system, named C2D, whih combines rule extraction methods in ILP and Apriori-based specialization operator. By this way, strong declarative biases are relaxed, instead, support and confidence values are used for pruning the search space. In addition, C2D does not require user specification of input/output modes of arguments of predicates and negative concept instances. Thus, it provides a suitable data mining framework for non-expert users who are not expected to know much about the semantic details of large relations they would like to mine, which are stored in classical data base management systems C2D has a new confidence-based hypothesis evaluation criterion and confidence-based search space pruning mechanism. Conventional definition of the confidence, that is developed for query extension, is slightly modified by adding type predicates to the body of the query extension for the arguments of the head predicate that do not appear in the body. By this way, the domains are also included in the confidence calculation of the generated rules in concept discovery. Confidence-based pruning is used in the candidate filtering phase. If the confidence value of the generated clause is not higher than confidence values of its parents, it means that, the specifications through it will not improve the hypothesis to be more confident. By this way, such clauses are directly eliminated at early steps. The proposed system is tested on several benchmark problems including the same-generation, mesh design, predictive toxicology evaluation and mutagenicity test. The experiments reveal promising test results that are comparable with the performance of current state-of-the-art knowledge discovery systems. Another important result deduced from the experiments is that, most of the time f-metric generates better rules than using only confidence as an evaluation criterion. In addition, it is observed that, in sparse data sets, it is important to select the parameters (such as min-conf threshold) to induce good results. To be able to include aggregate predicates, one-to-many relationships in the schema are given and aggregate predicates on such relations are activated through predefined aggregate queries. The experiments on inclusion of aggregate predicates show that aggregate rules and comparison on numerical data provide improvement with respect to similar concept discovery systems. Although the inclusion of aggregate predicates slightly reduces the run-time efficiency, this can considered negligible with respect to the increase in the accuracy of concept descriptions. In this work, we also present a technique to generate rules that capture transitive relations. This is done by including unrelated facts in rule definitions in the first step (the most general rule generation step) of the concept discovery process. By this way, it is possible to generate rules that capture transitive relations through unrelated facts and to extract transitive rules under missing background information. Such rules are not very common, however, they become of importance for the cases where they are the only rules that describe the target concept. Another contribution of this technique is the ability to induce rules under missing background information. Similar relational knowledge discovery systems can generate transitive rules as well. However, they use mode declarations for the attributes, which requires high level logic programming and domain knowledge. Under the proposed enhancements, C2D handles transitive rule generation without mode declarations. Inclusion of the unrelated facts extend the search space. However, experimentally, it is observed that this extension has a very little effect on the execution time. This feature of the system is also optional, and it can be turned on only for the applications containing domains that have unrelated facts in the background information.