استنتاج وابستگی عملکردی و جاسازی شده: نقطه نظر داده کاوی
|کد مقاله||سال انتشار||مقاله انگلیسی||ترجمه فارسی||تعداد کلمات|
|22028||2001||30 صفحه PDF||سفارش دهید||14060 کلمه|
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Information Systems, Volume 26, Issue 7, November 2001, Pages 477–506
The issue of discovering functional dependencies from populated databases has received a great deal of attention because it is a key concern in database analysis. Such a capability is strongly required in database administration and design while being of great interest in other application fields such as query folding. Investigated for long years, the issue has been recently addressed in a novel and more efficient way by applying principles of data mining algorithms. The two algorithms fitting in such a trend are TANE and Dep-Miner. They strongly improve previous proposals. In this paper, we propose a new approach adopting a data mining point of view. We define a novel characterization of minimal functional dependencies. This formal framework is sound and simpler than related work. We introduce the new concept of free set for capturing source of functional dependencies. By using the concepts of closure and quasi-closure of attribute sets, targets of such dependencies are characterized. Our approach is enforced through the algorithm FUN which is particularly efficient since it is comparable or improves the two best operational solutions (according to our knowledge): TANE and Dep-Miner. It makes use of various optimization techniques and it can work on very large databases. Applying on real life or synthetic data more or less correlated, comparative experiments are performed in order to assess performance of FUN against TANE and Dep-Miner. Moreover, our approach also exhibits (without significant additional execution time) embedded functional dependencies, i.e. dependencies captured in any subset of the attribute set originally considered. Embedded dependencies capture a knowledge specially relevant in all fields where materialized data sets are managed (e.g. materialized views widely used in data warehouses).
In order to guarantee data consistency, integrity constraints are essential. Various types of integrity constraints have been studied ,  and . Among them, it is widely recognized that functional dependencies are the most important and common semantic constraints encountered in databases ,  and . More precisely, a functional dependency between two sets of attributes (X,Y), denoted by X→Y, holds in a relation if values of the latter set are fully determined by values of the former . Functional dependencies are a key concept in relational theory and the foundation for data organization when designing relational databases ,  and . However, their existence within data sets is independent of the relational model, they are therefore essential as soon as large data sets must be stored, handled, and updated whatever the underlying data model is (object oriented, N1NF, multi-dimensional, etc.). Let us quote the normalization theory, proposed in , for object oriented models, and the extension of functional dependencies in the context of nested relational models . This remark being done, the background in which the rest of the paper fits is the relational model , besides equivalent schemas are provided , ,  and . Discovering functional dependencies from existing databases is an important issue, investigated for long years , ,  and , and recently addressed with a data mining viewpoint, in a novel and much more efficient way ,  and . The approach presented in this paper fits in such a trend and aims to discover minimal functional dependencies, and embedded dependencies which are valid over a subset of the original data source. Before giving an overview of these recent contributions and ours, we describe in more depth the application fields, more than ever up to date, for which extracting functional dependencies from populated databases is critical and the ones for which carrying such dependencies into views, or more generally materialized data sets, is strongly required.
نتیجه گیری انگلیسی
In this paper, we investigate the problem of functional dependency inference by following the ideas recently introduced in related work, i.e. by adopting a data mining point of view. We attempt to draw a complete and current picture of the studied issue. In order to meet this objective, we place much more emphasis on the two best operational approaches, according to our knowledge, TANE and Dep-Miner . The approach that we propose fits in a similar trend since it applies principles of level-wise algorithms for mining data sets. It introduces a novel characterization of minimal FDs, soundly founded and much simpler than previous proposals. Actually, it is based on three mere concepts which come together for providing a sound and elegant characterization of minimal FDs (cf. Section 3). Furthermore, it defines rules proved to be correct and which can improve the underlying operational solution, specially when intending a level-wise algorithm. The associated implementation, the algorithm FUN, turns out to be a very efficient solution. For highlighting the advantages of our approach, we propose an in-depth comparison with TANE and Dep-Miner (cf. Section 8) and present various experimental results obtained under the very same conditions since the three algorithms were implemented using the very same programming style and development environment (cf. Section 6). These experiments show the efficiency of FUN when compared to TANE and Dep-Miner. Actually, 72 experiments were performed on synthetic data, varying the relation cardinality, its number of attributes and the data correlation rate. In 68 cases, FUN runs faster than the two other algorithms and Dep-Miner slightly outstrips FUN in only 4 cases. For real life data, FUN is always more efficient than TANE and Dep-Miner. Reasons explaining the behavior of the three algorithms are given in Section 8. Moreover, performed experiments show the good scale-up properties of FUN: its efficiency when compared to TANE and Dep-Miner increases in proportion as the number of attributes grows. When raising the correlation rate, this result is even more emphasized. Along with discovering the canonical cover of FDs, the three approaches provide additional and different capabilities. TANE deals with approximate FDs, Dep-Miner addresses the construction of Armstrong relations and FUN is interested in exhibiting embedded FDs. For discovering minimal embedded dependencies, our solution is straightforward from FUN results whereas the algorithm RBR  can be exponential depending on features of the initial FD set. Let us notice that RBR yields a canonical cover of embedded dependencies only if its input is a canonical cover. For discovering approximate dependencies, TANE makes use of an error measure g3 defined in . The concept of approximate FD is specially interesting when considering operational databases since a single tuple, perhaps erroneous or capturing an exception, can invalidate an FD. We extended our approach for mining approximate dependencies. In such a context, the concept of closure is extended by considering the error measure g3 instead of cardinality . Dep-Miner performs the generation of Armstrong relations which exemplify discovered FDs and facilitate their elucidation. In , it is shown that such relations could be very small and thus really relevant for end users. We would like to complement our approach in order to compute such relations and we are also interested in other types of dependency, such as inclusion or multivalued dependencies and in other related problems such as minimal key inference or 3NF and BCNF tests . Finally, our driving idea is to integrate these various capabilities within an aided tool for database administration and design. When building such a system, we aim to follow from the approach introduced in , which identifies a “common data centric step” for a broad class of data mining algorithms. It results in defining a generic middleware providing the basis for developing various solutions in an easy and not costly way. We believe that FUN, and specially the free set discovery, can offer such a generic middleware for addressing various problems in database analysis.