Toward data mining engineering: A software engineering approach
Article code | Publication year | Length of English article |
---|---|---|
22126 | 2009 | 21 pages (PDF) |
Publisher: Elsevier - Science Direct
Journal: Information Systems, Volume 34, Issue 1, March 2009, Pages 87–107
Abstract
The number, variety and complexity of projects involving data mining or knowledge discovery in databases activities have recently increased at such a pace that aspects related to their development process need to be standardized for results to be integrated, reused and interchanged in the future. Data mining projects are quickly becoming engineering projects, and current standard processes, like CRISP-DM, need to be revisited to incorporate this engineering viewpoint. This is the central motivation of this paper, which argues that experience gained about the software development process over almost 40 years could be reused and integrated to improve data mining processes. Consequently, this paper proposes to reuse ideas and concepts underlying the IEEE Std 1074 and ISO 12207 software engineering model processes to redefine and add to the CRISP-DM process and make it a data mining engineering standard.
Introduction
In its early days, software development focused on creating programming languages and algorithms that were capable of solving almost any problem type. The evolution of hardware, continuous project planning delays, low productivity, heavy maintenance expenses and failure to meet user expectations had led by 1968 to the stagnation of software development, causing what came to be known as the software crisis, a term coined at the first NATO conference on software development [1]. This crisis was caused by the fact that there were no formal methods and methodologies, support tools or proper development project management, all of which were standard techniques used in projects developed in other classical branches of engineering. The software community realized what the problem was and decided to borrow ideas from other fields of engineering, which it incorporated into software project development. This was the origin of software engineering (SE). From then on, process models and methodologies for developing software projects began to materialize. Software process models describe the tasks to be performed to develop a software system, whereas development methodologies schedule the tasks and specify what methods to use to do the tasks [2]. Software development improved considerably as a result of the new methodologies. This solved some of its earlier problems, and little by little software development grew to be a branch of engineering. This shift means that project management and quality assurance problems are being solved. Additionally, it is helping to increase productivity and improve software maintenance. Maintenance is one of the major problems in software development, as it can amount to up to two-thirds of the costs over a software system's lifetime [2]. The history of knowledge discovery in databases (KDD), now known as data mining (DM), is not much different, at least so far.
In the early 1990s, when the KDD processing term was first coined [3], there was a rush to develop DM algorithms that were capable of solving all of a company's problems of searching for knowledge in large volumes of data. Apart from developing algorithms, tools (Clementine [4], [5] and [6], IBM Intelligent Miner [7] and [8], Weka [9], DBMiner [10]) were also developed to simplify the application of DM algorithms and provide some sort of support for all the activities involved in the KDD process. From the viewpoint of DM process models, the year 2000 marked the most important milestone, as this was when the first standard and tool-independent DM process model was published. This standard is known as CRISP-DM (CRoss-Industry Standard Process for DM) [11] and [12]. The number of applied projects in the DM area is expanding rapidly [13]. This growth is confirmed by reports by the Gartner Group [14] and [15] and Forrester Research [16]. The Gartner Group estimates [14] that there will be an upsurge of DM projects over the next decade (over 300%) to improve customer relationships and help companies listen to customers. Another Gartner Group report [15] claims that enterprises in the DM area grew by 4.8% from 2005 to 2006, and DM is now the area in which companies are investing most. While it is true that a lot of DM projects are being developed, not all project results are in use [17], [18] and [19], nor do all projects end successfully [20] and [21]. The failure rate is actually as high as 60% [22]. Deployed by about 50% of respondents, CRISP-DM is the most commonly used methodology for developing DM projects [23], [24] and [25]. However, its use is not becoming any more widespread due to rivalry with other, in-house methodologies developed by work teams, which account for almost another 30%.
All the above goes to show that while CRISP-DM was an improvement on the earlier state of affairs, the process model is perhaps not yet mature enough to deal with the complexity of the problems it has to address. This detracts from the effectiveness of its deployment, as it does not produce the expected results. Are we at the same point as SE was in 1968? Certainly not, but we do not appear to be on a par yet either: DM cannot be considered as mature a field as SE [26]. Table 1 compares DM's history with SE's past. Looking at the KDD process and how it has progressed, we find that there is some parallelism with the advancement of software. From this viewpoint, DM project development is at stage 4, and is defining development methodologies to be able to cope with the new project types, domains and applications that organizations have to come to terms with. SE has reached stage 5, where development processes pay special attention to organizational, management or other parallel activities not directly related to development, such as project completeness and quality assurance. CRISP-DM has not yet been sized for these tasks, as it is very much focused on pure development activities and tasks.
This paper is motivated by the idea that DM problems are taking on the dimensions of an engineering problem. Therefore, the processes to be applied should include all the activities and tasks required in an engineering process, tasks that CRISP-DM might not cover. Our proposal is to enhance CRISP-DM by embedding other current standards, as suggested in [27], inspired by the work done recently in SE derived from other branches of engineering and from developer experience.
Conclusion
The premise of this paper is that SE's maturity would mean that its standard processes, better tailored to the large and complex projects that are now being developed in the field of DM, would account for aspects not covered in DM's current development standard: CRISP-DM. After analyzing SE standards, IEEE Std 1074 or ISO 12207, we developed a joint model that we used to compare SE and DM procedures process by process and activity by activity. This comparison highlighted that CRISP-DM fails to address many tasks related to project management, organization and quality in enough detail to be able to deal with the complexity of projects now under development, if at all. These projects tend to involve not only the study of large volumes of data but also the management and organization of large interdisciplinary human teams. As a result, we proposed a process model for DM engineering that covers such aspects, making a distinction between what is a process model and what is a methodology and life cycle. The proposed process model is a correct and adequate organizational framework for DM project development activities, in which it is also specified which activities are already being carried out correctly (albeit organized differently) and which need to be improved or created from scratch. It includes all the activities covered in CRISP-DM, but spread across process groups according to more comprehensive and advanced standards of a better established branch of engineering with over 40 years of experience: SE. The validity or benefit of the proposed framework would not need to be demonstrated experimentally, because it follows from its validity and benefit when applied in other engineering projects, like SE projects. The model is not complete, as this paper merely states the need for the processes and especially the activities set out in IEEE Std 1074 or ISO 12207 but missing in CRISP-DM. 
The adaptation and detailed specification of these processes is outside the scope of this paper. This overview is the basis for further research. First, the processes that are missing or only partially covered by CRISP-DM need to be specified and tailored from their IEEE Std 1074 or ISO 12207 counterpart. Second, possible types of life cycle for a DM project need to be examined and specified. Some existing SE life cycles, like the waterfall, incremental or iterative life cycles, perhaps already exist in DM, but have not been identified as such; others will be exclusive to DM. Third, the process model specifies what to do, but not how to do it. The how is determined by the methodology used, meaning that the different methodologies that are being used for each process (like the methodology proposed in DM industrial engineering or CRM catalyst) would have to be examined and tailored to the model. And, finally, any methodology has a number of associated techniques and tools. Many such techniques and tools have already been developed in DM (such as Clementine or the neural networks technique), but others have not. As the latter are well established in SE (e.g., configuration management or business process modeling formal specification techniques and tools), it would be worthwhile looking at how they could be adapted for DM processes.
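The process-by-process, activity-by-activity comparison described in this conclusion can be pictured as a simple gap analysis. The following is only a minimal sketch under stated assumptions: the function name `coverage_gaps`, the process group names and the activity names are all hypothetical placeholders, not the joint model or the findings of the paper.

```python
from typing import Dict, Set


def coverage_gaps(
    reference: Dict[str, Set[str]],
    candidate: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return, per process group, the reference activities the candidate lacks."""
    gaps: Dict[str, Set[str]] = {}
    for group, activities in reference.items():
        # A group absent from the candidate counts as covering nothing.
        missing = activities - candidate.get(group, set())
        if missing:
            gaps[group] = missing
    return gaps


# Hypothetical placeholder data (an assumption for illustration only):
# a reference process model and a candidate model to be checked against it.
reference_model = {
    "project management": {"planning", "monitoring and control"},
    "development": {"requirements", "design", "implementation"},
    "integral": {"quality assurance", "configuration management"},
}
candidate_model = {
    "project management": {"planning"},
    "development": {"requirements", "design", "implementation"},
}

gaps = coverage_gaps(reference_model, candidate_model)
for group in sorted(gaps):
    print(group, "->", sorted(gaps[group]))
```

Representing each process group as a set of activities keeps the comparison to a set difference per group. A real comparison would also need a way to express partial coverage, which this sketch does not model.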