Post-processing: Bridging the gap between modelling and effective decision-support. The profile assessment grid in human behaviour
|Article code||Publication year||English article||Persian translation||Word count|
|28130||2013||7-page PDF||Available to order||4198 words|
Publisher: Elsevier - ScienceDirect
Journal: Mathematical and Computer Modelling, Volume 57, Issues 7–8, April 2013, Pages 1633–1639
The importance of post-processing the results of clustering when using data mining to support subsequent decision-making is discussed. Both the formal embedded binary logistic regression (EBLR) and the visual profile’s assessment grid (PAG) methods are presented as bridging tools for the real use of clustering results. EBLR is a sequence of logistic regressions that helps to predict the class of a new object, while PAG is a graphical tool that visualises the results of an EBLR. PAG interactively determines the most suitable class for a new object and enables subsequent follow-ups. PAG makes the underlying mathematical model (EBLR) more understandable, improves usability and contributes to bridging the gap between modelling and decision-support. When applied to medical problems, these tools can act as diagnostic-support tools, provided that the predefined set of profiles refers to, for example, different stages of a certain disease or different types of patients with the same medical problem. Being a graphical tool, PAG enables doctors to determine the profile of a patient quickly and easily in their everyday activity, without necessarily understanding the statistical models involved in the process, which used to be a serious limitation for the wider application of these methods in clinical practice. In this work, an application is presented with 4 functional disability profiles.
Knowledge discovery from data (KDD) is a discipline established by Fayyad in 1989 for: “The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”. KDD has quickly grown as a multidisciplinary research field, where advanced techniques from statistics, artificial intelligence, information systems, visualisation and other areas are combined to provide effective knowledge acquisition from huge databases (with dimensions never imagined before the Internet boom). According to Fayyad, KDD refers to high level applications that include concrete methods of data mining: “the overall process of finding and interpreting patterns from data, typically interactive and iterative, involving repeated application of specific data mining methods or algorithms and the interpretation of the patterns generated by these algorithms”. More than 20 years later, KDD also appears as a powerful methodological framework for modelling very complex phenomena or organisations (such as environmental processes or health-care systems) even when no massive data sets are available; this may be due to the intrinsically multidisciplinary approach of KDD, but also to the importance given to what happens before and after the analysis itself. In fact, KDD is marking the beginning of a new methodological paradigm: “Most previous work on KDD has focussed on [...] the data mining step. However, the other steps are of considerable importance for the successful application of KDD in practice”. Indeed, prior and posterior analyses are essential to guarantee: (i) correct and valid results: proper data cleaning and data preparation are crucial for correctness, while the accurate interpretation of results enables a complete validation process; (ii) real impact on the target domain: even when results are correct and valid, intense post-processing is often required to make results understandable and usable by end-users (who often lack strong mathematical skills).
The benefits of the multidisciplinary approach typically used in KDD, as well as the synergies between AI methods and statistics for modelling different complex domains, have been reported in earlier work. Despite the proliferation of new data mining (DM) methods, recent research has shown that only a restricted set of DM methods (the more popular and simple ones) is being used in practice. There is also evidence of a preference for applying qualitative models, such as decision trees or rule induction, rather than regression or ANOVA. We presume this is due more to the understandability and usability of the final results than to the intrinsic performance of the model. A convenient post-processing of traditional statistical models can bring the results closer to non-expert users and make friendlier a set of mathematical equations that many decision-makers often avoid. It seems clear that effective decision support on the target domain is strongly affected by the understandability of the model. The importance of the pre- and post-processing steps is clearly recognised in the scientific community. However, in practice such steps are currently undertaken in an informal manner, and more research is required to systematise them. The consequences of neglecting preprocessing in clustering applications have been analysed elsewhere. Although this is a general problem, in this work we illustrate the importance of the post-processing step and its impact on the usability of a mathematical model in the context of clustering and logistic regression.
As organizations and systems become larger and more complex, managing them becomes an increasingly difficult activity. Understanding organizations and systems is crucial for management, and decision support tools are becoming increasingly important in decision-making processes.
In many situations, understanding is much improved by a profiling model that can identify typical patterns or entities in the system and associate standard actions, protocols, treatments or decisions with every profile. Many of our previous works involved profiling and, in consequence, clustering techniques. Clustering also appears as the most used DM method for KDD in unsupervised contexts. However, clustering results consist only of a list of classes and the objects that compose them. The gap between finding the best clustering result and being able to use it for everyday decision-making is enormous. The proposal presented in this work tries to reduce this gap. Usability requires understanding the profiles, establishing a protocol or model to recognise them, and providing a friendly tool to predict profiles for new objects. For these needs, we have developed a tool that, given a set of profiles in a domain obtained using a clustering process, can easily and quickly predict the profile of a new entity. The problem is presented in Section 2. In Section 3 we present the embedded binary logistic regression (EBLR) method to post-process clustering as a combination of statistical models to recognise the classes. The profile’s assessment grid (PAG) is introduced in Section 4 as a visually friendly alternative for non-expert users. It is an interpretative tool that graphically enables the immediate identification of the profile corresponding to a single object. It is useful for predictive purposes (diagnosis or classification), and can be used in clinical practice. In Section 5 EBLR and PAG are used to find patterns of functional dependency in elderly patients, and Section 6 discusses conclusions and future work.
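As an illustration of the EBLR idea (a sequence of logistic regressions over ordered profiles), the following is a minimal sketch and not the authors' implementation. It assumes that profiles are coded 0, 1, ..., K-1 in their natural order, and that the k-th regression is fitted only on the objects not belonging to an earlier profile and separates profile k from the later ones; this nesting is our reading of "embedded" and the exact construction in the paper may differ. The function names (`fit_eblr`, `predict_eblr`), the threshold parameter `eps` and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    """P(y = 1 | x) for weights w, where w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=1.0, epochs=5000):
    """Binary logistic regression fitted by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xij in enumerate(xi):
                grad[j + 1] += err * xij
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def fit_eblr(X, labels, n_classes):
    """Assumed embedding: model k is fitted on the objects with class >= k
    and separates class k (coded 1) from all later classes (coded 0)."""
    models = []
    for k in range(n_classes - 1):
        idx = [i for i, c in enumerate(labels) if c >= k]
        Xk = [X[i] for i in idx]
        yk = [1 if labels[i] == k else 0 for i in idx]
        models.append(fit_logistic(Xk, yk))
    return models

def predict_eblr(models, x, eps=0.5):
    """Walk through the models in class order; assign the first class whose
    probability reaches eps, or the last class if none does."""
    for k, w in enumerate(models):
        if predict_proba(w, x) >= eps:
            return k
    return len(models)

# Invented toy data: one feature scaled to [0, 1], three ordered profiles.
X = [[0.0], [0.1], [0.45], [0.55], [0.9], [1.0]]
labels = [0, 0, 1, 1, 2, 2]
models = fit_eblr(X, labels, n_classes=3)
print([predict_eblr(models, [v]) for v in (0.05, 0.5, 0.95)])
```

In practice a statistical package would replace the hand-written gradient descent; it is used here only to keep the sketch free of external dependencies.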
Conclusion
There is a growing need to model increasingly complex organisations and systems. KDD offers a multidisciplinary framework to extract useful decisional knowledge from these systems. The importance of the pre- and post-processing steps in KDD processes is well known, but these steps are still underdeveloped. The usability of data mining results is critical in practice, and we think that suitable post-processing of traditional statistical models can decrease the observed end-user preference for qualitative models. Post-processing is of key importance in reducing the gap between modelling and effective decision support in real applications. In this work the impact of post-processing on clustering results in profiling activities is discussed. Clustering is the most used DM method for unsupervised KDD, and supports profiling tasks that contribute to a better understanding of complex systems and organisations. However, the gap between clustering results and real decision-making support is enormous. Usability requires models to be understandable, profiles to be recognisable and the profiles of new objects to be easily predictable. The problem is formalised in Section 2. The EBLR method is a post-processing of the clustering results, based on a combination of embedded logistic regression models. It supports the recognition of new object profiles, provided that the set of profiles can be sorted. EBLR is based on the I_k succession and the construction of embedded logistic models over it, according to the natural ordering of the classes. In many real applications, the pure EBLR statistical formulation is too difficult for the general end-user, and a procedural formulation is proposed as a more operative tool that can also be incorporated into software. The PAG is proposed as a subsequent post-processing of the EBLR models.
The PAG is a graphical transformation of the EBLR equations into a unit cube, labelled with the assigned profiles, which enables non-expert end-users to use the underlying EBLR without statistical skills. The PAG seems to be a promising tool for supporting efficient decision-making in clinical practice. It conserves the quality of the underlying EBLR model and relies on a threshold ε to control the type-I error (and associated costs) of decision making. Values around 0.5 are recommended, unless clear criteria in other directions exist. A proper gradation of colours enables the incorporation of the R-ordering in the picture. The PAG provides information about the reliability of the decision and it is also useful for following the evolution of objects over time, since it provides a graphical model for the system dynamics. The paper illustrates the contribution of PAG to the assignment of rehabilitation treatment for disabled patients, a critical population in modern public health. Results of a previous profiling have been used, and the PAG supports the classification of a new patient using just seven relevant items from the 96 included in the original WHO-DAS II scale, with a 91.7% goodness of fit. At present, an extension to use the PAG for more than four profiles is being developed using data from the field of mental health systems, and once the EBLR equations are found, a new independent sample will be used to test the quality of the recommendations. Currently ε = 0.5 is always used. In the future, guidelines for choosing the best ε will be established.
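The role of the threshold ε in the grid can be illustrated with a small sketch. It is not the published PAG: it assumes four profiles, so that the three EBLR probabilities of an object form a point in the unit cube, and it assumes a sequential labelling rule (the first coordinate that reaches ε decides the profile). The function names `assign_profile` and `follow_up`, and the "margin" used as a crude reliability indicator, are hypothetical.

```python
from typing import List, Sequence, Tuple

def assign_profile(point: Sequence[float], eps: float = 0.5) -> Tuple[int, float]:
    """Map a point of the unit cube to (profile index, margin).

    point[k] is assumed to be the probability given by the k-th embedded
    regression. The margin is the smallest distance to eps among the
    coordinates that were actually inspected; a small margin means the
    point lies close to a boundary between labelled regions.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    if not all(0.0 <= p <= 1.0 for p in point):
        raise ValueError("all coordinates must lie in [0, 1]")
    margins = []
    for k, p in enumerate(point):
        margins.append(abs(p - eps))
        if p >= eps:
            return k, min(margins)
    # No coordinate reached eps: the object falls in the last profile.
    return len(point), min(margins)

def follow_up(visits: Sequence[Sequence[float]], eps: float = 0.5) -> List[int]:
    """Profiles assigned to the same object at successive assessments."""
    return [assign_profile(v, eps)[0] for v in visits]

# Invented example: one object assessed three times.
history = [(0.9, 0.1, 0.1), (0.4, 0.7, 0.1), (0.2, 0.3, 0.6)]
print(follow_up(history))
print(assign_profile((0.6, 0.9, 0.9), eps=0.7))
```

Under this assumed rule, raising ε makes assignment to an earlier profile more demanding, which is one way to picture how the threshold trades off the costs of the different kinds of misassignment.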