Download English ISI Article No. 21795
Persian Translation of the Article Title

Discovering the software process by means of stochastic workflow analysis

English Title
Discovering the software process by means of stochastic workflow analysis
Article Code: 21795
Publication Year: 2006
Length: 9 pages (PDF)
Source

Publisher: Elsevier - Science Direct

Journal: Journal of Systems Architecture, Volume 52, Issue 11, November 2006, Pages 684–692

Persian Translation of Keywords
Software process - Workflow management - Stochastic dynamics - Markov chains - Machine learning - Bayesian methods - Time-sequence analysis - Similarity measures
English Keywords
Software process, Workflow management, Stochastic dynamics, Markov chains, Machine learning, Bayesian methods, Time-sequence analysis, Similarity measures
Article Preview

English Abstract

A fundamental feature of the software process is its inherently stochastic nature. A convenient approach for extracting the stochastic dynamics of a process from log data is to model the process as a Markov model: in this way, the discovery of the short/medium-range dynamics of the process is cast in terms of learning Markov models of different orders, i.e. in terms of learning the corresponding transition matrices. In this paper we show that the use of a full Bayesian approach in the learning process helps provide robustness against statistical noise and over-fitting, as the size of a transition matrix grows exponentially with the order of the model. We give a specific model–model similarity definition and the corresponding calculation procedure to be used in model-to-sequence or sequence-to-sequence conformance assessment; this similarity definition could also be applied to other inferential tasks, such as unsupervised process learning.
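As an illustration of the Bayesian learning step described in the abstract, the sketch below estimates a first-order transition matrix from workflow traces using a uniform Dirichlet prior. This posterior-mean smoothing is a common choice assumed here for illustration; the paper's exact prior, state alphabet and event names are not specified in this preview.

```python
import numpy as np

def learn_transition_matrix(sequences, states, alpha=1.0):
    """Posterior-mean estimate of a first-order Markov transition matrix
    under a uniform Dirichlet(alpha) prior on each row: transitions never
    observed in the log keep non-zero probability, which guards against
    the statistical noise and over-fitting discussed above."""
    idx = {s: i for i, s in enumerate(states)}
    K = len(states)
    counts = np.zeros((K, K))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):            # consecutive event pairs
            counts[idx[a], idx[b]] += 1
    return (counts + alpha) / (counts.sum(axis=1, keepdims=True) + K * alpha)

# Hypothetical workflow traces; task names are illustrative only.
logs = [["edit", "compile", "test", "edit"],
        ["edit", "compile", "compile", "test"]]
P = learn_transition_matrix(logs, states=["edit", "compile", "test"])
print(P)   # each row sums to 1; unseen transitions get smoothed mass
```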

English Introduction

The software industry shows an increasing awareness of the need to invest in the accountability and quality of information technology (IT) services, providing the methodological support needed to ensure that IT-related processes achieve companies’ business goals. One of the main issues to be addressed is the lack of a solid IT governance methodology dealing with software process control on the basis of fine-grained process metrics. In principle, business process specifications should include a well-understood workflow, designed before enactment and adjusted whenever change happens. Applying top-down workflow analysis to a business process definition would bring the details of that process into focus, specifying by whom, where and when each business process activity is carried out. In software development practice, however, analysis and tuning are mostly performed on the high-level model of the process, whereas little is done about the underlying workflows: they are less known and therefore less defined, tuneable or manageable. This lack of workflow analysis makes software process governance difficult and sometimes ineffective.

In this paper, we shall discuss a bottom-up approach in which workflow knowledge is inferred from process log data by means of process mining. The term process mining refers to a collection of methods for distilling a structured process description from low-level process metrics and workflow traces. When purposely designed logs are available [1] and [2], process mining results in some form of a posteriori process model that can be compared with the a priori model, supporting process tuning and fine-grained analysis. Generally speaking, process mining can be used to exploit non-purposely designed information sources (e.g. collaborative development and design environments, or transactional systems) in order to extract indicators suitable for governance purposes (e.g. as defined by the Control Objectives for Information and Related Technology framework, COBIT [3]) or for maturity scales like the one introduced by the Software Engineering Institute’s Capability Maturity Model (CMM) [4].

Conceptually, the starting point of process mining is an extended process log, containing information about the process as it is actually being executed. The extended process log is assumed to have the following properties: (i) each event refers to a task (i.e., a well-defined step in the workflow), (ii) each event refers to a case (i.e., a workflow instance), and (iii) events are totally ordered. Usually, existing process logs need to be integrated (e.g., by encoding them in an XML data format) and carefully filtered before fulfilling those requirements. Potentially valuable sources for setting up an extended process log include, for instance – but are not limited to – collaborative development and design environments, project management tools and WFM (workflow management) systems.

In order to make process log information useful, some synthetic knowledge has to be drawn from the raw data, either in the form of local patterns or in terms of global features of the dynamics of the process. A key issue is that of mapping this synthetic knowledge onto a specific software process representation. Much work has been done on highly expressive metadata for process engineering, including reusable, modular ontologies for process description [5].
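To make properties (i)-(iii) above concrete, here is a minimal sketch of how raw tool events might be assembled into an extended process log: events are grouped by case (workflow instance) and totally ordered by timestamp. The record layout and task names are hypothetical, not the format of any particular WFM system.

```python
from collections import defaultdict

# Hypothetical raw log records: (case_id, timestamp, task).
# Field names and values are illustrative only.
raw_events = [
    ("case-2", 3, "compile"), ("case-1", 1, "edit"),
    ("case-1", 2, "compile"), ("case-2", 1, "edit"),
    ("case-1", 4, "test"),    ("case-2", 5, "test"),
]

def build_extended_log(events):
    """Group events by case and order them by timestamp, yielding one
    task sequence per workflow instance (properties i-iii above)."""
    cases = defaultdict(list)
    for case, ts, task in events:
        cases[case].append((ts, task))
    return {c: [task for _, task in sorted(evs)] for c, evs in cases.items()}

print(build_extended_log(raw_events))
# {'case-2': ['edit', 'compile', 'test'], 'case-1': ['edit', 'compile', 'test']}
```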
In our overall approach [6], a process-independent metamodel is instantiated in specific process description models, whose instances are then linked to synthetic knowledge deterministically drawn from process data. However, experience has shown that the process of learning a process model from empirical data must have a statistical character, to cope with the apparent variability in the manifestation of the underlying process dynamics. In this paper, we focus on dealing with the stochastic nature of the software process when extracting synthetic knowledge from process logs. A time-honoured approach for extracting the stochastic dynamics of a process from extended log data is to model the process as a Markov model (further candidate models will be proposed during the discussion): the discovery of the short- or medium-range dynamics of the process can then be cast in terms of learning Markov models of different orders, i.e. in terms of learning the corresponding transition matrices. However, since the size of a transition matrix grows exponentially with the order of the model, the lack of statistics can become a problem: adopting a full Bayesian approach in the learning process can provide the required robustness against statistical noise and against the risk of over-fitting.

In the next sections, after a review of the state of the art (Section 2), we provide the basic definitions of Markov models, the basic formulas for their inference from data and for instance and model comparison (for validation/conformance assessment), and a specific similarity definition together with the corresponding calculation procedure (Section 3). Experimental results, obtained over the time texture alone of a real-world development process, then show the effectiveness of this technique in process validation (Section 4). A discussion and an outline of possible developments close the paper (Section 5).
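A small sketch of the state-space growth mentioned above: an order-k model over K tasks can be encoded as a first-order model whose contexts are k-tuples, so the number of transition-matrix rows is K**k; the counting helper shows how order-k statistics would be gathered from a trace. Task names are illustrative assumptions.

```python
from collections import Counter
from itertools import product

def order_k_contexts(symbols, k):
    """All K**k contexts, i.e. the rows of an order-k transition matrix."""
    return list(product(symbols, repeat=k))

def order_k_counts(seq, k):
    """Count (k-tuple context -> next symbol) transitions in one trace;
    these counts feed the same Dirichlet-smoothed estimate as order 1."""
    counts = Counter()
    for i in range(len(seq) - k):
        counts[(tuple(seq[i:i + k]), seq[i + k])] += 1
    return counts

tasks = ["edit", "compile", "test"]              # illustrative alphabet
for k in (1, 2, 3):
    print(k, len(order_k_contexts(tasks, k)))    # 3, 9, 27 rows
print(order_k_counts(["edit", "compile", "test", "edit"], 2))
```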

English Conclusion

When trying to measure the software process by means of non-invasive software probes, one obtains fine-grained, high-dimensional log data which are not easy to use for common inferential tasks, such as process validation or classification. Furthermore, those data are difficult to interpret, due to the stochastic nature of the software process. In this paper we argue that a suitable approach to the challenge of the stochastic nature of the process is to model the process as a Markov chain and to use Bayesian methods for learning its structure and for carrying out further inferential tasks such as validation and unsupervised classification of software process instances. Our approach exploits the local, short-range, step-to-step correlations present in the process data as well as medium-range correlations (as the order of the Markov model increases).

We assessed the effectiveness of this approach by comparing sub-sequences and discriminating between those coming from the same main sequence and those coming from a different main sequence, for different sub-sequence lengths; the main sequences consisted of the time texture of the activities of different developers, collected by means of non-invasive probes triggered by events generated within the IDE, such as class or method changes. The experimental results obtained from our data samples show that this approach is valuable in recognizing whether two sequences have been produced by the same developer or by two different developers: the technique can provide a high level of precision and recall for Markov models up to the first few orders.

In the future, more advanced models will be used to achieve a more realistic modelling of the development process and of its measurement. For instance, Hidden Markov Models (HMMs) could be used to take into account the fact that the collected log events are just an indirect effect of the domain-meaningful development activities, and that any given log event cannot be deterministically attributed to a single activity. If one is interested in the actual sequence of software process phases or activities, some information could be gained through a strict file-naming policy; alternatively, one could probabilistically associate the usage of an application with a phase or activity of the software process and then infer, from the sequence of applications, the most likely sequence of activities actually taking place. Other issues still to be faced are additive noise (captured activities not related to the software process), duplicated events and data losses.

Further work will also be done in order to exploit semi-structured information processing techniques for knowledge extraction from data flows. These basic techniques can be customized and used to set up an inference chain leading from fine-grained process data, through coarse-grained process data, to high-level concepts, e.g. those related to quality and stability. Finally, traditional product metrics (e.g. size, complexity and cohesion) could be integrated with statistical properties of process event data (such as deviation from programming and process standards) to identify levels of service corresponding to quality and maintainability thresholds. An integrated process-product log could enable monitoring of outsourced development contracts, as well as estimation of quality and cost trends over time, and could support a novel, quantitative approach to “ongoing” application outsourcing management. We plan to develop these issues in our future research.
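The paper's own model–model similarity is defined in Section 3 and is not reproduced in this preview; as a simpler stand-in for the sequence-discrimination experiment described above, the sketch below scores a sub-sequence under per-developer first-order models (e.g. matrices learned with the Dirichlet-smoothed estimator sketched earlier) and attributes it to the better-scoring model. All names are hypothetical.

```python
import numpy as np

def avg_log_likelihood(seq, P, idx):
    """Average per-step log-likelihood of a task sequence under a
    first-order transition matrix P; idx maps task names to indices."""
    steps = [np.log(P[idx[a], idx[b]]) for a, b in zip(seq, seq[1:])]
    return float(np.mean(steps))

def attribute(sub_seq, models, idx):
    """Assign a sub-sequence to the developer whose learned model gives
    it the highest average log-likelihood (models: dev -> matrix)."""
    return max(models, key=lambda dev: avg_log_likelihood(sub_seq, models[dev], idx))
```

Because the matrices are smoothed, every transition has non-zero probability and the log-likelihood is always finite, so sub-sequences containing transitions unseen in training can still be scored.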