Discovering process models from their instances
Article code | Publication year | English article pages |
---|---|---|
471 | 2002 | 17-page PDF |
Publisher : Elsevier - Science Direct
Journal : Decision Support Systems, Volume 34, Issue 1, December 2002, Pages 41–57
English Abstract
A thorough understanding of the way in which existing business processes are currently practiced is essential from the perspectives of both process reengineering and workflow management. In this paper, we present a framework and algorithms that derive the underlying process model from past executions. The process model employs a directed graph to represent the control dependencies among activities and associates a Boolean function with each edge to indicate the condition under which the edge is to be enabled. By modeling the execution of an activity as an interval, we have developed an algorithm that derives the directed graph in a faster, more accurate manner. This algorithm is further enhanced with a noise handling mechanism to tolerate noise, which frequently occurs in the real world. Experimental results show that the proposed algorithm outperforms the existing ones in terms of efficiency and quality.
English Introduction
The research on process discovery traces its origin to grammar discovery in the early 1970s [4]. The goal was to identify the underlying grammar from a finite number of sample strings. Grammars are commonly represented as finite state machines (FSMs). Once a correct FSM has been identified, it can be used to determine the correctness of a given input string. More recently, researchers have started to adapt existing grammar discovery algorithms to the problem of process discovery [9] and [11]. The idea is to treat an execution of a process as a string of events, each of which represents an execution outcome of an involved activity. With several executions of the same process as input, these algorithms can synthesize a process definition that best fits the historical data. Process definitions are described in the form of FSMs. For example, consider Fig. 1(a), an example FSM of the program development process [9], which involves three sequential steps: code modification, compilation, and testing. After the code is modified (G), the subsequent compilation is performed and produces a result of either OK (I) or not OK (H). If the compilation is not okay, the code has to be modified again and the procedure repeated; otherwise, a testing activity is performed. A successful test (K) ends the process, and a failed test (J) calls for the repetition of the entire procedure. Fig. 1(b) and (c) show the FSMs discovered by two different algorithms, KTAIL and Markov [9], respectively.

The original FSMs, like the one shown in Fig. 1(a), are easy to perceive and can be converted to an activity-based process model without much difficulty. In fact, in Fig. 1(a), each state corresponds to the execution of exactly one real-world activity, and its outgoing transitions represent the possible execution results. However, in a derived FSM, such as that in Fig. 1(b) or (c), a state may not have a clear semantic meaning, and an execution outcome may appear in the transitions of multiple states. While a derived FSM can be used to verify the validity of a given execution, it is difficult to infer the underlying process model from such an FSM.

Realizing the inadequacy of the FSM as a process model, Agrawal et al. [2] took a different approach, producing a process definition in the form of a directed graph, with vertices being the set of constituent activities and edges the set of control dependencies among them. As the directed graph is the main process model adopted by most workflow systems [27], the output of their algorithm can be readily employed in a commercial workflow system. The idea of their algorithm is quite simple. Each activity in a process execution is represented as an instantaneous event, say its start or end. With this simplification, an execution of a process can be represented as a sequence of activities. Their algorithm then tries to locate all possible dependencies in each execution instance. For example, the execution {ABDCE} implies the following dependencies: {A→B, A→D, A→C, A→E, B→D, B→C, B→E, D→C, D→E, C→E}. Dependencies derived from all process executions are first unioned, and those that appear in both directions are then removed. That is, for any pair of activities A and B, if both dependencies A→B and B→A are found in some (but different) executions, the two activities are pronounced independent. Transitive reduction is finally performed for each process instance to induce a minimal graph. This final step dominates the whole procedure in terms of running time. The total time complexity is O(N×n³), where N is the number of instances and n is the number of activities of a process.

Their algorithm uses a simple mechanism for handling cycles. Appearances of the same activity in an execution are first treated as distinct activities.
After the algorithm described above is performed to identify all transitions, vertices of the same activity are finally merged into a single one. However, this simple method may unnecessarily remove some dependencies. Consider the example process shown in Fig. 2, in which a cycle exists between activities B and E, and an or-split branch follows activity B. In one process instance, suppose B and C are executed 10 times and 4 times, respectively. In another instance, B and C are performed 10 times and 3 times, respectively. Let B10 denote the tenth occurrence of activity B and C3 the third occurrence of activity C in a process instance. From the first process instance, C3→B10 is inferred, while B10→C3 is induced from the second. As a result, B10 and C3 are considered independent. In the final transitive reduction step, spurious transitions (e.g., B→E) would remain because of the incorrect conclusion about the relationship between B and C.

1.2. Contribution

We present a novel approach to the problem of process discovery. Just as Agrawal et al.'s algorithm [2] uses directed graphs as the process model for easier interpretation, so does our approach. Our approach not only handles cycles correctly but also produces more information in a faster manner. Specifically, the directed graph derived by our approach embodies the control dependencies between activities as well as the conditions pertaining to them. These two major components together make our work a complete framework for process discovery. Unlike other work on process discovery, which makes use of only an instantaneous event for each activity instance, our work assumes an interval for the execution of an activity instance. For those processes whose executions are either automated (via WFMSs) or monitored (through some monitoring tool), the execution intervals of their constituent activities are already available.
However, manual processes may provide only partial information, such as the completion times of their activities. In this case, the starting time of an activity instance can still be inferred by subtracting the anticipated execution duration from its completion time. Consequently, the derived intervals may contain noise, which must be screened out by some noise handling mechanism, as will be discussed in Section 5. With execution intervals of activities as the input, we are able to develop an algorithm that produces a process model closer to the real one in a shorter time.

This paper is structured as follows. Section 2 describes the process model and the kind of data we assume to be available. Section 3 presents two algorithms that derive the control dependencies and the associated conditions, respectively. Section 4 compares the performance of the proposed algorithm with that of the algorithm proposed by Agrawal et al. [2] by applying synthetic datasets to both algorithms. In Section 5, we consider a practical environment in which noisy data is present, and discuss strategies for dealing with noise. Finally, Section 6 summarizes the paper and provides potential directions for future research.
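The dependency derivation and transitive reduction steps attributed to Agrawal et al. above can be sketched in a few lines of Python. This is a minimal illustration of the {ABDCE}-style example in the text; the function names are ours, and the simplified single-intermediate reduction relies on each instance contributing a transitively closed set of precedences, as in the description above.

```python
from itertools import combinations

def derive_dependencies(executions):
    """Union pairwise precedences from each execution, then drop any
    pair observed in both directions (the independence rule above)."""
    deps = set()
    for run in executions:
        for i, j in combinations(range(len(run)), 2):
            deps.add((run[i], run[j]))  # run[i] precedes run[j]
    # activities seen in both orders across runs are independent
    return {(a, b) for (a, b) in deps if (b, a) not in deps}

def transitive_reduction(deps):
    """Remove edge a->c when some b gives edges a->b and b->c.
    Sufficient here because per-instance precedences are transitively
    closed before the bidirectional pairs are removed."""
    reduced = set(deps)
    for a, c in deps:
        for b in {y for (x, y) in deps if x == a}:
            if b != c and (b, c) in deps:
                reduced.discard((a, c))
    return reduced

# D and C appear in both orders across the two runs -> independent
runs = ["ABDCE", "ABCDE"]
deps = derive_dependencies(runs)
graph = transitive_reduction(deps)
```

On these two runs, the bidirectional pair D/C is dropped, and the reduced graph splits after B into the parallel branches C and D, which join again at E.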
English Conclusion
We have presented a novel approach to discovering a process model from past executions. The process model includes two components: control dependencies represented by a directed graph, and the conditions pertaining to these dependencies. We have shown through both theoretical analysis and experiments that our proposed algorithm for deriving the control dependencies executes faster and returns more accurate results. Using intervals to model activity executions contributes to this improvement. In addition, we developed a noise handling mechanism that is incorporated into the control dependency derivation algorithm. Experimental results showed that the algorithm performs well in a noise-prone environment. We have also described how to induce the conditions associated with dependencies via classification techniques. These two major components together make our work a complete framework for process discovery.

Many applications naturally generate historical information in the form of interval sets. While the framework and algorithms proposed in this paper can be applied in some of these domains, they may not fit perfectly in others, partly because of an important assumption made by virtually all process discovery research: a unique underlying process model that encompasses all process instances is assumed to exist. Consider a project executed on several occasions to achieve a similar goal. Each project execution is characterized by a Gantt chart, which comprises a set of intervals. Project executions carried out at different corporations, though pursuing the same objective, may take significantly different approaches. Thus, we can no longer expect the existence of a single process model to which all process instances conform. What we can best hope for are some partial process models, each of which represents features exhibited by a significant number of process instances.
Viewed this way, the problem becomes somewhat like those discussed in the data mining (or knowledge discovery) community, especially the sequential pattern discovery problem. However, most research in the context of sequential pattern discovery aims to find total orders on some constituent items, whereas we are interested in finding structures like the process model digraph described in this paper. The intrinsic sophistication of the process model digraph makes this problem challenging. We have embarked on a study of this problem, and a preliminary result has been reported in Ref. [28].
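The interval view of activity executions that underlies this framework, in which each run supplies a (start, end) pair per activity, can be illustrated with a small sketch. The tuple representation, the function name, and the threshold-free comparison below are assumptions made for illustration, not the paper's actual algorithm: an activity that always finishes before another starts is a plausible dependency, while overlapping intervals suggest concurrency.

```python
def relation(runs, a, b):
    """Classify the relation between activities a and b from interval logs.
    runs: list of dicts mapping activity name -> (start, end)."""
    shared = [r for r in runs if a in r and b in r]
    a_before = all(r[a][1] <= r[b][0] for r in shared)  # a ends before b starts
    b_before = all(r[b][1] <= r[a][0] for r in shared)  # b ends before a starts
    if a_before and not b_before:
        return "a->b"
    if b_before and not a_before:
        return "b->a"
    return "concurrent"  # intervals overlap in at least one instance

logs = [{"A": (0, 2), "B": (3, 5), "C": (3, 6)},
        {"A": (0, 1), "B": (2, 6), "C": (4, 7)}]
```

Here A always completes before B starts, so A→B is plausible, whereas B and C overlap in both runs and are classified as concurrent; a noise handling mechanism like the one discussed in Section 5 would relax the all-instances requirement.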