Discovery and visualization of data, and performance analysis, of an enterprise workflow
|Article code|Publication year|English article|Persian translation|Word count|
|21800|2007|18-page PDF|Available on order|Not calculated|
Publisher: Elsevier - Science Direct
Journal: Computational Statistics & Data Analysis, Volume 51, Issue 5, 1 February 2007, Pages 2670–2687
This work was motivated by a recent experience where we needed to develop enterprise operational reports when the underlying business process is not entirely known, a common situation for large companies with sophisticated IT systems. We learned that instead of relying on human knowledge or business documentation, it is much more reliable to learn from the flow structure of event sequences recorded for work items. An example of a work item is a product alarm detected and reported to a technical center through a remote monitoring system; the corresponding event sequence of a work item is an alarm history, i.e. the alarm handling process. We call the flow of event sequences recorded for work items the workflow. In this paper, we develop an algorithm to discover and visualize workflows for data from a remote technical support center, and argue that workflow discovery is a prerequisite for rigorous performance analysis. We also carry out a detailed performance analysis based on the discovered workflow. Among other things, we find that service time (e.g. the time necessary for handling a product alarm) fits the profile of a log-mixture distribution. It takes at least two parameters to describe such a distribution, which leads to the proposed method of using two metrics for service time reporting.
1.1. Motivation

In this study, we analyze data collected from the ticketing system used to support a business that specializes in the maintenance of communication equipment. Here, the work items are product alarms detected and reported to a technical support center through a remote monitoring system. Our goals are two-fold. First, we want to understand the structure of the workflow; in other words, we try to discover and reconstruct the underlying workflow by analyzing the event sequence data recorded for the product alarms. It is important to realize that the term workflow often means different things to different people. Here, we take an operational view in which the discovered workflow represents how the system is actually used to route work items, whereas workflow in the conventional sense often means how the process is specified on paper in some business document. There are always non-trivial discrepancies between the two versions, even for well-implemented systems. Furthermore, processes are often modified or “re-engineered” as the business grows and evolves. It is very difficult, if not impossible, for ordinary users to keep track of all the changes. Through automated workflow discovery, we turn dull data into valuable information and greatly increase our ability to manage the underlying business process. Both the business and its customers can benefit from such an increased level of transparency. Our second goal is to understand the traffic patterns and to measure business performance based on the discovered workflow. Here, we treat the workflow as a queueing system where work items arrive at random times. After arrival, the work items can be routed to different queues and handled by different groups of human agents. Specifically, we study the stochastic characteristics of the arrival process, as well as the queueing and service time distributions.
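To make the idea of discovering a workflow from event sequence data concrete, here is a minimal sketch, not the algorithm the paper develops in Section 3: each work item's recorded event sequence is treated as a path, and counts of transitions between consecutive events become the edge weights of a directed graph. The event names and alarm histories below are invented for illustration.

```python
from collections import Counter

def discover_workflow(event_sequences):
    """Weight the edges of a directed graph by counting transitions
    between consecutive events across all work items."""
    edges = Counter()
    for seq in event_sequences:
        # Artificial START/END states make entry and exit points visible.
        states = ["START"] + list(seq) + ["END"]
        for a, b in zip(states, states[1:]):
            edges[(a, b)] += 1
    return edges

# Toy alarm histories: each list is the event sequence of one work item.
histories = [
    ["open", "queue_A", "resolved"],
    ["open", "queue_A", "resolved"],
    ["open", "queue_B", "escalated", "resolved"],
]
graph = discover_workflow(histories)
print(graph[("open", "queue_A")])  # 2
```

The heavily weighted edges of such a graph reveal the dominant routing paths, which is the raw material for the visualizations discussed below.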
The basic points we try to make in this paper are the following: (1) one cannot talk about performance analysis or business measurements without first understanding the business process. During the course of this study, we encountered many instances where analysts who provide information to management, or even managers who are running the business, do not have an accurate understanding of the workflow that drives the business; (2) the knowledge of a business process often does not come directly from people. In other words, instead of relying solely on human knowledge, as is often the case, we can discover much of the business process by analyzing the system log data; (3) reasonable reporting of business process performance requires understanding the statistical distribution of process times. Conventional process metrics such as “mean-time-to-X” can be very misleading. We argue that two metrics are needed in order to accurately capture time-related statistics.

1.2. Data analysis

In our analysis, we first use directed graphs to visualize the flow structure of the alarm handling processes. Such a directed graph enables us to learn how most of the alarms are solved and how much time it takes for an alarm to go through various stages (e.g. expert teams of technicians). We can also adapt the graphs for a subset of alarms and/or a subset of expert teams. Second, as in Brown et al. (2005), we describe an alarm process through three components: arrival time, service time, and waiting time. The arrival times can be used to measure the traffic volume in a work center. A standard assumption is that the arrival times of random events follow an inhomogeneous Poisson process. Similar to Brown et al. (2005), we adopt a non-parametric test to validate the Poisson assumption. We find that the inter-arrival times for alarms tend to be shorter than what could be produced by a Poisson process.
But since the deviation from the Poisson assumption is small, we argue that the Poisson model is reasonable in practice. Once we validate the model assumptions, we further need to model the volume of processed alarms, i.e. the alarm arrival rate. For our data, we find that the arrival rate follows a weekly cycle but not a daily cycle. This cycle can be estimated through an additive linear model. The service time is closely related to the performance of a work center. We find that the log service time for alarms follows a mixture of normal distributions. We identify three mixture components: 70% of the alarms are handled in 22 min, 24% of the alarms are handled in less than a minute, and 6% of the alarms are handled in more than 1 day. The waiting time can be viewed as a measure of service quality. We find that the waiting time for product alarms follows a log-normal distribution rather than the exponential distribution predicted by queueing theory. In fact, in terms of waiting time, there are two components: major and minor alarms. One immediate consequence is that service quality cannot be measured accurately using conventional statistics such as mean-time-to-response. Also, using survival analysis techniques, we estimate that approximately 54% of the product alarms routed to a queue are transient in nature and would clear by themselves given enough time. In reality, we cannot afford to wait forever: service providers have an obligation to respond when a problem is reported. To improve efficiency, we can leverage technologies such as artificial intelligence, since many transient problems can be cleared by automated expert systems.

1.3. Summary of the paper

The remaining parts of the paper are organized as follows: in Section 2, we describe the data to be analyzed. In Section 3, we describe the algorithm that will be used to discover and to visualize workflows.
Sections 4–6 are devoted to the study of traffic patterns while treating the discovered workflow as a queueing system. In Section 7, we describe an approach for designing operational reports that do not use conventional metrics such as “mean-time-to-X”.
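As an illustration of the log-mixture finding described above, the following sketch fits a two-component normal mixture to log service times with a plain EM iteration. This is illustrative code run on synthetic data under our own assumptions (component locations, weights, and sample sizes are invented), not the estimation procedure used in the paper.

```python
import math
import random
import statistics

def fit_lognormal_mixture(times, iters=100):
    """Two-component Gaussian-mixture EM on log service times."""
    x = [math.log(t) for t in times]
    xs = sorted(x)
    half = len(xs) // 2
    # Crude initialization: split the sorted log-times at the median.
    mu = [statistics.mean(xs[:half]), statistics.mean(xs[half:])]
    sd = [statistics.pstdev(xs[:half]) or 1.0, statistics.pstdev(xs[half:]) or 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of component 0 for each observation.
        resp = []
        for xi in x:
            p0 = w[0] * statistics.NormalDist(mu[0], sd[0]).pdf(xi)
            p1 = w[1] * statistics.NormalDist(mu[1], sd[1]).pdf(xi)
            resp.append(p0 / (p0 + p1))
        # M-step: re-estimate weights, means, and standard deviations.
        n0 = sum(resp)
        n1 = len(x) - n0
        w = [n0 / len(x), n1 / len(x)]
        mu = [sum(r * xi for r, xi in zip(resp, x)) / n0,
              sum((1 - r) * xi for r, xi in zip(resp, x)) / n1]
        sd = [max(1e-3, math.sqrt(sum(r * (xi - mu[0]) ** 2 for r, xi in zip(resp, x)) / n0)),
              max(1e-3, math.sqrt(sum((1 - r) * (xi - mu[1]) ** 2 for r, xi in zip(resp, x)) / n1))]
    return w, mu, sd

# Synthetic data mimicking the qualitative finding: most alarms are handled
# quickly, while a minority take much longer (a second log-normal component).
random.seed(1)
times = ([math.exp(random.gauss(0.0, 0.3)) for _ in range(700)]
         + [math.exp(random.gauss(4.0, 0.3)) for _ in range(300)])
w, mu, sd = fit_lognormal_mixture(times)
print(round(max(w), 2), round(max(mu), 2))
```

Because the mixture needs at least two parameters (a weight and a location per component) to summarize, a single "mean-time-to-X" number cannot describe it, which motivates the two-metric reporting scheme of Section 7.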
Conclusion
This project was started by a simple request from the alarm team to produce operational reports “just like what they use in call centers”, that is, reports showing metrics related to volume, service quality, productivity, etc. It soon turned out that in order to do so, one first has to understand the business process and try to match that knowledge with the available data, i.e., how the process is actually implemented. This leads to the notion of workflow discovery and visualization. Later on, as we started to develop reports based on user requests, we found that conventional “time-to-X” types of metrics can be completely misleading. Using the mean or the median, for example, the report frequently ends up saying that it takes several hours for a technician to process an alarm. According to the operations managers, it should not take more than half an hour to handle a typical alarm during normal business hours. So what went wrong? As it turns out, this discrepancy is caused mainly by the presence of unobserved heterogeneity among the alarms routed to the queue. A small but significant percentage of alarms are kept open for several days, which severely skews performance measures derived from conventional statistics. This leads to the notion of performance analysis as we discussed earlier. As a direct result of this study, we were able to design and implement a set of operational reports that meet the user requirements while accommodating the unusual traffic patterns of the alarm team workflow. A unique feature of these reports is that we use two numbers to measure most of the time-related metrics. This is based on the empirical finding that many of the time intervals follow a log-mixture distribution. Using the popular “mean-time-to-X” metrics implicitly assumes that the underlying distribution can be characterized by a single parameter (e.g. the exponential distribution).
Even if we replace the mean with the median, the underlying assumption is still the same. In our case, at least two parameters are necessary to characterize a log-mixture distribution. The solution we propose is to choose a truncation threshold. For time intervals below the threshold, we use the mean to measure time. For time intervals exceeding the threshold, instead of measuring time, we simply report the proportion of cases exceeding the threshold. In theory, one can estimate the parameters of the individual components of the mixture distribution and use these parameters to measure process time. Technically, this would be a more accurate way to describe the process time distribution. However, for business reporting, this improved accuracy is unlikely to add practical value. First, the reports would be much more difficult for the average business person to interpret. Second, the parameters of the second component of the mixture distribution often have little practical meaning. All we need to know is that a certain percentage of alarms take much longer to resolve; exactly how long is not important. Table 3 shows a sample daily report where the threshold is chosen to be h hours (for confidentiality reasons, we shall not specify the value of h).
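A minimal sketch of the proposed two-metric report is below. The 60-minute threshold and the sample service times are hypothetical; the paper's actual threshold h is deliberately withheld.

```python
def two_metric_report(times_minutes, threshold):
    """Return (mean service time among cases completed within the threshold,
    proportion of cases exceeding the threshold)."""
    within = [t for t in times_minutes if t <= threshold]
    mean_within = sum(within) / len(within) if within else float("nan")
    frac_over = 1 - len(within) / len(times_minutes)
    return mean_within, frac_over

# Hypothetical service times in minutes; one alarm was kept open for two days.
times = [5, 12, 20, 25, 30, 2880]
mean_w, frac = two_metric_report(times, threshold=60)
print(mean_w, frac)  # 18.4 0.16666666666666663
```

Note how the single two-day outlier would drag the plain mean up to roughly 495 minutes, while the two-metric report correctly says that typical alarms take about 18 minutes and one case in six exceeds the threshold.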