Article title
Reliability modeling of a hard real-time system using the path-space approach
Article code: 7174
Publication year: 2000
Length of English article: 10 pages (PDF)
Source

Publisher: Elsevier - ScienceDirect

Journal: Reliability Engineering & System Safety, Volume 68, Issue 2, May 2000, Pages 159–168

Keywords
System reliability - Real-time system - Hard deadline - Path-space approach - Semi-Markov process

Abstract

A hard real-time system, such as a fly-by-wire system, fails catastrophically (e.g. by losing stability) if its control inputs are not updated by its digital controller computer within a timing constraint called the hard deadline. To assess and validate the reliability of such systems with a semi-Markov model that explicitly contains the deadline information, we propose a path-space approach that derives upper and lower bounds on the probability of system failure. These bounds are derived using only simple parameters, and they are especially suitable for highly reliable systems that must recover quickly. Analytical bounds are derived for the commonly encountered exponential and Weibull failure distributions under three repair strategies: repair-as-good-as-new, repair-as-good-as-old, and repair-better-than-old. Numerical examples show the bounds to be effective.

Introduction

A “hard” real-time system is characterized by a stringent timing requirement that must be met to avoid catastrophe [1]. This timing information must therefore be accounted for when the reliability of a hard real-time system is modeled or measured. Embedding this information also lets reliability models handle temporary malfunctions caused, for example, by electromagnetic interference (EMI) [2]. One class of examples is a real-time control system in which the dynamics of the controlled plant or process (robots, nuclear reactors, or paper mills) keep the plant within the safe region as long as controller malfunctions do not persist too long. In real-time control systems such as aircraft or satellites, the system must be directed by an appropriate controller computer in a timely manner; that is, its control input must be updated by the controller computer within a time limit called the hard deadline [3]. For safety-critical applications, this property has led to highly redundant/reconfigurable controllers. Some conventional reliability models for digital control systems ignored temporary periods of controller misbehavior, assuming that a (perfect) controller must always be failure-free to manage the underlying controlled plant. Other models captured the details of such systems by focusing only on the states of the fault-tolerant controller computers, treating a temporary controller failure as a total system failure regardless of the requirements of the controlled plant [4] and [5]. That is, they erred on the safe side by ignoring the “system inertia”, or system resilience, in tolerating a temporary loss of the controller. However, a system/plant can survive repeated temporary controller perturbations because of plant dynamics and inertia.
In contrast, this paper deals primarily with system failures that result from temporary controller upsets, taking into account the system inertia specified by the deadline information. In other words, a system failure is caused by a slow recovery from a controller misbehavior, one that takes longer than the hard deadline, which depends intrinsically on the plant dynamics [3] and [6]. Neither the inter-arrival time of controller failures nor the recovery time is always exponentially distributed, and the failure rate is substantially affected by the holding time in the state of controller failure(s). Note that in more general systems the failure rate also depends on the “global time” (the total operating time of the system). Some previous work also considered deadline/timing information in reliability modeling. In Ref. [7], a Markov model that describes component-failure behavior and also incorporates deadline violations as simple transitions was used to measure system reliability; it derives only the probability of missing a deadline and needs a further computation in a separate ‘lower-level’ model. The authors of Ref. [8] considered non-failure-critical cases, where some system down time can be tolerated if the system recovers within a certain deadline. They derived the mean values of the system lifetime and the cumulative operational time for the case of bounded repair time (restricted by the deadline). However, although the mean value is easy to compute, it is difficult to derive the distribution from the Laplace–Stieltjes transform of the system lifetime, so deriving system reliability from these results is intractable. Moreover, none of these works considered general cases in which the time to failure and/or the time to repair is not exponentially distributed.
Although these general cases have been modeled by a time-non-homogeneous Markov chain [9], a semi-Markov process [10], or a Markov regenerative process [11], none of these models dealt with the case in which the failure rate depends on the total operating time of the system. These general models can be evaluated by the Monte Carlo method, but because that method is a statistical estimation through numerical simulation, it is computationally very expensive. To overcome these obstacles, we consider a path-space approach, which has been treated in queuing theory [12] and has also proven useful in solving other reliability modeling problems [13]. Our goal is to derive tight upper and lower bounds for the probability of a system failure in terms of two simple parameters: (i) the probability of k (k>0) interruptions during the operating period T, which can be estimated from field data or from an analytic model such as the one in our previous work [2] evaluating the susceptibility of controller computers to EMI-induced upsets, and (ii) the probability of successful recovery (before the hard deadline) given an interruption. For the first parameter, computing the probability of k events during a time period is straightforward, and there are analytical formulas for some of the more popular probability distributions [14]. For the second parameter, using the probability of successful recovery has three advantages: (i) it is mathematically more tractable than the density function for the recovery time required by the Chapman–Kolmogorov equations; (ii) it is experimentally and statistically less demanding to obtain the binomial parameters of failures than to fit a curve to a density function; and (iii) it permits model reduction, because a multi-state recovery model collapses to a single state with jump probabilities to successful or unsuccessful recovery.
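As a rough illustration of how the two parameters can combine, the following Python sketch computes a failure probability under assumptions that are not taken from the paper: interruptions form a Poisson process with a constant rate, recoveries are independent with the same success probability, and down time is negligible compared with T. It is not the paper's upper or lower bound, only a simplified instance of "probability of k interruptions" times "probability that at least one of k recoveries misses the deadline". The names `poisson_pmf`, `failure_probability`, `lam`, `T`, and `p` and all numeric values are hypothetical.

```python
import math

def poisson_pmf(k, mean):
    """P(exactly k interruptions) for a Poisson count with mean > 0.

    Evaluated in log space so that large k cannot overflow.
    """
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def failure_probability(count_pmf, p_recover, k_max=200):
    """Sum over k of P(k interruptions) * P(at least one of k recoveries fails).

    Assumes independent recoveries, each succeeding with probability p_recover.
    """
    return sum(count_pmf(k) * (1.0 - p_recover ** k) for k in range(1, k_max + 1))

# Hypothetical values, chosen only for illustration.
lam = 1e-3   # interruptions per hour (assumed constant rate)
T = 10.0     # operating period in hours
p = 0.999    # probability that a recovery finishes before the hard deadline

approx = failure_probability(lambda k: poisson_pmf(k, lam * T), p)

# For Poisson counts the series has a closed form, 1 - exp(-lam*T*(1-p)),
# which serves as a consistency check on the truncated sum.
closed_form = -math.expm1(-lam * T * (1.0 - p))

print(approx, closed_form)
```

Passing the count distribution as a function keeps the second parameter separate from the first, so a different model for the number of interruptions can be substituted without touching the recovery term.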
Despite all of these simplifications, examples show that this approach yields tight bounds for a wide variety of models. It is especially suitable for the stiff models of highly reliable systems. The recovery/repair procedure begins at the start of an interruption. There can be a time lag between the occurrence of an interruption and the beginning of the actual repair, but this lag is included in the repair-time distribution. Hence, in the models, recovery begins when the system enters a “down” state, often called the recovery/repair state. The probability distribution of the recovery time is also fixed for a given model. That is, recovery is assumed to be either an automated procedure or performed by a repair crew that becomes neither more proficient nor fatigued. These properties of the recovery/repair procedure imply that the time to recovery depends only on the time since entering the recovery/repair state. Hence, recovery/repair is captured by a semi-Markov process in all the models described below, even if the distribution of system malfunctions depends on the global time.
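The recovery model described above can be sketched as a small Monte Carlo simulation in Python. This is the brute-force alternative that the introduction calls expensive, not the path-space method itself, and every modeling choice here is an assumption made for illustration: exponential up times, Weibull recovery times whose clock starts on entering the down state, a deterministic hard deadline, and a failure counted only if the deadline expires before the end of the operating period. The function names and parameter values are hypothetical.

```python
import random

def simulate_once(rng, T, lam, shape, scale, deadline):
    """Simulate one operating period; return True if the system fails.

    Up times are exponential with rate lam (assumed). Recovery times are
    Weibull(scale, shape) and depend only on the time since entering the
    down state, as in a semi-Markov model.
    """
    t = 0.0
    while True:
        t += rng.expovariate(lam)              # time until the next interruption
        if t >= T:
            return False                       # period ended with no fatal miss
        recovery = rng.weibullvariate(scale, shape)
        if recovery > deadline:
            # The hard deadline expires at t + deadline; count it only
            # if that instant falls inside the operating period.
            return t + deadline < T
        t += recovery                          # successful recovery, back "up"

def estimate_failure_probability(n, T, lam, shape, scale, deadline, seed=0):
    """Fraction of n simulated periods that end in a system failure."""
    rng = random.Random(seed)
    failures = sum(simulate_once(rng, T, lam, shape, scale, deadline) for _ in range(n))
    return failures / n

if __name__ == "__main__":
    # Hypothetical parameters: frequent interruptions, fast recoveries.
    print(estimate_failure_probability(
        n=100_000, T=10.0, lam=1.0, shape=1.5, scale=0.05, deadline=0.2))
```

For a highly reliable system the failure event is rare, so the number of runs needed for a stable estimate grows quickly, which is the cost that motivates analytical bounds.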

Conclusion

This paper proposed a path-space approach to modeling the reliability of a hard real-time system with embedded deadline information. The path-space approach, combined with the repetitive nature of the semi-Markov model, yields convenient formulas and straightforward computational techniques. The main results are upper and lower bounds for the probability of a system failure that use only simple parameters and handle complicated models through a simple canonical form for analysis. An important feature of the path-space approach is that it can be extended to handle global-time-dependent failure distributions, which are beyond the reach of semi-Markov models and the associated Chapman–Kolmogorov equations. We considered a spectrum of repair strategies: repair-as-good-as-new, repair-as-good-as-old, and the general repair-better-than-old, with both deterministic and random hard deadlines. A variety of field examples are presented to demonstrate the effectiveness of the path-space approach and the derivation procedure. Because it relies only on simple parameters and makes semi-Markov models easy to reduce, this approach is suitable for complex models involving timing information such as hard deadlines.