Article title

Analysis of fault tolerance and reliability in distributed real-time system architectures

Article code	Publication year	Length of English article
7183	2003	12 pages (PDF)

Source

Publisher : Elsevier - ScienceDirect

Journal : Reliability Engineering & System Safety, Volume 82, Issue 2, November 2003, Pages 195–206

Keywords

- System modeling - Simulation - Fault injection - Petri-Nets - Information horizon

Abstract

Safety-critical real-time systems are becoming ubiquitous in many areas of everyday life. Failures of such systems can have catastrophic consequences on different scales, in the worst case even the loss of human life. Safety-critical systems therefore have to meet the highest fault tolerance and reliability requirements. Because designing such systems is far from trivial, this article focuses on concepts that specifically support early architectural design. In detail, it presents a simulation-based approach for the analysis of fault tolerance and reliability in distributed real-time system architectures. With this approach, safety-related features can be evaluated in the early development stages, which helps prevent costly redesigns in later ones.

Introduction

Distributed real-time systems are becoming increasingly common in areas that can be regarded as critical for various reasons. Systems from sectors as diverse as aviation, process control, telecommunications, electronic commerce and others have in common that not only functional failures but also timing failures may have severe monetary and environmental impacts, in the worst case even including the loss of human life. Clearly, such systems have to be designed to be fault tolerant, i.e. potentially catastrophic consequences of faults have to be prevented by design. In this context, building redundancy into systems has a long tradition as a means to achieve fault tolerance and thus reliability. The basic idea of redundant system design is that specific components have redundant counterparts. In case of a fault, such a design ensures the correct overall timing and functional behaviour of the system.
A non-trivial task in this context is to develop a system architecture that meets specific safety requirements. It is commonly agreed that fault tolerance measures should be considered as early as possible in the course of system development. Otherwise, integrating redundancy in later development stages is likely to be more expensive and, even worse, less effective than measures introduced earlier [1] and [2]. The reason is that (partially) existing systems usually impose tight restrictions on the introduction of redundancy into the overall architecture in later development stages. In earlier development stages, on the other hand, it is a non-trivial task to design and comparatively evaluate alternative architectures for safety-critical distributed real-time systems. Since potential problems and their consequences within a complex system cannot easily be predicted at early development stages, it is particularly difficult to decide which components of a system need redundant counterparts and which do not.
The cause of this situation is that the increasing complexity of computer-controlled systems and their dependability requirements have exposed the limits of the validation techniques traditionally used for safety and reliability analysis, such as fault trees or Failure Mode and Effects Analysis [3]. Furthermore, there is barely any tool support for the evaluation of architectures in early design stages. Both lead to the following observation in Ref. [4]: “While manufacturers of [safety] critical systems are prepared and have a considerable experience on the validation of their products, there are major difficulties and much less experience in the early validation of system design.” To improve this situation, the remainder of this article addresses the high-level analysis of real-time system architectures with respect to fault tolerance and reliability in the very early design stages. Even though functional aspects cannot be taken into account in early development stages, simulation models suited for evaluation purposes can nevertheless be built. Such models are based on the distributed topology of a system and make it possible to examine its architecture with respect to the consequences of single and multiple failures.
The article is structured as follows. Section 2 introduces the concept of information horizon as a basis for the analysis of fault tolerance and reliability in distributed real-time system architectures. In Section 3, a brake-by-wire system from the automotive industry is described, modeled and finally analyzed with a fault-injection-enabled simulation model. Section 4 concludes the presentation and describes further perspectives of the presented approach.

Conclusion

In this article, an approach was presented for the analysis of distributed real-time system architectures with respect to fault tolerance and reliability. To measure the impact of different kinds of system failures on the components of a distributed real-time system, the concept of information horizon was introduced. The information horizon of a component indicates to what extent the data needed locally for optimal decision making is available. In combination with a fault-injection-enabled simulation, this concept makes it possible to evaluate architectures of distributed real-time systems in early design stages. As functional aspects are not taken into account, simulation results are upper bounds for fault tolerance and reliability. These bounds are not necessarily reached by actual implementations, and implementations cannot exceed them without architectural modifications. The applicability of the approach was illustrated with an example from the automotive industry. The usefulness of the presented approach increases with the complexity of the architectures to be evaluated. In industrial systems, several tens to hundreds of distributed nodes and hundreds to thousands of different message types are not uncommon. Reaching an in-depth understanding of the consequences of (partial) failures of components and communication systems in such a scenario is very demanding. The presented approach for the fault-injection-enabled simulation of such systems therefore helps to evaluate the architecture of distributed real-time systems at a high level of abstraction in early stages of the development process. In these stages, redesigning a system is significantly less costly than making architectural changes to already existing prototypes. Consequently, the introduced measure for fault tolerance and reliability can reduce the overall development costs of safety-critical systems.
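The information horizon idea summarized above can be illustrated with a small sketch. The article's formal definition is given in its Section 2, which is not reproduced here, so the measure below is an assumption: the information horizon of a node is taken to be the fraction of its needed data items that still reach it after faults are injected. The `Node` type, the bus-based topology and all node and bus names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    produces: frozenset  # data items this node publishes
    needs: frozenset     # data items it needs for optimal local decisions
    buses: frozenset     # buses the node is attached to


def information_horizon(node, nodes, failed=frozenset()):
    """Fraction of the data items `node` needs that still reach it.

    `failed` holds the names of failed nodes and buses (the injected faults).
    This ratio is an assumed stand-in for the article's measure.
    """
    if node.name in failed:
        return 0.0
    if not node.needs:
        return 1.0
    live_buses = node.buses - failed
    available = set(node.produces)  # locally produced data is always at hand
    for other in nodes:
        if other.name == node.name or other.name in failed:
            continue
        # Data arrives if producer and consumer share at least one intact bus.
        if (other.buses - failed) & live_buses:
            available |= other.produces
    return len(node.needs & available) / len(node.needs)


# Hypothetical toy topology: redundant pedal sensors on separate buses.
NODES = [
    Node("pedal_a", frozenset({"pedal"}), frozenset(), frozenset({"bus1"})),
    Node("pedal_b", frozenset({"pedal"}), frozenset(), frozenset({"bus2"})),
    Node("speed", frozenset({"speed"}), frozenset(), frozenset({"bus1"})),
    Node("wheel_fl", frozenset(), frozenset({"pedal", "speed"}),
         frozenset({"bus1", "bus2"})),
]
WHEEL = NODES[3]

for faults in (frozenset(), frozenset({"pedal_a"}), frozenset({"bus1"}),
               frozenset({"bus1", "bus2"})):
    print(sorted(faults), information_horizon(WHEEL, NODES, faults))
```

In this toy topology the wheel node keeps a full horizon when one pedal sensor fails, drops to one half when `bus1` fails (the speed value has no redundant path), and drops to zero when both buses fail. Because only the topology is modeled, such values are upper bounds in the sense described above.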
Safety analysis methods are traditionally classified as top-down, e.g. fault trees [13], or bottom-up, e.g. FMEA [14]. In both classes, either the cause or the effect of a failure is known and is used as the starting point for further analysis. A more fine-grained classification of safety analysis methods is given in Ref. [15], which argues that approaches are also needed for exploratory safety analysis when neither the cause nor the effect of failures is known. As this is especially the case in the very early stages of architectural design, the approach presented in this article fits exactly into this category and thus complements the more traditional methods mentioned above. Other related work on the analysis of real-time systems is generally focused on lower levels, e.g. the analysis of worst-case execution times [16] and schedulability [17] and [18]. Simulation-based fault injection is also a prominent area of research with a plethora of related work. Classifications of methods and overviews of the state of research in fault injection and fault-tolerant systems are given in Refs. [2], [19] and [20]. Generally, fault injection techniques can be categorized into hardware-based, software-based and simulation-based approaches. The former two require at least partially working prototypes of a system and are thus targeted at later development stages. The majority of simulation-based approaches use VHDL as the underlying system description language [21]. The reason is the availability of simulation tools for VHDL models with built-in fault injection capabilities, such as VERIFY [22] and MEFISTO-C [23]. A hybrid approach for the development of distributed real-time systems, which combines Petri-Nets for behavioral modeling and VHDL for fault simulation, is described in Ref. [1].
VHDL-based approaches to fault injection generally have in common that they focus on more detailed models of a system, in which faults are injected e.g. at the pin/signal or memory level. The approach described above abstracts from such implementation details and therefore complements more traditional methods in earlier design stages. A related problem of VHDL-based fault injection methods, which might be overcome with the presented approach, is that considerable effort is needed to set up VHDL-based fault injection experiments (see e.g. Ref. [21] for a more detailed discussion). The approach presented in this article is a ‘proof of concept’ which illustrates that the concept of information horizon is applicable in principle to the high-level analysis of real-time system architectures in early design stages. For the experimental validation of the underlying ideas, the presented Petri-Net models were built manually and simulated with the open source tool RENEW. To make the approach more accessible to developers who are not familiar with formal methods in general and Petri-Nets in particular, more sophisticated tool support will be developed in the future. Developers will then be able to design real-time system architectures graphically as connected components without having to know about the underlying formalism. With the patterns introduced above for representing components and the communication systems between them as connected Petri-Net modules, the formal model can be derived automatically from the graphical user input. This tool will also support interactive fault injection, allowing the designer to observe the consequences within the simulation. A useful extension in this context is to complement the interactive fault injection with a feature for fault forecasting, i.e. the automatic derivation of failure combinations for a given model [24].
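As a rough illustration of what deriving failure combinations for a model could look like, the sketch below enumerates minimal sets of failed elements that break a system. It uses plain brute-force enumeration, which is an assumption made here for illustration and not the method of Ref. [24]. The toy architecture, the `system_ok` predicate and all element names are hypothetical.

```python
from itertools import combinations


def minimal_failure_combinations(elements, system_ok, max_size=2):
    """Enumerate minimal sets of failed elements (up to max_size) that
    make `system_ok` return False."""
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(elements), size):
            failed = frozenset(combo)
            # Skip supersets of combinations already known to be critical.
            if any(prev <= failed for prev in found):
                continue
            if not system_ok(failed):
                found.append(failed)
    return found


# Hypothetical toy architecture: pedal_a sends on bus1, pedal_b on bus2.
ELEMENTS = {"pedal_a", "pedal_b", "bus1", "bus2"}


def system_ok(failed):
    # One intact sensor/bus pair is assumed to be sufficient.
    path_a = not ({"pedal_a", "bus1"} & failed)
    path_b = not ({"pedal_b", "bus2"} & failed)
    return path_a or path_b


CRITICAL = minimal_failure_combinations(ELEMENTS, system_ok)
for combo in CRITICAL:
    print(sorted(combo))
```

Here no single failure is critical, while four double failures are. The enumeration grows combinatorially with the number of elements, so for architectures with hundreds of nodes a bound such as `max_size` or a smarter search would be needed.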
If every activity in a model is either instantaneous, deterministic or exponentially distributed, the result is a so-called deterministic stochastic Petri-Net (DSPN). As DSPNs can be analyzed by existing tools, transforming a model into the input format of e.g. TimeNet [25] would be valuable. A tool combining all these features would ultimately support a light-weight process [26] for the formally based analysis of real-time system architectures in early design stages. The method supported by such a tool is intended to complement existing frameworks for the development of safety-critical systems. Due to its modular approach, it is especially useful in the area of component-based development processes [27] and [28].
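The timing condition for DSPNs stated above can be sketched as a simple check over the transitions of a model. The `Transition` type, the kind labels and the example transitions are hypothetical, and analysis tools may impose further restrictions that this sketch does not check.

```python
import random
from dataclasses import dataclass

# The three timing classes named in the text.
DSPN_KINDS = {"immediate", "deterministic", "exponential"}


@dataclass(frozen=True)
class Transition:
    name: str
    kind: str
    param: float = 0.0  # fixed delay (deterministic) or rate (exponential)


def is_dspn(transitions):
    """True if every transition falls into one of the three timing classes."""
    return all(t.kind in DSPN_KINDS for t in transitions)


def sample_delay(transition, rng):
    """Draw one firing delay for a transition during simulation."""
    if transition.kind == "immediate":
        return 0.0
    if transition.kind == "deterministic":
        return transition.param
    if transition.kind == "exponential":
        return rng.expovariate(transition.param)
    raise ValueError(f"unsupported timing kind: {transition.kind}")


# Hypothetical model fragment: a fixed bus cycle and a random node failure.
MODEL = [
    Transition("route_message", "immediate"),
    Transition("bus_cycle", "deterministic", 5.0),
    Transition("node_failure", "exponential", 0.001),
]

rng = random.Random(42)
print(is_dspn(MODEL))
print([sample_delay(t, rng) for t in MODEL])
```

A model containing, for example, a uniformly distributed delay would fail the check and could not be handed to a DSPN analysis tool without approximation.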