CISUC

Experimental evaluation of the fail-silent behavior in computers without error masking

Authors

Abstract

Traditionally, fail-silent computers are implemented using
massive redundancy (hardware or software). In this
research we investigate whether it is possible to obtain a high
degree of fail-silent behavior from a computer without
hardware or software replication by using only simple
behavior-based error detection techniques. It is assumed
that if the errors caused by a fault are detected in time, it
will be possible to stop the erroneous computer behavior,
thus preventing the violation of the fail-silent model. The
evaluation technique used in this research is physical fault
injection at the pin level. Results obtained by the injection
of about 20000 different faults in two different target
systems have shown that 1) in a system without error
detection, up to 46% of the faults caused the violation of
the fail-silent model; 2) in a computer with behavior-based
error detection, the percentage of faults that caused
the violation of the fail-silent model was reduced to values
between 0.4% and 2.3%; 3) the results are very dependent on
the target system, on the program under execution during
the fault injection, and on the type of faults.

Subject

Fault Injection

Conference

24th Fault Tolerant Computing Symposium FTCS-24, June 1994


Cited by

Year 2007 : 2 citations

 Khanna, G.; Mike Yu Cheng; Varadharajan, P.; Bagchi, S.; Correia, M.P.; Verissimo, P.J., "Automated Rule-Based Diagnosis Through a Distributed Monitor System," Dependable and Secure Computing, IEEE Transactions on , vol.4, no.4, pp.266-279, Oct.-Dec. 2007

 Raul Barbosa, "Multi-Layer Fault Tolerance for Distributed Real-Time Systems", Thesis for the Degree of Licentiate of Engineering, Göteborg, Sweden, 2007

Year 2006 : 2 citations

 Basile, C.; Kalbarczyk, Z.; Iyer, R.K., "Active replication of multithreaded applications," Parallel and Distributed Systems, IEEE Transactions on , vol.17, no.5, pp. 448-465, May 2006

 Roshan G. Ragel, "Architectural Support for Security and Reliability in Embedded Processors", PhD thesis, The University of New South Wales, Australia, 2006

Year 2005 : 1 citation

 1. R. Barbosa, J. Vinter, P. Folkesson, J. Karlsson, "Assembly-Level Pre-Injection Analysis for Improving Fault Injection Efficiency," to appear in Proc. Fifth European Dependable Computing Conference (EDCC-5), (Budapest, Hungary) April 2005

Year 2004 : 1 citation

 1. R. Barbosa, J. Vinter, P. Folkesson and J. Karlsson, "Fault Injection Optimization through Assembly-Level Pre-Injection Analysis", Tech. Report No. 04-07, Dept. of Comp. Eng., Chalmers University of Technology, Göteborg, Sweden, 2004.

Year 2003 : 5 citations

 1. Claudio Basile, Z. Kalbarczyk, R. Iyer, "A preemptive deterministic scheduling algorithm for multithreaded replicas", IEEE/IFIP International Conference on Dependable Systems and Networks, Dependable Computing and Communications, DSN-DCC 2003, San Francisco, CA, USA, pp. 149-158, June 22-25, 2003.

 2. R. Iyer, "Design of Reliable Systems and Networks", ECE 442/CS 436, Course Papers, Spring 2003, Center for Reliable and High-Performance Computing, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1308 W. Main Street, Urbana, IL 61801. (http://courses.ece.uiuc.edu/ece442/papers.htm)

 3. Kaufman L. M., Salinas M. H., Williams R. D., Giras T. C., "Integrate hardware/software device testing for use in a safety-critical application", Annual Reliability and Maintainability Symposium, pp. 132-137 2003.

 4. S. Bagchi, Z. Kalbarczyk, R. Iyer, Y. Levendel, "Design and Evaluation of Preemptive Control Signature (PECOS) Checking for Distributed Applications", in IEEE Transactions on Computers, 2003. (Science Citation Index Expanded).

 5. Claudio Basile, Long Wang, Zbigniew Kalbarczyk, Ravi Iyer, "Group Communication Protocols under Errors", Proceedings of the 22nd IEEE International Symposium on Reliable Distributed Systems (SRDS'03), 2003.

Year 2002 : 2 citations

 1. Ravishankar K. Iyer, Zbigniew Kalbarczyk, "Hardware and Software Error Detection", Book Chapter (44 pages), used in classes at the Center for Reliable and High-Performance Computing, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1308 W. Main Street, Urbana, IL 61801, 2002.

 2. Steininger A, Scherrer C "Identifying efficient combinations of error detection mechanisms based on results of fault injection experiments" IEEE T COMPUT 51 (2): 235-239 FEB 2002

Year 2001 : 4 citations

 1. Pascal Chevochot, Isabelle Puaut "Experimental evaluation of the fail-silent behavior of a distributed real-time run-time support built from COTS components" Proceedings of the 2001 International Conference on Dependable Systems and Networks, 1-4 July 2001, Göteborg, Sweden, IEEE Computer Society, ISBN 0-7695-1101-5, pp. 304-313.

 2. Paul L. Springer "Assessing Application Vulnerability to Radiation Induced DEUs in Memory" Supplement of the 2001 International Conference on Dependable Systems and Networks, 1-4 July 2001, Göteborg, Sweden, IEEE Computer Society, pp B-98,B-99.

 3. David T. Stott, Neil A. Speirs, Zbigniew Kalbarczyk, Saurabh Bagchi, Jun Xu, Ravishankar K. Iyer "Comparing Fail-Silence Provided by Process Duplication versus Internal Error Detection for DHCP Server" in Proc. of Intl' Parallel and Distributed Processing Symp. (IPDPS), San Francisco, CA, April. 2001.

 4. I-Ling Yen, Farokh B. Bastani, David J. Taylor, "Design of Multi-Invariant Data Structures for Robust Shared Accesses in Multiprocessor Systems" IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 27, NO. 3, MARCH 2001, pp. 193-207.

Year 2000 : 4 citations

 1. Yutao He, Algirdas Avizienis "Assessment of the Applicability of COTS Microprocessors in High-Confidence Computing Systems: A Case Study" Proc. of DSN-2000 - The International Conference on Dependable Systems and Networks (FTCS-30, DCCA-8), 25-28 June 2000, New York, USA, IEEE Computer Society Press, ISBN 0-7695-0707-7, pp. 81-86.

 2. Pascal Chevochot, Isabelle Puaut, Gilbert Cabillic, Antoine Colin, David Decotigny, Michel Banatre "HADES: A distributed System for Dependable Hard Real-Time Applications Built from COTS Components", IRISA, RENNES, FRANCE, Tech Rep 1357, 2000, ISSN 1166-8687.

 3. Subhachandra Chandra "An Evaluation of the Recovery-Related Properties of Software Faults" PhD Thesis, University of Michigan, 2000.

 4. P. Chevochot and I. Puaut. Experimental evaluation of the fail-silent behavior of a distributed real-time run-time support built from COTS components. Technical Report 1370, IRISA, Oct. 2000

Year 1999 : 5 citations

 1. I. L. Yen, I. Ahmed, R. Jagannath, S. Kundu, "The design and implementation of a customizable fault tolerance framework", International Journal of Software Engineering and Knowledge Engineering, vol. 9, no. 2, April 1999, pp. 181-202, World Scientific Publisher, ISSN 0218-1940.

 2. Lettner R, Prammer M, Scherrer C, Steininger A, "Assessment of computer fault tolerance - a fault-injection toolset and the rationale behind it", Computer Standards & Interfaces, Elsevier Science BV, Amsterdam, vol. 21, no. 4, September 1999, pp. 357-369, ISSN: 0920-5489.

 3. Seungjae Han, Kang G. Shin "Experimental Evaluation of Behavior-Based Failure-Detection Schemes in Real-Time Communication Networks" IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, Vol. 10, No. 6; JUNE 1999, pp. 613-626.

 4. Carlos Pérez, Germán Fabergat, R. J. Martínez and G. Martín, "Incremental Messages: Micro-Kernel Services for Flexible and Efficient Management of Replicated Data", Proceedings of the 29th Annual International Symposium on Fault-Tolerant Computing Symposium, FTCS-29, 15-18 June 1999, Madison, Wisconsin, USA, IEEE Computer Society Press, pp 56-63, ISBN 0-7695-0213-X.

 5. R. J. Martínez, P. J. Gil, G. Martín, C. Pérez, J.J. Serrano. "Experimental Validation of High-Speed Fault-Tolerant Systems Using Physical Fault Injection". Dependable Computing and Fault Tolerant Systems. Volume 12. A. Avizienis, H. Kopetz, J.C. Laprie Eds. IEEE Computer Society. ISBN 0-7695-0284-9, pp. 249-264, 1999

Year 1998 : 3 citations

 1. Christopher Temple "Avoiding the Babbling-Idiot Failure in a Time-Triggered Communication System" Proceedings of the 28th Annual International Symposium on Fault-Tolerant Computing Symposium, FTCS-28, June 1998, Munich, Germany, IEEE Computer Society Press, pp 218-227, ISBN 0-8186-8470-4.

 2. I-L. Yen, I. Ahmed, R. Jagannath, and S. Kundu, "Implementation of a Customizable Fault Tolerance Framework" in The First IEEE International Symposium on Object-Oriented Real-Time Distributed Computing, Kyoto, Japan, April, 1998.

 3. A. Benso, M. Rebaudengo, L. Impagliazzo, P. Marmo, "Fault-List Collapsing for Fault Injection Experiments", RAMS98: Annual Reliability and Maintainability Symposium, Anaheim, CA (USA), pp. 383-388, January 1998.

Year 1997 : 5 citations

 1. A. Bondavalli, S. Chiaradonna, F. Di Giandomenico and S. La Torre "Modelling the Effects of Input Correlation in Iterative Software," in Reliability Engineering and System Safety Journal, Elsevier Science, 1997, number 57, pp. 189-202.

 2. A. Steininger, C. Scherrer "On Finding an Optimal Combination of Error Detection Mechanisms Based on Results of Fault Injection Experiments" Proceedings of the 27th Annual International Symposium on Fault-Tolerant Computing Symposium, FTCS-27, 24-27 June 1997, Seattle, Washington, USA, IEEE Computer Society Press, pp 238-247, ISBN 0-8186-7831-3.

 3. Uwe Wildner "CASC - Compiler Assisted Self-Checking of Structural Integrity" PhD thesis, Institut für Informatik, Universität Potsdam, Germany, October 1997.

 4. S. Han, K. G. Shin "Experimental Evaluation of Failure Detection Schemes in Real-Time Communication Networks" Proceedings of the 27th Annual International Symposium on Fault-Tolerant Computing Symposium, FTCS-27, 24-27 June 1997, Seattle, Washington, USA, IEEE Computer Society Press, pp 122-131, ISBN 0-8186-7831-3.

 5. I.-L. Yen, "An Object-Oriented Fault Tolerance Framework Based on Specialization Techniques," Proc. Third IEEE Computer Soc. Workshop Object-Oriented Real-Time Dependable Systems, pp. 291-297, Newport Beach, Calif., Feb. 1997.

Year 1996 : 4 citations

 1. Lovric T., "Detecting hardware faults with systematic and design diversity: Experimental results", Computer Systems Science and Engineering, vol. 11, no. 2, March 1996, pp. 83-92, ISSN: 0267-6192.

 2. Emmerich Fuchs "An Evaluation of the Error Detection Mechanisms in MARS Using Software-Implemented Fault Injection" Proc. Second European Dependable Computing Conference, Taormina, Italy, October 1996, Lecture Notes in Computer Science 1150, Springer Verlag, pp. 73-90, ISBN 3-540-61772-8.

 3. Jürgen Bohne, Reny Grönberg "Adaptable Fault Tolerance for Distributed Process Control Using Exclusively Standard Components" Proc. Second European Dependable Computing Conference, Taormina, Italy, October 1996, Lecture Notes in Computer Science 1150, Springer Verlag, pp. 21-34, ISBN 3-540-61772-8.

 4. Benso A, Corno F, Prinetto P, Rebaudengo M, Reorda MS, "Role of fault injection techniques in system dependability analysis", AEI Automazione Energia Informazione, UTET Periodici Scientifici, Milano vol. 83, no. 10, October 1996, pp. 63-69, ISSN: 0013-6131.

Year 1995 : 3 citations

 1. Jens Guethoff, Volkmar Sieh "Combining Software-Implemented and Simulation-Based Fault-Injection into a Single Fault Injection Method" Proceedings of the 25th International Symposium on Fault-Tolerant Computing (FTCS-25), California, USA, June 1995, pp. 196-206, IEEE Computer Society.

 2. Rudi Cuyvers, "User-Adaptable Fault Tolerance for Message Passing Multiprocessors", PhD thesis, Katholieke Universiteit Leuven, Belgium, May 1995.

 3. Carlos Pérez Conde, "Aportaciones a los Entornos de Desarrollo de Aplicaciones Tolerantes a Fallos", PhD thesis, Department of Informatics and Electronics, University of Valencia, Spain, 1995.