In the research reported in this paper, transient faults were injected, by software fault injection, into the nodes and the communication subsystem of a commercial parallel machine running several real applications. The results showed that a significant percentage of faults caused the system to produce wrong results while the application seemed to terminate normally, thus demonstrating that fault tolerance techniques are required in parallel systems, not only to ensure that long-running applications can terminate but also, and more importantly, that the results produced are correct. Of the techniques tested to reduce the percentage of undetected wrong results, only ABFT (algorithm-based fault tolerance) proved to be effective. For other simple error detection methods to be effective, they have to be designed in, not added as an afterthought. Faults injected in the communication subsystem demonstrated the effectiveness of end-to-end CRCs on the data movements between processors.
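As an illustration of the ABFT idea mentioned in the abstract, the following is a minimal sketch of the classic checksum-matrix scheme for matrix multiplication. It is not taken from the paper: the function names, the 2x2 example data, and the simulated bit flip are assumptions made for illustration, and the applications and ABFT variants actually evaluated may differ.

```python
# Hypothetical ABFT sketch: checksum-encoded matrix multiplication.
# All names and values here are illustrative assumptions, not the paper's code.

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]

def checksums_hold(c):
    """True if the last column/row of c equal the row/column sums of the data part."""
    data = [row[:-1] for row in c[:-1]]
    rows_ok = all(sum(row) == c[i][-1] for i, row in enumerate(data))
    cols_ok = all(sum(col) == c[-1][j] for j, col in enumerate(zip(*data)))
    return rows_ok and cols_ok

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# The product of the two encoded matrices carries its own checksums.
c = matmul(add_column_checksum(a), add_row_checksum(b))
clean_ok = checksums_hold(c)

# Simulate a transient fault: flip one bit of one result element.
c[0][1] ^= 1 << 4
faulty_ok = checksums_hold(c)
```

With integer data the checksum comparison can be exact; with floating-point data a tolerance would be needed to absorb rounding error. The check detects a corrupted result element even though the computation itself completes normally, which is the failure mode the abstract highlights.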
Subject
Parallel and Distributed Computing
Conference
Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS-26), June 1996
Cited by
Year 2004 : 3 citations
1. Charng-da Lu, Daniel A. Reed, "Assessing Fault Sensitivity in MPI Applications", Proceedings of the ACM/IEEE SC2004 Conference (SC'04), Pittsburgh, PA, USA, 2004.
2. Gong Su, "MOVE: Mobility with Persistent Network Connections", PhD thesis, Columbia University, USA, 2004.
3. Charng-da Lu; Reed, D.A., "Assessing Fault Sensitivity in MPI Applications", Proceedings of the ACM/IEEE SC2004 Supercomputing Conference, (High Performance Computing, Networking and Storage Conference) 2004 Page(s): 37-37.
Year 2003 : 2 citations
1. Constantinescu C., "Experimental evaluation of error-detection mechanisms", IEEE Transactions on Reliability, 52 (1): pp. 53-57 March 2003.
2. A. Benso, S. Di Carlo, G. Di Natale, P. Prinetto, L. Tagliaferri, "Data Criticality Estimation In Software Applications", IEEE ITC International Test Conference, 2003.
Year 2001 : 1 citation
1. Wee Teck Ng and Peter M. Chen "The Design and Verification of the Rio File Cache" IEEE TRANSACTIONS ON COMPUTERS, VOL. 50, NO. 4, APRIL 2001.
Year 2000 : 2 citations
1. Cristian Constantinescu, "Teraflops supercomputer: Architecture and validation of the fault tolerance mechanisms", IEEE Transactions on Computers, vol. 49, no. 9, September 2000, pp. 886-894.
2. A. Benso, S. Chiusano, P.Prinetto, L. Tagliaferri "A C/C++ Source-to-Source Compiler for Dependable Applications" Proc. of DSN-2000 - The International Conference on Dependable Systems and Networks (FTCS-30, DCCA-8), 25-28 June 2000, New York, USA, IEEE Computer Society Press, ISBN 0-7695-0707-7, pp. 71-78.
Year 1999 : 7 citations
1. Constantinescu, C.; "Using physical and simulated fault injection to evaluate error detection mechanisms" Pacific Rim International Symposium on Dependable Computing, Proceedings. 1999, 16-17 Dec. 1999, Pages: 186-192, IEEE Computer Society.
2. Seungjae Han, Kang G. Shin "Experimental Evaluation of Behavior-Based Failure-Detection Schemes in Real-Time Communication Networks" IEEE Transactions On Parallel And Distributed Systems, Vol. 10, No. 6; JUNE 1999, pp. 613-626, ISSN 1045-9219.
3. Lettner R, Prammer M, Scherrer C, Steininger A, "Assessment of computer fault tolerance - a fault-injection toolset and the rationale behind it", Computer Standards & Interfaces, Elsevier Science BV, Amsterdam, vol. 21, no. 4, September 1999, pp. 357-369, ISSN: 0920-5489.
4. Wee Teck Ng and Peter Chen, "The Systematic Improvement of Fault Tolerance in the Rio File Cache", Proceedings of the 29th Annual International Symposium on Fault-Tolerant Computing Symposium, FTCS-29, 15-18 June 1999, Madison, Wisconsin, USA, IEEE Computer Society Press, pp 76-83, ISBN 0-7695-0213-X.
5. Cristian Constantinescu, "Assessing Error Detection Coverage by Simulated Fault Injection", Third European Dependable Computing Conference, EDCC-3, Prague, Czech Republic, September 1999, Lecture Notes in Computer Science 1667, Springer-Verlag, pp. 161-170.
6. Alfredo Benso, Maurizio Rebaudengo, Matteo Sonza Reorda "Fault Injection for Embedded Microprocessor-based Systems", Journal of Universal Computer Science, Volume 5 / Issue 10, October 1999, Springer, pp. 693-711.
7. Wee Teck Ng, "Design and Implementation of Reliable Main Memory", PhD thesis, University of Michigan, USA, 1999.
Year 1998 : 5 citations
1. A. Benso, P. Prinetto, M. Rebaudengo and M. Sonza Reorda "EXFI: a low-cost fault injection system for embedded microprocessor-based boards" ACM Transactions on Design Automation of Electronic Systems, Volume 3, Issue 4 (1998) pp 626-634.
2. M. Kaaniche, L. Romano, Z. Kalbarczyk, R.K. Iyer and R. Karcich, "A Hierarchical Approach for Dependability Analysis of a Commercial Cached RAID Storage Architecture", Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, Technical Report UILU-ENG-98-2205. ARPA, February 1998.
3. N. Neves and W. K. Fuchs, "Coordinated Checkpointing without Direct Coordination," Proceedings of IEEE International Computer Performance & Dependability Symposium, pp. 23-31, Sept. 1998.
4. Cristian Constantinescu "Validation of the Fault/Error Handling Mechanisms of the Teraflop Computer" Proceedings of the 28th Annual International Symposium on Fault-Tolerant Computing Symposium, FTCS-28, June 1998, Munich, Germany, IEEE Computer Society Press, pp 382-389, ISBN 0-8186-8470-4.
5. M. Kaaniche, L. Romano, Z. Kalbarczyk, R. Iyer, R. Karcich "A Hierarchical Approach for Dependability Analysis of a Commercial Cache-Based RAID Storage Architecture" Proceedings of the 28th Annual International Symposium on Fault-Tolerant Computing Symposium, FTCS-28, June 1998, Munich, Germany, IEEE Computer Society Press, pp 6-15, ISBN 0-8186-8470-4.
Year 1997 : 7 citations
1. Uwe Wildner "CASC - Compiler Assisted Self-Checking of Structural Integrity" PhD thesis, Institut für Informatik, Universität Potsdam, Germany, October 1997.
2. D. T. Stott, M-C Hsueh, G. L. Ries, R. K. Iyer "Dependability Analysis of a Commercial High-Speed Network" Proceedings of the 27th Annual International Symposium on Fault-Tolerant Computing Symposium, FTCS-27, 24-27 June 1997, Seattle, Washington, USA, IEEE Computer Society Press, pp 248-257, ISBN 0-8186-7831-3.
3. D. Blough, T. Torii "Fault-Injection-Based Testing of Fault-Tolerant Algorithms in Message Passing Parallel Computers" Proceedings of the 27th Annual International Symposium on Fault-Tolerant Computing Symposium, FTCS-27, 24-27 June 1997, Seattle, Washington, USA, IEEE Computer Society Press, pp. 258-267, ISBN 0-8186-7831-3.
4. A. Steininger, C. Scherrer "On Finding an Optimal Combination of Error Detection Mechanisms Based on Results of Fault Injection Experiments" Proceedings of the 27th Annual International Symposium on Fault-Tolerant Computing Symposium, FTCS-27, 24-27 June 1997, Seattle, Washington, USA, IEEE Computer Society Press, pp 238-247, ISBN 0-8186-7831-3.
5. Uwe Wildner "Experimental Evaluation of Assigned Signature Checking With Return Address Hashing on Different Platforms" Proc. 6th IFIP Working Conference on Dependable Computing for Critical Applications (DCCA-6), Grainau, Germany, March 5-7, 1997, IEEE Computer Society Press, ISBN 0-8186-8009-1, pp. 3-18.
6. N. Neves and W. Kent Fuchs, "Fault detection using hints from the socket layer", Proceedings of the IEEE Symposium on Reliable Distributed Systems, October 1997, pp. 64-71.
7. John Chapin, "Hive: Operating System Fault Containment For Shared-Memory Multiprocessors", PhD thesis, Technical Report No. CSL-TR-97-712, Computer Systems Laboratory, Dept. of Electrical Engineering and Computer Science, Stanford University, January 1997.
Year 1996 : 2 citations
1. N. Neves and W. K. Fuchs, "Using time to improve the performance of coordinated checkpointing", Proceedings of the International Computer Performance & Dependability Symposium, September 1996, pp. 282-291.
2. David Powell, Michel Cukier, Jean Arlat and Yves Crouzet, "Estimation of Time-Dependent Coverage", LAAS-CNRS, 7 avenue du Colonel Roche - 31077 TOULOUSE Cedex 4, France - Research Report 96466, 4 December 1996.