CISUC

Algorithm Based Fault Tolerance versus Result-Checking for Matrix Computations

Authors

Abstract

Algorithm Based Fault Tolerance (ABFT) is the collective name for a set of techniques used to determine the correctness of some mathematical calculations. A less well-known alternative is Result Checking (RC), where, contrary to ABFT, results are checked without knowledge of the particular algorithm used to calculate them.
In this paper the two are compared using practical implementations of matrix computations. The criteria are performance overhead, memory overhead, ease of use, and error coverage. To measure error coverage, extensive error injection experiments were performed. To the best of our knowledge, this is the first time that RC has been validated by fault injection.
We conclude that Result Checking has the important advantage of being independent of the underlying algorithm. It also generally has lower performance overhead than ABFT, while the two techniques are essentially equivalent in terms of error coverage.
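
The distinction drawn in the abstract can be illustrated with matrix multiplication. The sketch below is not taken from the paper, whose abstract does not say which checkers or ABFT schemes were implemented. It assumes two standard textbook techniques: Freivalds' randomized check as an example of Result Checking, and the row/column checksum scheme usually attributed to Huang and Abraham as an example of ABFT. All function names are hypothetical, and integer matrices are used so that exact equality can stand in for the rounding tolerance a floating-point version would need.

```python
import random

def matmul(a, b):
    """Plain triple-loop product; stands in for any multiplication routine."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def matvec(a, x):
    """Matrix-vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def freivalds_check(a, b, c, rounds=20, rng=random):
    """Result Checking (assumed technique: Freivalds' check).

    Accepts c as the product a*b if a*(b*r) == c*r for several random
    0/1 vectors r. It only looks at a, b and c, never at how c was computed.
    """
    for _ in range(rounds):
        r = [rng.randint(0, 1) for _ in range(len(b[0]))]
        if matvec(a, matvec(b, r)) != matvec(c, r):
            return False
    return True

def abft_multiply(a, b):
    """ABFT (assumed technique: checksum encoding).

    Appends a column-sum row to a and a row-sum column to b, then multiplies
    the encoded operands, so the checksums flow through the algorithm itself.
    """
    a_cs = a + [[sum(col) for col in zip(*a)]]
    b_rs = [row + [sum(row)] for row in b]
    return matmul(a_cs, b_rs)

def abft_check(full):
    """Verify the row and column checksums of the encoded product."""
    n, p = len(full) - 1, len(full[0]) - 1
    rows_ok = all(sum(row[:p]) == row[p] for row in full)
    cols_ok = all(sum(full[i][j] for i in range(n)) == full[n][j]
                  for j in range(p + 1))
    return rows_ok and cols_ok
```

In this sketch `freivalds_check` can be applied to the output of any multiplication routine, whereas `abft_multiply` has to encode the operands before the computation starts, which is the algorithm dependence the abstract refers to. A wrong product passes a single Freivalds round with probability at most 1/2, so the number of rounds trades checking cost against coverage.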

Keywords

Result-checking, ABFT, Fault injection, Error Detection, Matrix operations.

Subject

Algorithm Based Fault Tolerance

Conference

FTCS-29, June 1999


Cited by

Year 2003 : 1 citation

 1. Örjan Askerdal "On Impact and Tolerance of Data Errors with Varied Duration in Microprocessors" PhD Thesis, Department of Computer Engineering, Chalmers University of Technology, Sweden, 2003, ISBN 91-7291-285-5.

Year 2001 : 4 citations

 Ahmad A. Al-Yamani, Nahmsuk Oh, Edward J. McCluskey "Performance Evaluation of Checksum-Based ABFT" IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT'01), October 24-26, 2001, San Francisco, California.

 Behrooz Parhami "Approach to component-based synthesis of fault-tolerant software" Informatica 25 (2001) 533-543.

 John A. Gunnels, Daniel S. Katz, Enrique S. Quintana-Ortí, Robert A. van de Geijn "Fault-Tolerant High-Performance Matrix Multiplication: Theory and Practice" Proceedings of the 2001 International Conference on Dependable Systems and Networks, 1-4 July 2001, Göteborg, Sweden, IEEE Computer Society, ISBN 0-7695-1101-5, pp. 47-56.

 Ahmad A. Al-Yamani, Nahmsuk Oh, Edward J. McCluskey "Algorithm-Based Fault Tolerance: A performance perspective based on error rate" Supplement of the 2001 International Conference on Dependable Systems and Networks, 1-4 July 2001, Göteborg, Sweden, IEEE Computer Society, pp. B-108, B-109.

Year 2000 : 1 citation

 1. Michael Turmon, Robert Granat, Daniel S. Katz, "Software-Implemented Fault Detection for High-Performance Space Applications" Proc. of DSN-2000 - The International Conference on Dependable Systems and Networks (FTCS-30, DCCA-8), 25-28 June 2000, New York, USA, IEEE Computer Society Press, ISBN 0-7695-0707-7, pp. 107-116.