Algorithm Based Fault Tolerance versus Result-Checking for Matrix Computations
Authors
Abstract
Algorithm Based Fault Tolerance (ABFT) is the collective name for a set of techniques used to determine the correctness of some mathematical calculations. A lesser-known alternative is Result Checking (RC) where, contrary to ABFT, results are checked without knowledge of the particular algorithm used to calculate them. In this paper, the two are compared using some practical implementations of matrix computations. The criteria are performance and memory overhead, ease of use, and error coverage. For the latter, extensive error injection experiments were performed. To the best of our knowledge, this is the first time that RC has been validated by fault injection.
We conclude that Result Checking has the important advantage of being independent of the underlying algorithm. It also generally incurs less performance overhead than ABFT, while the two techniques are essentially equivalent in terms of error coverage.
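To make the distinction concrete, here is a minimal sketch of both ideas for matrix multiplication. It is not taken from the paper, whose implementations may differ: the result checker uses Freivalds' randomized test, the ABFT routine uses a checksum scheme in the style of Huang and Abraham, and all function names are hypothetical. Integer entries are assumed so that exact comparison is valid; floating-point data would need a tolerance.

```python
import random

def matmul(a, b):
    """Plain product; stands in for any multiplication routine."""
    m, p = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(m)) for j in range(p)]
            for row in a]

def matvec(a, x):
    """Matrix-vector product."""
    return [sum(v * xj for v, xj in zip(row, x)) for row in a]

def result_check(a, b, c, rounds=20, rng=None):
    """Result checking (Freivalds' test): accept c as a*b only if
    a(b r) == c r for several random 0/1 vectors r. The checker never
    looks at how c was computed."""
    rng = rng or random.Random()
    for _ in range(rounds):
        r = [rng.randint(0, 1) for _ in range(len(c[0]))]
        if matvec(a, matvec(b, r)) != matvec(c, r):
            return False
    return True

def abft_matmul(a, b):
    """Checksum-style ABFT: the multiplication itself runs on operands
    augmented with checksums, and the product's checksums are verified."""
    a_c = a + [[sum(col) for col in zip(*a)]]   # extra row of column sums
    b_r = [row + [sum(row)] for row in b]       # extra column of row sums
    c_f = matmul(a_c, b_r)                      # full-checksum product
    n, p = len(a), len(b[0])
    rows_ok = all(sum(c_f[i][:p]) == c_f[i][p] for i in range(n))
    cols_ok = all(sum(c_f[i][j] for i in range(n)) == c_f[n][j]
                  for j in range(p))
    return [row[:p] for row in c_f[:n]], rows_ok and cols_ok
```

The result checker accepts any triple (a, b, c) and is therefore independent of the multiplication routine, whereas the ABFT version has to be built into the computation. In this sketch the checker costs a few matrix-vector products per round, and each round can miss a wrong result with probability at most 1/2.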
Keywords
Result-checking, ABFT, Fault injection, Error Detection, Matrix operations.
Subject
Algorithm Based Fault Tolerance
Conference
FTCS-29, June 1999
Cited by
Year 2003 : 1 citation
1. Örjan Askerdal "On Impact and Tolerance of Data Errors with Varied Duration in Microprocessors" PhD Thesis, Department of Computer Engineering, Chalmers University of Technology, Sweden, 2003, ISBN 91-7291-285-5.
Year 2001 : 4 citations
1. Ahmad A. Al-Yamani, Nahmsuk Oh, Edward J. McCluskey "Performance Evaluation of Checksum-Based ABFT" IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT'01), October 24-26, 2001, San Francisco, California.
2. Behrooz Parhami "Approach to component-based synthesis of fault-tolerant software" Informatica 25 (2001) 533-543.
3. John A. Gunnels, Daniel S. Katz, Enrique S. Quintana-Ortí, Robert A. van de Geijn "Fault-Tolerant High-Performance Matrix Multiplication: Theory and Practice" Proceedings of the 2001 International Conference on Dependable Systems and Networks, 1-4 July 2001, Göteborg, Sweden, IEEE Computer Society, ISBN 0-7695-1101-5, pp. 47-56.
4. Ahmad A. Al-Yamani, Nahmsuk Oh, Edward J. McCluskey "Algorithm-Based Fault Tolerance: A performance perspective based error rate" Supplement of the 2001 International Conference on Dependable Systems and Networks, 1-4 July 2001, Göteborg, Sweden, IEEE Computer Society, pp. B-108, B-109.
Year 2000 : 1 citation
1. Michael Turmon, Robert Granat, Daniel S. Katz, "Software-Implemented Fault Detection for High-Performance Space Applications" Proc. of DSN-2000 - The International Conference on Dependable Systems and Networks (FTCS-30, DCCA-8), 25-28 June 2000, New York, USA, IEEE Computer Society Press, ISBN 0-7695-0707-7, pp. 107-116.