CISUC

An Experimental Evaluation of Coordinated Checkpointing in a Parallel Machine

Authors

Abstract

Coordinated checkpointing is a very effective way to ensure the continuity of distributed and parallel applications in the presence of failures. Previous studies have shown that this approach achieves better results than independent checkpointing and message logging. However, more needs to be known about the real overhead of coordinated checkpointing, and well-founded insights are needed into the best way to implement this fault-tolerance technique. This paper presents an experimental evaluation of coordinated checkpointing in a parallel machine. It describes some optimization techniques and presents some performance results.

Subject

Checkpointing

Conference

Third European Dependable Computing Conference (EDCC-3), September 1999


Cited by

Year 2000: 1 citation

 1. Julia L. Lawall, Gilles Muller, "Efficient Incremental Checkpointing of Java Programs," Proc. of DSN-2000 - The International Conference on Dependable Systems and Networks (FTCS-30, DCCA-8), 25-28 June 2000, New York, USA, IEEE Computer Society Press, ISBN 0-7695-0707-7, pp. 61-70.