CISUC

Portable Checkpointing and Recovery

Authors

Subject

Checkpointing

Conference

4th IEEE International Symposium on High Performance Distributed Computing (HPDC-4), August 1995


Cited by

Year 2002 : 1 citation

 1. Karablieh, F., Bazzi, R.A., "Heterogeneous checkpointing for multithreaded applications", Proceedings of the 21st IEEE Symposium on Reliable Distributed Systems, 2002, Tempe, Arizona, USA

Year 2001 : 1 citation

 1. F. Karablieh, R. Bazzi, M. Hicks, "Compiler-Assisted Heterogeneous Checkpointing", Proc. 20th IEEE Symposium on Reliable Distributed Systems (SRDS'01), October 2001, New Orleans, Louisiana

Year 1999 : 5 citations

 Casanova, H., Kim, M., Plank, J., Dongarra, J., "Adaptive Scheduling for Task Farming with Grid Middleware", International Journal of Supercomputer Applications and High-Performance Computing, Sage Publications Inc, Thousand Oaks, pp. 231-240, Volume 13, Number 3, Fall 1999.

 Mootaz Elnozahy, Lorenzo Alvisi, Yi-Min Wang, David B. Johnson, "A Survey of Rollback-Recovery Protocols in Message Passing Systems", CMU Technical Report CMU-CS-99-148, June 1999, Carnegie Mellon University, USA.

 James S. Plank, Henri Casanova, Micah Beck and Jack Dongarra, "Deploying Fault Tolerance and Task Migration with NetSolve", Future Generation Computer Systems, Volume 15, 1999, pages 745 - 755. Elsevier.

 Casanova H, Kim MH, Plank JS, Dongarra JJ, "Adaptive scheduling for task farming with Grid middleware", EURO-PAR'99: Parallel Processing, Lecture Notes in Computer Science, 1685, 1999, pp. 30-43.

 Plank JS, Chen YQ, Li K, Beck M, Kingsley G, "Memory exclusion: Optimizing the performance of checkpointing systems", Software-Practice & Experience, John Wiley & Sons Ltd, W Sussex, vol. 29, no. 2, FEB 1999, pp. 125-142.

Year 1998 : 3 citations

 Russ SH, Robinson J, Flachs BK, Heckel B, "The Hector distributed run-time environment", IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 11, NOV 1998, pp. 1102-1114.

 J. A. Kohl, P. M. Papadopoulos, "Efficient and Flexible Fault Tolerance and Migration of Scientific Simulations Using CUMULVS", 2nd SIGMETRICS Symposium on Parallel and Distributed Tools, Welches, OR, August 1998.

 Plank JS, Casanova H, Beck M, Dongarra J, "Deploying fault-tolerance and task migration with NetSolve", Applied Parallel Computing, Lecture Notes in Computer Science, Springer-Verlag Berlin, 1541, 1998, pp. 418-432.

Year 1997 : 7 citations

 Samuel H. Russ, Jonathan Robinson, Matt Gleeson, Jose Figueroa, "Dynamic Communication Mechanism Switching in Hector", Technical Report No. MSSU-EIRS-ERC-97-8, Mississippi State University, USA, 1997.

 V. K. Naik, S. P. Midkiff, and J. E. Moreira, "A Checkpointing Strategy for Scalable Recovery on Distributed Parallel Systems", in the Proceedings of SC97: High Performance Networking and Computing, (San Jose, CA), Nov 1997.

 James S. Plank and Youngbae Kim and Jack J. Dongarra, "Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing", Journal of Parallel and Distributed Computing, Academic Press Inc, San Diego, vol. 43, no. 2, pp. 125-138, June 15, 1997.

 James S. Plank, "An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance", Technical Report of University of Tennessee, UT-CS-97-372, Jul. 1997.

 Y. Kim, J. S. Plank, J. Dongarra, "Fault-Tolerant Matrix Operations for Networks of Workstations Using Multiple Checkpointing", Proc. High-Performance Computing on the Information Superhighway, HPC Asia'97, Seoul, Korea, April 1997

 V. Naik, S. Midkiff, J. Moreira, "A Checkpointing Strategy for Scalable Recovery on Distributed Parallel Systems", Proceedings of Supercomputing'97, 1997

 Bazzi, R.A., "Portable memory", Proceedings of the Sixth IEEE Computer Society Workshop on Future Trends of Distributed Computing Systems, 1997, Tunis, Tunisia

Year 1996 : 2 citations

 Y. Kim, J. S. Plank, and J. J. Dongarra, "Fault tolerant matrix operations using checksum and reverse computation", in The 6th Symposium of The Frontiers of Massively Parallel Computation, pages 70-77, Annapolis MD, October 1996.

 Youngbae Kim, "Fault Tolerant Matrix Operations for Parallel and Distributed Systems", PhD Thesis, The University of Tennessee, USA, August 1996.

Year 1995 : 1 citation

 1. Samuel Russ, Brian Flachs, Jonathan Robinson, Bjorn Heckel, "Hector: Automated Task Allocation for MPI", Mississippi State University Technical Report, MSSU-EIRS-ERC-95-6, September 19, 1995.