\documentstyle[11pt]{article}
\setlength{\topmargin}{-.5in}
\setlength{\textheight}{9in}
\setlength{\oddsidemargin}{.125in}
\setlength{\textwidth}{6.25in}

\begin{document}

\title{Fault Tolerance in Cluster Computing Systems: Survey???}
\author{Cyril Banino, Anne C. Elster, ...\\ NTNU}
\renewcommand{\today}{}
\maketitle

\section{Introduction}

Clusters of commodity PC hardware, also known as Networks of Workstations (NOWs), are becoming widely used as computational resources. The literature often states that the main argument for purchasing a cluster is its excellent price/performance ratio, which is unmatched by other platforms. However, one should not underestimate the potential of this class of HPC architecture. The presence of NOWs (including PC based systems) in the TOP500~\cite{meuer00top} has grown steadily since 1993, and at the time of writing the fifth position is occupied by the Linux Networx cluster installed at Lawrence Livermore National Laboratory.

The demand for ever higher performance drives the adoption of new microprocessor technology and the use of more and more processors~\cite{strohmaier99marketplace}. As an example, the Linux Networx cluster is built with 2,304 Intel 2.4 GHz Xeon processors grouped in 1,152 nodes. A critical issue with computers of this size is the number of failures that can occur during a computation. In this paper we review the techniques used to handle failures in high performance computing applications.

\section{Checkpointing}

One common way to handle failures in today's applications is to periodically checkpoint the state of the computation~\cite{bricker92condor}. If a failure occurs, all the processes are killed and the application is restarted from the last checkpoint. Another form of checkpointing, called diskless checkpointing~\cite{plank98diskless}, adds extra processors that receive and store encoded data from the application processors.
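
A rough estimate illustrates why restarting every process after each failure becomes costly as the machine grows. As a simplifying assumption (ours, not taken from the cited works), suppose that the $N$ nodes fail independently, each with an exponentially distributed lifetime of mean $M$. The time to the first failure anywhere in the system is then exponentially distributed with rate $N/M$, so the mean time between failures of the whole system is
\[
  M_N = \frac{M}{N}.
\]
For illustration only, taking a hypothetical per-node value of $M = 5$ years (about $43{,}800$ hours) and the $N = 1152$ nodes of the cluster mentioned in the introduction gives
\[
  M_N \approx \frac{43{,}800 \mbox{ h}}{1152} \approx 38 \mbox{ h}.
\]
Under these assumptions a global restart would be expected roughly every 38 hours on average, and this interval shrinks in proportion to $1/N$ as nodes are added.
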
The checkpointing technique has proven to be very useful; however, it does not scale well. When an application runs on a huge and complex system, one cannot afford to kill thousands of processes because a single one failed.

\section{MPI and fault tolerance}

MPI has become a standard message passing library for building high performance applications. However, MPI was designed with a static process model, which is adequate for small numbers of distributed nodes, as in clusters, but which becomes a critical limitation in an environment with a high failure rate. To handle failures at the MPI level, a new implementation of MPI called FT-MPI~\cite{fagg00ftmpifaulttolerant} has been released.

\section{Naturally Fault Tolerant Algorithms}

Al Geist and Christian Engelmann propose in~\cite{geist} to use naturally fault tolerant algorithms whenever the application permits it. These algorithms provide the correct answer even when some failures occur during the computation.

\bibliographystyle{unsrt}
\bibliography{biblio}

\end{document}