\documentstyle[11pt]{article}
\setlength{\topmargin}{-.5in}
\setlength{\textheight}{9in}
\setlength{\oddsidemargin}{.125in}
\setlength{\textwidth}{6.25in}
\begin{document}
\title{Fault Tolerance in Cluster Computing Systems: Survey???}
\author{Cyril Banino, Anne C. Elster, ...\\
NTNU}
\renewcommand{\today}{}
\maketitle

\section{Introduction}
Clusters of commodity PC hardware (or Networks of Workstations, NOWs) are
becoming widely used as computational resources. The literature often
cites the excellent price/performance ratio, unmatched by other
platforms, as the main argument for purchasing a cluster. However, one
should not underestimate the potential of this class of HPC
architecture. NOWs (including PC-based systems) have steadily gained
ground in the TOP500~\cite{meuer00top} since 1993, and the fifth position
is currently occupied by the Linux Networx cluster installed at Lawrence
Livermore National Laboratory.

The need for ever higher performance drives the adoption of new
microprocessor technology and of ever larger numbers of
processors~\cite{strohmaier99marketplace}.
For example, the Linux Networx cluster is built from 2,304 Intel 2.4
GHz Xeon processors grouped in 1,152 nodes. A critical issue with
machines of this size is the number of failures that can occur during
a computation. In this paper we review the techniques used to
handle failures in high-performance computing applications.
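
To give a rough sense of scale for the failure issue, assume
(hypothetically; none of the following figures come from the systems
cited above) that nodes fail independently, with exponentially
distributed lifetimes and a common per-node mean time between failures
$M$. The time to the first failure among $N$ nodes is then also
exponentially distributed, with mean
\[
  M_{\rm sys} = \frac{M}{N}.
\]
For $N = 1152$ nodes and an assumed $M = 10$ years ($87\,600$ hours),
this gives $M_{\rm sys} \approx 76$ hours, so under these assumptions a
long-running computation would see a node failure roughly every three
days.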

\section{Checkpointing}
One common way to handle failures in today's applications is to
periodically checkpoint the state of the computation~\cite{bricker92condor}. If
a failure occurs, all the processes are killed and the
application is restarted from the last checkpoint.
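
The checkpoint/restart cycle described above can be sketched in a few
lines of Python. This is only an illustration for a single process, not
the mechanism of the cited system: the function names, the
\texttt{fail\_at} parameter used to simulate a failure, and the toy
computation (summing integers) are all assumptions made for the example.

\begin{verbatim}
```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically, so that a crash during the write
    never corrupts the previous checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename of the finished file


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if none exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, interval, fail_at=None):
    """Sum 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` (hypothetical, for the demo only) raises an error just
    before that step to simulate a failure."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```
\end{verbatim}

Calling \texttt{run} again with the same path after a failure resumes
from the last saved step; the work done since that checkpoint is
recomputed.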

Another form of checkpointing, called diskless
checkpointing~\cite{plank98diskless}, adds extra processors that
receive and store encoded data from the application processors.
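
As an illustration of what such encoded data can look like, the sketch
below assumes a bitwise parity (XOR) encoding held by a single extra
processor; the text above does not fix the encoding, so this choice,
the function names, and the requirement that all states have the same
length are assumptions of the example.

\begin{verbatim}
```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(states):
    """Data kept by the extra processor: the XOR of the states of all
    application processors (assumed to have equal length)."""
    return reduce(xor_bytes, states)


def recover(states, parity, failed):
    """Rebuild the state of the single failed processor from the
    surviving states and the stored parity."""
    survivors = [s for i, s in enumerate(states) if i != failed]
    return reduce(xor_bytes, survivors, parity)
```
\end{verbatim}

With a single parity block only one failure can be tolerated at a time;
tolerating more simultaneous failures requires a stronger encoding.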

Checkpointing has proven to be very useful, but it does not scale
well. When an application runs on a huge and complex system, one
cannot afford to kill thousands of processes because a single one
failed.


\section{MPI and fault tolerance}
MPI has become the standard message-passing interface for building
high-performance applications. However, MPI was designed with a static
process model, which is adequate for small numbers of distributed
nodes, as in typical clusters, but becomes a serious limitation in
environments with a high failure rate. To handle failures at the MPI
level, a new implementation of MPI called
FT-MPI~\cite{fagg00ftmpifaulttolerant} has been released.

\section{Naturally Fault Tolerant Algorithms}

Al Geist and Christian Engelmann~\cite{geist} propose using
naturally fault-tolerant algorithms whenever the application permits
it. Such algorithms produce the correct answer even when some failures
occur during the computation.
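
A toy example may help convey the idea. The sketch below is not the
algorithm of the cited work; it assumes a simple iterative relaxation
on a one-dimensional grid in which every point update is randomly
skipped with probability \texttt{drop\_prob}, standing in for a process
whose update was lost to a failure. All names and parameters are
hypothetical.

\begin{verbatim}
```python
import random


def relax(n, left, right, drop_prob, sweeps, seed=0):
    """Solve u'' = 0 on n grid points with fixed end values by
    repeated neighbour averaging.

    Each point update is skipped with probability `drop_prob`,
    simulating an update lost to a failure."""
    rng = random.Random(seed)
    u = [0.0] * n
    u[0], u[-1] = left, right
    for _ in range(sweeps):
        for i in range(1, n - 1):
            if rng.random() < drop_prob:
                continue  # lost update: the point keeps its stale value
            u[i] = 0.5 * (u[i - 1] + u[i + 1])
    return u
```
\end{verbatim}

Because every update only averages the current neighbour values, a
skipped update slows convergence without changing the limit: the
iteration still approaches the straight line between the two boundary
values, with no checkpoint or restart involved.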

\bibliographystyle{unsrt}
\bibliography{biblio}


\end{document}