6-8 Apr. 2016, Lyon (France)
Leveraging partial determinism in MPI applications for efficient fault tolerance
Thomas Ropars  1  
1 : Université Grenoble Alpes  (UGA)

System-level fault-tolerance techniques for HPC applications are mostly rollback-recovery techniques. The main rollback-recovery techniques for distributed applications were designed in the 1980s and 1990s and made no assumptions about the nature of the applications. As HPC systems have grown in scale, the scalability of these techniques has come into question. At the same time, most large-scale HPC applications, and MPI applications in particular, exhibit characteristics that can be exploited to improve rollback-recovery performance: the execution of most MPI applications is at least partially deterministic. This partial determinism can help in designing new scalable fault-tolerant protocols. In this talk, I will introduce a new execution model that captures this partial determinism and present scalable fault-tolerant protocols designed around this model.


