6-8 Apr. 2016, Lyon (France)
Resilience algorithms to cope with fail-stop and silent errors
Hongyang Sun 1
1 : Laboratoire de l'Informatique du Parallélisme (LIP)
École Normale Supérieure (ENS) - Lyon, INRIA, CNRS : UMR5668, PRES Université de Lyon, Université Claude Bernard - Lyon I (UCBL)
46 Allée d'Italie 69364 LYON CEDEX 07 -  France

This talk focuses on resilience algorithms at extreme scale. Many papers deal with fail-stop errors, and many others deal with silent errors (also called silent data corruptions), but very few address both simultaneously. Yet HPC applications will have to cope with both error sources. This talk presents a unified framework and optimal algorithmic solutions to this double challenge. Silent errors are handled via verification mechanisms (either partially or fully accurate) and in-memory checkpoints. Fail-stop errors are handled via disk checkpoints. All verification and checkpoint types are combined into computational patterns. We provide a unified model and a full characterization of the optimal pattern. Our results extend several published solutions and show how to combine different techniques to address the double threat of fail-stop and silent errors.
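
To make the notion of a computational pattern concrete, the sketch below computes the error-free execution time of one hypothetical pattern. It is an illustration only, not the model presented in the talk: the class name `Pattern`, its fields, and the assumed layout (each segment ends with a fully accurate verification followed by an in-memory checkpoint, intermediate chunks end with a partial verification, and one disk checkpoint closes the pattern) are assumptions made for this example. Errors and recovery costs are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """Hypothetical two-level pattern; all field names are illustrative."""
    work: float           # useful computation time in the pattern (seconds)
    segments: int         # segments, each closed by an in-memory checkpoint
    chunks: int           # chunks per segment, each closed by a verification
    partial_verif: float  # cost of one partially accurate verification
    full_verif: float     # cost of one fully accurate verification
    mem_ckpt: float       # cost of one in-memory checkpoint
    disk_ckpt: float      # cost of the single disk checkpoint ending the pattern

    def __post_init__(self) -> None:
        if self.work <= 0 or self.segments < 1 or self.chunks < 1:
            raise ValueError("work must be positive; segments and chunks >= 1")

    def error_free_time(self) -> float:
        """Time to run the pattern when no error strikes."""
        # Assumption: the last chunk of each segment uses the fully accurate
        # verification, so only (chunks - 1) partial ones occur per segment.
        partial = self.segments * (self.chunks - 1) * self.partial_verif
        full = self.segments * self.full_verif
        memory = self.segments * self.mem_ckpt
        return self.work + partial + full + memory + self.disk_ckpt

    def error_free_overhead(self) -> float:
        """Fraction of extra time spent on resilience, relative to the work."""
        return self.error_free_time() / self.work - 1.0


# Made-up costs, chosen only to exercise the arithmetic.
example = Pattern(work=3600.0, segments=3, chunks=4, partial_verif=1.0,
                  full_verif=10.0, mem_ckpt=5.0, disk_ckpt=60.0)
print(example.error_free_time())      # 3714.0
print(example.error_free_overhead())
```

With these made-up costs, the resilience operations add 114 seconds to 3600 seconds of work. Choosing the number of segments and chunks involves a trade-off between this overhead and the amount of work lost when an error does strike, which the sketch does not model.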

