6-8 Apr. 2016, Lyon (France)
Exploiting linearity and asynchrony to reduce errors' impact in iterative solvers
Marc Casas  1  
1 : Barcelona Supercomputing Center [Barcelona]  (BSC)
Torre Girona c/ Jordi Girona, 31 08034 Barcelona -  Spain

This talk presents a method to protect linear solvers from Detected and Uncorrected Errors (DUE) that relies on error detection mechanisms already implemented in commodity hardware. These mechanisms detect errors at memory-page granularity, which enables the use of simple linear relationships for correction purposes. Such linear relations, which are straightforward to derive and allow exact data recovery, would be inapplicable under coarse-grained error detection. The linear recovery techniques are deployed either by overlapping them with algorithmic computations or by forcing them onto the critical path of the application. There is a trade-off between the two approaches that depends on the error rate the solver experiences.
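
The abstract does not say which linear relations are used. As an illustrative sketch only, assume a residual-based solver that maintains r = b - A x. If the page holding a contiguous block of x is lost, that block can be rebuilt exactly from the surviving data by solving a small system with the matching diagonal block of A. The function names, the 4x4 matrix, and the two-entry "page" below are hypothetical choices made for the example, not taken from the talk.

```python
# Sketch: rebuild a lost "page" (contiguous block) of x from r = b - A x.
# All names and sizes here are illustrative assumptions.

def matvec_rows(A, x, rows, skip=()):
    """For each row i in rows, return sum_j A[i][j] * x[j], ignoring columns in skip."""
    skip = set(skip)
    return [sum(A[i][j] * x[j] for j in range(len(x)) if j not in skip)
            for i in rows]

def solve_dense(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    y = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - tail) / aug[i][i]
    return y

def recover_x_page(A, b, r, x, page):
    """Solve A[page,page] x[page] = b[page] - r[page] - A[page,rest] x[rest]."""
    off_page = matvec_rows(A, x, page, skip=page)   # never reads the lost entries
    rhs = [b[i] - r[i] - o for i, o in zip(page, off_page)]
    block = [[A[i][j] for j in page] for i in page]
    return solve_dense(block, rhs)

# Small diagonally dominant test system and an arbitrary current iterate.
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]
x = [0.5, -0.25, 1.0, 0.75]
r = [bi - ax for bi, ax in zip(b, matvec_rows(A, x, range(4)))]

page = [1, 2]                      # indices stored in the lost page
x_damaged = list(x)
for i in page:
    x_damaged[i] = float("nan")    # page contents are gone

for i, value in zip(page, recover_x_page(A, b, r, x_damaged, page)):
    x_damaged[i] = value           # block restored up to rounding error
```

The cost of the recovery grows with the size of the lost block, which is one way to see why detection at page granularity matters: if a whole vector were flagged as lost, the "small" system would become the original problem.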

