Welcome to the EESI2 Resilience Working Group wiki

This page contains links and references to key topics and documents on resilience for exascale computing.

Next conference call: May 15, 2014

Agenda:

- Update on the different topics from the experts.

- Preparation of the face-to-face meeting in September in Paris (Inria Place Italie).

Participants of the EESI2 working group:

Osman Unsal, BSC
Simon McIntosh-Smith, University of Bristol
Torsten Hoefler, EPFL
Bogdan Nicolae, IBM
Christine Morin, Inria
Pascale Rosse-Laurent, Bull
Luc Giraud, Inria

Links to Exascale documents:

IESP

EESI

ICIS

DoE

Key topics:

Method level resilience to silent errors (ensembles, etc.)
Algorithm Specific Resilience Approaches (Linear Algebra, Graph, N-core, etc.)
Algorithmic level resilience (checksums, ABFT, etc.)
Fault tolerance programming interface (FTI, XscaleMP, for PGAS)
Fault tolerant runtime (MPI, Nanos)
Replication approaches (RMPI, RedMPI)
Checkpointing libraries (FTI, SCR)
Checkpointing acceleration techniques (Incremental, compression, page reordering)
Checkpointing interval calculation
Fault tolerant protocols (coordinated, message logging, hybrid)
Failure prediction (HELO, ELSA)
Low level fault tolerance approach (Containment domains)
Hardware level resilience (detection/correction of errors)
Error and failure characterisation 

Links to key publications:

 
