
Multi-institution collaboration topics

Contributors:  

ANL: Franck Cappello, Leonardo Bautista Gomez

Inria: Yves Robert, Luc Giraud

UIUC: Bill Kramer, Sanjay Kale, Jon Calhoun

JSC: Suraj Prabhakaran

BSC: Marc Casas, Luc Jaulmes, Miquel Moretó

Archive of system logs (possibly including scheduler logs, environmental logs, file system logs, etc.)

White paper on Resilience best practices in HPC:

  • Common definition of disruptions
  • SDC:
    • Error measurements --> collecting papers, summarizing them
    • SDC injection best practices --> collecting practices among JLESC members
    • SDC injection tools --> collecting tools from the different institutions
    • Metrics for detection --> collecting practices
    • Applications set --> collecting applications, configuration, input parameters (in liaison with users and app developers)
    • Recovery
  • Fail stop:
    • Failure log analysis tools?
    • Best practices in failure injection --> collecting practices and tools
    • Recovery --> API to connect application/runtime with job/resource scheduler

 

Collaboration projects

