
Multi-institution collaboration topics

Contributors:  

ANL: Franck Cappello, Leonardo Bautista Gomes

Inria: Yves Robert, Luc Giraud

UIUC: Bill Kramer, Sanjay Kale, Jon Calhoun

JSC: Suraj Prabhakaran

BSC: Marc Casas, Luc Jaulmes, Miquel Moretó

System logs archive (possibly including scheduler logs, environmental logs, file system logs, etc.)

White paper on Resilience best practices in HPC:

  • Common definition of disruptions
  • SDC:
    • Error measurements --> collecting papers, summarizing them
    • SDC injection best practices --> collecting practices from the JLESC members
    • SDC injection tools --> collecting tools from the different institutions
    • Metrics for detection --> collecting practices
    • Applications set --> collecting applications, configuration, input parameters (in liaison with users and app developers)
    • Recovery
  • Fail stop:
    • Failure log analysis tools?
    • Best practices in failure injection --> collecting practices and tools
    • Recovery --> API to connect application/runtime with job/resource scheduler

 

Collaboration projects

