

Future HPC systems with ever-increasing resource capacity (such as compute cores, memory, and storage) may significantly increase reliability risks. Silent data corruptions (SDCs), or silent errors, are one of the major sources
that corrupt HPC execution results. Unlike fail-stop errors, SDCs are particularly harmful and dangerous in that they cannot be detected by hardware. We propose an online MAchine-learning-based CORruption Detection framework (abbreviated
as MACORD) for detecting SDCs in HPC applications. In particular, we comprehensively investigate the prediction ability of a multitude of machine-learning algorithms, and enable the detector to automatically select the best-fit
algorithms at runtime to adapt to the data dynamics. Our learning framework exhibits low memory overhead (less than 1%), since it takes only spatial features (i.e., neighboring data values for each data point in the current time step) into
the training data. Experiments based on real-world scientific applications/benchmarks show that our framework achieves a detection sensitivity (i.e., recall) of up to 99% while the false positive rate is limited to 0.1% in most cases, which is a one-order-of-magnitude improvement over the latest state-of-the-art spatial technique.
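
The idea of predicting each data point from its spatial neighbors, selecting the best-fit predictor, and scoring the detector by recall and false positive rate can be illustrated with a minimal sketch. This is not the MACORD implementation: the three candidate predictors, the synthetic sine-wave snapshots, the threshold rule (10 times the largest error seen on a snapshot assumed clean), and all function names are assumptions made for illustration only.

```python
import math

# Hypothetical candidate predictors (not from the paper): each one
# estimates v[i] from spatial neighbors in the same snapshot.
CANDIDATES = {
    "left_neighbor": lambda v, i: v[i - 1],
    "right_neighbor": lambda v, i: v[i + 1],
    "neighbor_mean": lambda v, i: 0.5 * (v[i - 1] + v[i + 1]),
}


def residuals(predict, v):
    """Absolute prediction error at every interior point of snapshot v."""
    return [abs(v[i] - predict(v, i)) for i in range(1, len(v) - 1)]


def select_best(v_clean):
    """Return the candidate name with the lowest mean absolute error."""
    def mean_abs_error(name):
        errs = residuals(CANDIDATES[name], v_clean)
        return sum(errs) / len(errs)
    return min(CANDIDATES, key=mean_abs_error)


def detect(v, name, threshold):
    """Return interior indices whose prediction error exceeds threshold."""
    errs = residuals(CANDIDATES[name], v)
    # errs[0] belongs to index 1, so shift positions back by one.
    return [pos + 1 for pos, err in enumerate(errs) if err > threshold]


def recall_and_fpr(flagged, corrupted, checked):
    """Recall = TP / (TP + FN); false positive rate = FP / (FP + TN)."""
    flagged, corrupted = set(flagged), set(corrupted)
    clean = set(checked) - corrupted
    recall = len(flagged & corrupted) / len(corrupted)
    fpr = len(flagged & clean) / len(clean)
    return recall, fpr


# Assumed-clean snapshot used to select the predictor and set the threshold.
clean_step = [math.sin(0.1 * i) for i in range(100)]
best = select_best(clean_step)
threshold = 10.0 * max(residuals(CANDIDATES[best], clean_step))

# Current time step: the field has shifted slightly, and one value is
# silently corrupted (synthetic injection for this example).
current_step = [math.sin(0.1 * i + 0.05) for i in range(100)]
current_step[50] += 5.0

flagged = detect(current_step, best, threshold)
recall, fpr = recall_and_fpr(flagged, [50], range(1, 99))
print(best, flagged, recall, fpr)
```

In this sketch the direct neighbors of the corrupted point can also be flagged, because their features include the corrupted value; a real detector would need some strategy to keep such cases from inflating the false positive rate.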
