P96: Correcting Detectable Uncorrectable Errors in Memory
Session: Poster Reception
Event Type: ACM Student Research Competition, Poster, Reception
Time: Tuesday, November 14th, 5:15pm - 7pm
Location: Four Seasons Ballroom
Description: With the expected decrease in Mean Time
Between Failures, fault tolerance has been identified as
one of the major challenges for exascale computing. One
source of faults is soft errors caused by cosmic rays,
which can corrupt bits of the data held in memory.
Current protection against these errors includes Error
Correcting Codes, which can detect and/or correct them.
When an error occurs that can be detected but not
corrected, a Detectable Uncorrectable Error (DUE)
results, and unless checkpoint-restart is used, the
system will usually fail. In our work we present a
probabilistic method of correcting DUEs that occur in
the part of memory where the program instructions are
stored. We devise a correction technique for DUEs for
the ARM A64 instruction set that combines an extended
Hamming code with a Cyclic Redundancy Check code to
provide a near-100% Successful Correction Rate of DUEs.
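
The general idea of pairing an extended Hamming code with a CRC can be illustrated with a minimal sketch. This is not the poster's implementation: the 39-bit code layout, the use of zlib's CRC-32 over a small block of instruction words, and all function names (encode, candidates, correct_due, and so on) are assumptions made for illustration. A double-bit error produces a non-zero syndrome with even overall parity (a DUE); every pair of bit positions whose XOR equals the syndrome is a candidate correction, and the CRC is used to choose among the candidates.

```python
import zlib

DATA_BITS = 32                       # one 32-bit instruction word
PARITY_POS = [1, 2, 4, 8, 16, 32]    # Hamming parity bits sit at powers of two
N = DATA_BITS + len(PARITY_POS)      # highest position (38); position 0 = overall parity
DATA_POS = [p for p in range(1, N + 1) if p not in PARITY_POS]


def syndrome(code):
    """XOR of the positions of all set bits; 0 for a valid Hamming word."""
    s = 0
    for p in range(1, N + 1):
        if (code >> p) & 1:
            s ^= p
    return s


def encode(word):
    """Encode a 32-bit word into a 39-bit extended Hamming (SECDED) code word."""
    code = 0
    for i, p in enumerate(DATA_POS):
        if (word >> i) & 1:
            code |= 1 << p
    s = syndrome(code)
    for p in PARITY_POS:
        if s & p:
            code |= 1 << p           # choose parity bits so the syndrome is 0
    if bin(code).count("1") % 2:
        code |= 1                    # overall parity bit makes the popcount even
    return code


def decode(code):
    """Extract the 32 data bits, without attempting any correction."""
    word = 0
    for i, p in enumerate(DATA_POS):
        if (code >> p) & 1:
            word |= 1 << i
    return word


def is_due(code):
    """Non-zero syndrome with even overall parity: detected but uncorrectable."""
    return syndrome(code) != 0 and bin(code).count("1") % 2 == 0


def candidates(code):
    """Data words reachable by flipping any two bits that restore a valid code word."""
    s = syndrome(code)
    out = []
    for i in range(N + 1):
        j = i ^ s                    # flipping i and j cancels the syndrome
        if i < j <= N:
            out.append(decode(code ^ (1 << i) ^ (1 << j)))
    return out


def block_crc(words):
    """CRC-32 over a block of 32-bit words (block granularity is an assumption)."""
    return zlib.crc32(b"".join(w.to_bytes(4, "little") for w in words))


def correct_due(codes, bad_index, stored_crc):
    """Return the candidate words for codes[bad_index] that match the stored CRC."""
    words = [decode(c) for c in codes]
    matches = []
    for cand in candidates(codes[bad_index]):
        words[bad_index] = cand
        if block_crc(words) == stored_crc:
            matches.append(cand)
    return matches
```

In this sketch, a caller would encode each instruction word with encode, store block_crc of the original words alongside them, and call correct_due when is_due reports a double-bit error in one code word. Because every wrong candidate differs from the true block only within a single 32-bit word, a 32-bit CRC is expected to reject it, which is one way a near-100% correction rate could arise; the rate reported in the poster comes from the authors' own technique, not from this code.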




