ParaStack: Efficient Hang Detection for MPI Programs at
Large Scale
Session: Optimizing MPI
Authors
Event Type: Paper
Parallel Programming Languages, Libraries, Models
and Notations
Programming Systems
Runtime Systems
Time: Thursday, November 16th, 4pm - 4:30pm
Location: 405-406-407
Description: While program hangs on large parallel systems
can be detected via the widely used timeout mechanism, it is
difficult to set an optimal timeout threshold if users
have limited knowledge of a program. Too small a timeout
leads to high false alarm rates, and too large a
timeout wastes valuable computing resources. This
paper presents a highly efficient hang detection tool,
ParaStack, that does not rely on timeouts. We have
adapted ParaStack to work with the Torque and Slurm parallel
job schedulers and validated both its functionality and
performance on Stampede, currently the world's
tenth-fastest supercomputer. Experimental results demonstrate
that ParaStack can detect hangs accurately, in a timely
manner, and at negligible runtime cost. ParaStack also
pinpoints the faulty processes with high accuracy when
the hang is caused by errors in the computation phase.




