DMTCP might be an option? Pretty sure there are RPMs for it in RHEL/CentOS 7. Don’t recall it being any trouble to install.
http://dmtcp.sourceforge.net/

On Oct 4, 2019, at 9:47 PM, Eliot Moss <m...@cs.umass.edu> wrote:

Dear slurm users --

I'm new to slurm (somewhat experienced with Grid Engine, though that's not relevant to this post). I have access to two slurm based clusters, and have an application that (a) can be _very_ long running (more than 8 weeks for one execution, though the compute and I/O demands of one such job are not huge by modern standards) and that (b) is not at all practical to convert to do its own checkpoints. (I am running traces from the valgrind program of every memory reference and branch made when running individual SPEC benchmarks; this is then piped to 8 downstream analyzers, mostly Java programs.)

From what I have read, BLCR would meet my needs for checkpointing, but the admins of both clusters are reluctant to pursue BLCR support. I myself am wondering whether it is still working, etc., and what it means that built-in support has been removed, etc. Can someone offer a brief explanation of the status and recent history of BLCR w.r.t. slurm?

Many thanks!

Eliot Moss, UMass Amherst Computer Science