I stumbled across CRIU (Checkpoint/Restore In Userspace), https://criu.org/Main_Page, a couple of weeks ago. I have not used it yet; it's on my to-do list. They claim that it's packaged with most distros; I checked RHEL/CentOS and it was there. Be careful about package/kernel version compatibility, which is a good reason to go with the version included in your distro. BLCR was last updated in January 2013; back in the day, it worked well enough for simpler apps, but less well for complicated MPI apps.
- geo

> On Oct 4, 2019, at 11:17 PM, Renfro, Michael <ren...@tntech.edu> wrote:
>
> DMTCP might be an option? Pretty sure there are RPMs for it in RHEL/CentOS 7.
> Don’t recall it being any trouble to install.
>
> http://dmtcp.sourceforge.net/
>
> On Oct 4, 2019, at 9:47 PM, Eliot Moss <m...@cs.umass.edu> wrote:
>
>> Dear slurm users --
>>
>> I'm new to slurm (somewhat experienced with Grid Engine, though that's
>> not relevant to this post). I have access to two slurm based clusters,
>> and have an application that (a) can be _very_ long running (more than
>> 8 weeks for one execution, though the compute and I/O demands of one
>> such job are not huge by modern standards) and that (b) is not at all
>> practical to convert to do its own checkpoints. (I am running traces
>> from the valgrind program of every memory reference and branch made
>> when running individual SPEC benchmarks; this is then piped to 8
>> downstream analyzers, mostly Java programs.)
>>
>> From what I have read, BLCR would meet my needs for checkpointing,
>> but the admins of both clusters are reluctant to pursue BLCR support.
>> I myself am wondering whether it is still working, etc., and what it
>> means that built-in support has been removed, etc. Can someone offer
>> a brief explanation of the status and recent history of BLCR w.r.t.
>> slurm?
>>
>> Many thanks! Eliot Moss, UMass Amherst Computer Science