[Meanwhile, much later, as I thought I'd sent this...]

Ralph Castain <r...@open-mpi.org> writes:

> Hi Zhang
>
> We have seen little interest in binary level CR over the years, which
> is the primary reason the support has lapsed.

That might be a bit chicken and egg!

> The approach just doesn’t scale very well.

Presumably that depends, and it definitely seems reasonable at our
scale.  (mvapich seems to take it seriously.)

> Once the graduate student who wrote it
> received his degree, there simply wasn’t enough user-level interest to
> motivate the developer members to maintain it.
>
> In the interim, we’ve seen considerable interest in application-level
> CR in its place. You might checkout the SCR library from LLNL as an
> example of what people are doing in that space:

Does it support ORTE?  When I last looked, it said only SLURM, but maybe
that doesn't include mvapich with other starters.  Also it assumes local
storage (or the associated in-memory filesystem), in case that's an
issue.

Is SCR not actually used for system-level checkpoints in mvapich?  I
assumed it was from what I'd read.

> https://computation.llnl.gov/project/scr/
>
> We did have someone (another graduate student) recently work with the
> community to attempt to restore the binary-level CR support, but he
> didn’t get a chance to complete it prior to graduating. So we are
> removing the leftover code from the 2.x release series until someone
> comes along with enough interest to repair it.

How much knowledge and effort would that take?  Presumably knowing what
broke it would give some indication.

> Assuming that hasn’t happened before sometime next year, I might take
> a shot at it then - but I won’t have any time to work on it before
> next spring at the earliest, and as I said, it isn’t clear there is a
> significant user base for binary-level CR with the shift to
> application-level systems.

I'm sure it varies, but I don't see much useful checkpointing support,
and/or users willing to use it, here.

[Quite often it would be more useful to migrate part of a job, rather than restart the whole thing, though that obviously requires support from the resource manager.]