[Meanwhile, much later, as I thought I'd sent this...]

Ralph Castain <r...@open-mpi.org> writes:

> Hi Zhang
>
> We have seen little interest in binary level CR over the years, which
> is the primary reason the support has lapsed.

That might be a bit chicken and egg!

> The approach just doesn’t scale very well.

Presumably that depends, and it definitely seems reasonable at our
scale.  (mvapich seems to take it seriously.)

> Once the graduate student who wrote it
> received his degree, there simply wasn’t enough user-level interest to
> motivate the developer members to maintain it.
>
> In the interim, we’ve seen considerable interest in application-level
> CR in its place. You might check out the SCR library from LLNL as an
> example of what people are doing in that space:

Does it support ORTE?  When I last looked, it said only SLURM, but maybe
that doesn't include mvapich with other starters.  Also it assumes local
storage (or the associated in-memory filesystem), in case that's an
issue.

Is SCR not actually used for system-level checkpoints in mvapich?  I
assumed it was from what I'd read.

> https://computation.llnl.gov/project/scr/
>
> We did have someone (another graduate student) recently work with the
> community to attempt to restore the binary-level CR support, but he
> didn’t get a chance to complete it prior to graduating. So we are
> removing the leftover code from the 2.x release series until someone
> comes along with enough interest to repair it.

How much knowledge and effort would that take?  Presumably knowing what
broke it would give some indication.

> Assuming that hasn’t happened before sometime next year, I might take
> a shot at it then - but I won’t have any time to work on it before
> next spring at the earliest, and as I said, it isn’t clear there is a
> significant user base for binary-level CR with the shift to
> application-level systems.

I'm sure it varies, but I don't see much useful checkpointing support,
and/or users willing to use it, here.

[Quite often it would be more useful to migrate part of a job, rather
than restart the whole thing, though that obviously requires support
from the resource manager.]
