Hi,

Thanks for the advice. I have already tried MANA, but at present it only works with MPICH, not OpenMPI, which is what I've set up via Ubuntu.
AR

On Sun, 19 Feb 2023, 02:10 Christopher Samuel, <ch...@csamuel.org> wrote:

> On 2/10/23 11:06 am, Analabha Roy wrote:
>
> > I'm having some complex issues coordinating OpenMPI, SLURM, and DMTCP in
> > my cluster.
>
> If you're looking to try checkpointing MPI applications you may want to
> experiment with the MANA ("MPI-Agnostic, Network-Agnostic MPI") plugin
> for DMTCP here: https://github.com/mpickpt/mana
>
> We (NERSC) are collaborating with the developers and it is installed on
> Cori (our older Cray system) for people to experiment with. The
> documentation for it may be useful to others who'd like to try it out -
> it's got a nice description of how it works too which even I as a
> non-programmer can understand.
>
> https://docs.nersc.gov/development/checkpoint-restart/mana/
>
> Pay special attention to the caveats in our docs though!
>
> I've not used it myself, though I'm peripherally involved to give advice
> on system related issues.
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA