Dear Group,
I have an MPI program in which every process communicates with its
neighbors. When SELF-checkpointing, every process writes to a separate
file. The problem is that sometimes, after a checkpoint is made, the program
does not continue. Increasing the number of processes makes the problem more
severe. With just 1 process (no communication), SELF-checkpointing works
normally with no problems. I have tried different '--mca btl' parameters
(openib, tcp, sm, self), but the problem persists. I would very much
appreciate your support with this.
Kind regards,
Faisal