Hi! I'm trying to implement checkpointing on our cluster, and I have an obvious question.
I guess this has been implemented many times by other users, so I would appreciate it if someone could share their experience with me. For serial and multithreaded jobs it is fairly clear, but what about parallel jobs? We have "fat" 16-core nodes, so users run both OpenMP and MPI programs. Should I just perform some checks in my checkpointing script and call ompi-checkpoint if those checks show that it is an MPI job? What is the "usual" way?

Best,
Anton