Hi!
I'm trying to implement checkpointing on our cluster, and I have an obvious 
question.

I guess this has been implemented many times by other users, so I would like 
someone to share their experience with me.

With serial/multithreaded jobs it is fairly clear. But what about parallel jobs?

We have "fat" 16-core nodes, so users run both OpenMP and MPI programs.

Should I just perform some checks in my checkpointing script and call 
ompi-checkpoint if the checks show that the job is an MPI job?

What is the "usual" way?

Best,

Anton


