Hi,

I recently installed Open MPI (4.0.3) using the procedure described here
<https://www.open-mpi.org/faq/?category=building>, as I'm trying to use
Horovod for multi-GPU acceleration.

I am looking for a way to handle a keyboard interrupt (save a deep learning
model before shutting everything down). I posted a question here
<https://github.com/horovod/horovod/issues/1903>.

I have seen this thread
<https://www.mail-archive.com/users@lists.open-mpi.org/msg26892.html>,
which is inconclusive, and this specific message
<https://www.mail-archive.com/users@lists.open-mpi.org/msg26894.html> which
is really the exact situation I'm in.
I've also seen that this earlier one
<https://www.mail-archive.com/users@lists.open-mpi.org/msg31805.html>
mentions that SIGINT is received, although strangely enough, when I tried to
print the signal I got SIGCONT instead. The result is the same as above
anyway: my subprocesses just stop without any handling.

I'm wondering if there is a way of delaying the shutdown of my GPU
processes so I can save the latest state of the model. That would be
very practical.
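
To make the question concrete, here is a minimal sketch of the kind of
handling I have in mind. The names (request_stop, install_handlers, train,
save_checkpoint) are hypothetical and only for illustration, and the sketch
assumes each worker actually receives SIGINT or SIGTERM and is given enough
time before being killed, which is exactly the part I'm unsure about under
mpirun.

```python
import signal
import threading

# Set by the signal handler, polled by the training loop.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: only record the request, do no heavy work here."""
    stop_requested.set()


def install_handlers():
    """Register the handler (must be called from the main thread)."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, request_stop)


def train(num_steps, save_checkpoint):
    """Hypothetical training loop that checkpoints once a stop is requested.

    Returns the step at which it stopped, or num_steps if it ran to the end.
    """
    for step in range(num_steps):
        # ... one training step would go here ...
        if stop_requested.is_set():
            save_checkpoint(step)  # placeholder for the real model save
            return step
    return num_steps
```

The handler only sets a flag and the save happens in the loop itself, so the
model is never written from inside a signal handler in the middle of a step.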

Many thanks in advance for your help,
Jeremie
