Thanks for answering. I tested again, this time on a real cluster where I
can reboot the machines at will. I ran a test on 32 machines with one MPI
process per machine, and during the execution I rebooted one of the
machines. I found the same behavior: OpenMPI de
No, that certainly isn’t the normal behavior. I suspect it has to do with the
nature of the VM TCP connection, though there is something very strange about
your output. The BTL message indicates that an MPI job is already running. Yet
your subsequent ORTE error message indicates we are still try
Hello,
I was studying how OpenMPI reacts to failures. I have a virtual
infrastructure where failures can be emulated by turning off a given VM.
Depending on how the VM is turned off, 'mpirun' is notified either because
it receives a signal or because some timeout is reached.
In bo