I have a nasty bug in my software: I can make it crash by suspending it
with ctrl-Z, waiting many seconds, and then typing "fg" to continue the run.
At least it crashes when I start it on 3 workers with 24 instances each
plus a master. On 2 workers with 24 instances each it doesn't crash.

I decided to upgrade openmpi from 1.6.5 to 1.8.4 for a test, to see if it
changes anything. But 1.8.4 behaves completely differently from 1.6.5, at
least when I change the number of instances or the number of workers. If I
start it with 3 workers plus a master, I get the error message


Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.


If I start with 3 workers but no separate master, or 2 workers and a
separate master, the software seems to run fine, but after some time I get
the warning (with no further consequences)


[warn] opal_libevent2021_event_base_loop: reentrant invocation.  Only one
event_base_loop can run on each event_base at once.


If I start it with completely different command-line options (of my
software) and a GUI, I always get (even if I start just a single worker on
the master, i.e. no separate workers)


ssh: Could not resolve hostname xxx.yyy.zzz: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.


This last message certainly contains some truth, because even "dig
xxx.yyy.zzz" cannot resolve this hostname (not even on the master itself),
and xxx.yyy.zzz is indeed the hostname of the master.
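
Note that dig only queries DNS, whereas ssh goes through the system
resolver, which also reads /etc/hosts. A minimal sketch of a check through
that resolver, assuming a glibc-based Linux where getent is available; the
function name "resolves" is made up for illustration:

```shell
#!/bin/sh
# Return success if NAME resolves through the system resolver (NSS),
# which consults /etc/hosts as well as DNS. "resolves" is a hypothetical
# helper name, not part of any tool.
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# Check the name this machine reports for itself.
name=$(hostname)
if resolves "$name"; then
    echo "$name resolves"
else
    echo "$name does not resolve; an /etc/hosts entry may be missing"
fi
```

If this reports that the master's own name does not resolve, the ssh error
above would be expected whenever that name is used to launch daemons.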


But the differences don't stop here. If I press ctrl-Z, the message
"^Zorterun: Forwarding signal 20 to job" appears, and the shell freezes
until I press ctrl-C. I work around this now by sending the master a
SIGSTOP signal via "kill -SIGSTOP 12345", and after many seconds a "kill
-SIGCONT 12345". Still, the behavior of 1.8.4 is completely different from
1.6.5 here too.
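
The workaround can be wrapped in a small helper, sketched here under the
assumption of a POSIX shell; the function name "pause_resume" is made up
for illustration, and the PID and delay are whatever applies to the run:

```shell
#!/bin/sh
# Suspend a process, wait, then resume it, without going through the
# terminal's ctrl-Z handling. "pause_resume" is a hypothetical helper name.
# Usage: pause_resume PID SECONDS
pause_resume() {
    pid=$1
    delay=$2
    kill -s STOP "$pid" || return 1   # suspend; fails if PID does not exist
    sleep "$delay"                    # keep the process stopped for a while
    kill -s CONT "$pid"               # resume
}
```

For example, "pause_resume 12345 30" would stop process 12345 for 30
seconds and then continue it.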

Another difference is that 1.8.4 often complains that I have requested too
many workers (or that libnuma is missing), which I fix by adding
--bind-to socket:overload-allowed.


I have at least one question: Is 1.8.4 really so different from 1.6.5, or
is it more probable that I made some mistake while building 1.8.4, for
example by using wrong configure options? These may have changed since
1.6.5; I haven't read the documentation and just continued to use the
following configure options, carried over from 1.6.5:

--enable-mpi-f77=no --enable-mpi-f90=no --with-threads=posix
--enable-mpi-thread-multiple --disable-vt
