You might want to check out these blog entries about the tree-based launcher in Open MPI for a little background:

http://blogs.cisco.com/performance/tree-based-launch-in-open-mpi
http://blogs.cisco.com/performance/tree-based-launch-in-open-mpi-part-2

Your mail describes several issues; let's break them down:

1. Ctrl-Z issues.  For the moment "don't do that".  The launch should be fast
enough that you shouldn't need to pause the launch and re-start it later.
Meaning: let's solve the other issues first, and come back to the ctrl-Z
issues later.

2. Unable to resolve: can you be more specific on this?  Do you know if OMPI
successfully launched its helper daemons ("orted") on some, but not all of the
servers in question?  Can you send the full command and output that you're
executing?

3. Host key verification failed: this likely means an ssh misconfiguration
somewhere on your machines.  I.e., one server is trying to ssh to another
server, and the ssh host key verification between those two servers fails.
You might want to check that all your host keys and host key caches are
correct between all machines.


> On Mar 31, 2015, at 3:27 PM, Thomas Klimpel <jacques.gent...@gmail.com> wrote:
>
> I have a nasty bug in my software and can make it crash by stopping it with
> ctrl-Z, waiting many seconds, and then saying "fg", for continuing the run.
> At least it crashes when I start it on 3 workers with 24 instances each plus
> a master. On 2 workers with 24 instances each it doesn't crash.
>
> I decided to upgrade openmpi from 1.6.5 to 1.8.4 for a test, to see if it
> changes anything. But 1.8.4 behaves completely different from 1.6.5, at least
> when I change the number of instances or the number of workers. If I start it
> with 3 workers plus a master, I get the error message
>
>
> Host key verification failed.
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
>
>
> If I start with 3 workers but no separate master, or 2 workers and a separate
> master, the software seems to run fine, but after some time I get the warning
> (with no further consequences)
>
>
> [warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one
> event_base_loop can run on each event_base at once.
>
>
> If I start with completely different command line options (of my software)
> and a GUI I always get (even if I only start just a single worker on the
> master, i.e. no separate workers)
>
>
> ssh: Could not resolve hostname xxx.yyy.zzz: Name or service not known
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
>
>
> This last message certainly contains some truth, because even "dig
> xxx.yyy.zzz" cannot resolve this hostname (not even on the master itself),
> and xxx.yyy.zzz is indeed the hostname of the master.
>
>
> But the differences don't stop here. If I press ctrl-Z, the message
> "^Zorterun: Forwarding signal 20 to job" appears, and the shell freezes until
> I press ctrl-C. I work around this now by sending the master a SIGSTOP signal
> via "kill -SIGSTOP 12345", and after many seconds a "kill -SIGCONT 12345".
> Still, the behavior of 1.8.4 is completely different from 1.6.5 here too.
>
> Another difference is that 1.8.4 often complains that I have requested too
> many workers (or that some libnuma would be missing), which I fix by adding
> --bind-to socket:overload-allowed.
>
>
> I have at least one question: Is 1.8.4 really so different from 1.6.5, or is
> it more probable that I made some mistake while building 1.8.4, for example
> using wrong configure options, maybe because these changed since 1.6.5, and I
> haven't bothered to read the documentation and just continued to use my
> following (old from 1.6.5) configure options:
>
> --enable-mpi-f77=no --enable-mpi-f90=no --with-threads=posix
> --enable-mpi-thread-multiple --disable-vt
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/03/26585.php


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
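
P.S. For points 2 and 3 above, a rough way to narrow things down is to check, on every machine, that each hostname in the job resolves and that a non-interactive ssh to it succeeds. Below is a minimal Python sketch of that idea, not anything Open MPI itself provides. The function names (resolves, unresolved, ssh_probe_command) are made up for illustration, and you would substitute your own host list. It relies only on the standard socket module and on the ssh option BatchMode=yes, which makes ssh fail instead of prompting.

```python
import socket


def resolves(host, lookup=socket.getaddrinfo):
    """Return True if `host` can be resolved, False otherwise.

    `lookup` defaults to the system resolver; it is a parameter only so
    the function can be exercised without touching the network.
    """
    try:
        lookup(host, None)
    except socket.gaierror:
        # Same class of failure ssh reports as
        # "Could not resolve hostname ...: Name or service not known".
        return False
    return True


def unresolved(hosts, lookup=socket.getaddrinfo):
    """Return the hosts from `hosts` that do not resolve, in order."""
    return [h for h in hosts if not resolves(h, lookup)]


def ssh_probe_command(host):
    """Build an ssh command that runs `true` on `host` without prompting.

    With BatchMode=yes, a host key or authentication problem makes ssh
    exit with an error (e.g. "Host key verification failed.") rather
    than waiting for input.
    """
    return ["ssh", "-o", "BatchMode=yes", host, "true"]
```

Run unresolved() with the names from your hostfile (including the master's own name), and run the command from ssh_probe_command() by hand or via subprocess. Do this from each node and not only from the master, since with a tree-based launch one server may be the one that ssh's to another.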