I know 1.8.4 is better than 1.6.5 in some regards, but I obviously can't
say if we fixed the specific bug you're referring to in your software. As
you know, thread bugs are really hard to nail down.

That event_base_loop warning could be flagging a known problem in the
openib module during inter-process connection formation. It's been on our
radar for a while, but we've lacked the cycles to resolve it. You might
double-check by running with "--mca btl ^openib" to see if that is the
source of the warning. I know it will run a lot slower, but you *might* get
an indication as to whether or not this is the issue.
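
As a rough sketch of that check: the application name and process count
below are placeholders I made up, so substitute whatever you normally pass
to mpirun/orterun. The only part that matters is the "--mca btl ^openib"
option, which excludes the openib BTL so Open MPI has to pick from the
remaining transports.

```shell
# Placeholders (assumptions, not taken from your setup): adjust to your job.
APP=./my_app
NP=4

# "^openib" tells Open MPI to exclude the openib BTL; the leading "^" means
# "everything except". Quoted so that no shell treats "^" specially.
CMD="mpirun --mca btl ^openib -np $NP $APP"

# Print the command first so you can check it before running it for real.
echo "$CMD"
```

If the reentrant event_base_loop warning disappears with openib excluded,
that points at the openib connection setup rather than at your code.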

Does it only crash when you pause it? Or does it crash while normally
running?


On Wed, Apr 1, 2015 at 12:09 PM, Thomas Klimpel <jacques.gent...@gmail.com>
wrote:

> > 2. Unable to resolve: can you be more specific on this?
>
> This was my mistake. I used "xxx.yyy.zzz" instead of "localhost" in the
> startup options for orterun. (More precisely the GUI did it, but I knew
> that code.) No idea how 1.6.5 managed to get around the fact that not even
> "dig xxx.yyy.zzz" can resolve this hostname. All the other servers were
> specified by their ip address, so no need to resolve anything there.
>
>
> > 3. Host key verification failed: this likely means an ssh
> > misconfiguration somewhere on your machines.
>
> You are right, only the master could do a passwordless ssh to the
> workers, but the workers could not do a passwordless ssh to the master (or
> to any other worker). I manually enabled this between 3 selected workers,
> and checked that everything then worked fine. But my method to enable this
> manually is time-consuming, so now I use "-mca plm_ssh_no_tree_spawn 1"
> as an option to orterun instead.
>
> Thanks for the help. This enabled me to do the tests I wanted to do.
>
>
> > 1. Ctrl-Z issues. For the moment "don't do that".
>
> As said, I now use "kill -SIGSTOP 12345" instead. Even if the shell did
> not freeze and orterun stopped (after first forwarding the signal to
> all workers, which seems to be the most reasonable behavior to me), I would
> still have to use "kill -SIGSTOP 12345" (because I don't want to pause the
> workers, only the master). I verified that this triggers the crash reliably
> for me with 1.6.5.
>
> I cannot reproduce my crash with 1.8.4, but I'm not sure what I learn from
> this. Maybe the new "[warn] opal_libevent2021_event_base_loop: reentrant
> invocation. Only one event_base_loop can run on each event_base at once."
> warning tries to tell me that I'm using MPI_THREAD_MULTIPLE incorrectly.
> But I radically simplified my mpi calls for this test now, such that I only
> call MPI_Send and MPI_Recv, and only on MPI_COMM_WORLD. But I still get the
> warning with 1.8.4, and still can produce my crash with 1.6.5, and still
> cannot reproduce my crash with 1.8.4. Is it really possible that
> MPI_THREAD_MULTIPLE had a bug in 1.6.5 (the clusters where this bug can be
> triggered have an infiniband interconnect), which is fixed in 1.8.4?
>
> I still fear that the bug is somewhere else in my software (because of the
> history of this bug and how hard it often was to trigger it in the past).
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/04/26591.php
>
