Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-16 Thread Thomas Klimpel
Ralph wrote: > We found a locking error in vader - this has been fixed in the OMPI master and will be in the 1.8.5 nightly tarball tomorrow. I tested with the nightly tarball now. The deadlocks are fixed. Thanks! The warning [warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-09 Thread Thomas Klimpel
I have now tried 1.8.5rc1. It behaves very similarly to 1.8.4 from my point of view (and completely differently from 1.6.5). The warning [warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once. is still there. It's easy for me to (re)prod

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-07 Thread Thomas Klimpel
Here is a stackdump from inside the debugger (because it gives filenames and line numbers): Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7f1eb6bfd700 (LWP 24847)] 0x00366aa79252 in _int_malloc () from /lib64/libc.so.6 (gdb) bt #0 0x00366aa79252 in _int_mallo

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-02 Thread Thomas Klimpel
The normal crash without ctrl-Z can produce different stackdumps. With ctrl-Z, the stackdump nearly always looks as follows: (In the debugger, I get source files and line numbers, so I guess it is built with debug info) [wam-r02c01b02:19183] *** Process received signal *** [wam-r02c01b02:19183] S

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-01 Thread Thomas Klimpel
> You might double-check by running with "--mca btl ^openib" to see if that is the source of the warning The warning always appears, independent of the interconnect, even when running with "--mca btl ^openib". > Does it only crash when you pause it? Or does it crash while normally running?

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-01 Thread Thomas Klimpel
> 2. Unable to resolve: can you be more specific on this? This was my mistake. I used "xxx.yyy.zzz" instead of "localhost" in the startup options for orterun. (More precisely the GUI did it, but I knew that code.) No idea how 1.6.5 managed to get around the fact that not even "dig xxx.yyy.zzz" can

[OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-03-31 Thread Thomas Klimpel
I have a nasty bug in my software and can make it crash by stopping it with ctrl-Z, waiting many seconds, and then typing "fg" to continue the run. At least it crashes when I start it on 3 workers with 24 instances each plus a master. On 2 workers with 24 instances each it doesn't crash. I dec