Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-17 Thread Ralph Castain
Just FYI: I’ve submitted a pull request to silence that annoying warning :-)

> On Apr 16, 2015, at 12:02 PM, Thomas Klimpel wrote:
> Ralph wrote:
>> We found a locking error in vader - this has been fixed in the OMPI master and will be in the 1.8.5 nightly tarball tomorrow.
> I tes…

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-16 Thread Thomas Klimpel
Ralph wrote:
> We found a locking error in vader - this has been fixed in the OMPI master and will be in the 1.8.5 nightly tarball tomorrow.

I tested with the nightly tarball now. The deadlocks are fixed. Thanks! The warning

[warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one…

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-14 Thread Ralph Castain
We found a locking error in vader - this has been fixed in the OMPI master and will be in the 1.8.5 nightly tarball tomorrow. Thanks!
Ralph

> On Apr 9, 2015, at 1:26 PM, Thomas Klimpel wrote:
>
> I tried 1.8.5rc1 now. It behaves very similar to 1.8.4 from my point of view (and completely di…

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-09 Thread Thomas Klimpel
I tried 1.8.5rc1 now. It behaves very similar to 1.8.4 from my point of view (and completely different from 1.6.5). The warning

[warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.

is still there. It's easy for me to (re)prod…

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-08 Thread Ralph Castain
Hmmm… could you try 1.8.5rc1? We’ve done some thread-related stuff on it, but we may not have solved this level of use just yet. We are working on the new 1.9 series that we hope to make more thread-friendly.

http://www.open-mpi.org/software/ompi/v1.8/

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-07 Thread Thomas Klimpel
Here is a stackdump from inside the debugger (because it gives filenames and line numbers):

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f1eb6bfd700 (LWP 24847)]
0x00366aa79252 in _int_malloc () from /lib64/libc.so.6
(gdb) bt
#0  0x00366aa79252 in _int_mallo…

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-02 Thread Thomas Klimpel
The normal crash without ctrl-Z can produce different stackdumps. With ctrl-Z, the stackdump nearly always looks as follows (in the debugger, I get source files and line numbers, so I guess it is built with debug info):

[wam-r02c01b02:19183] *** Process received signal ***
[wam-r02c01b02:19183] S…

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-01 Thread Ralph Castain
Would it be possible to get a backtrace from one of the crashes? It would be especially helpful if you can add --enable-debug to the OMPI config.

On Wed, Apr 1, 2015 at 1:09 PM, Thomas Klimpel wrote:
>> You might double-check by running with "--mca btl ^openib" to see if that is the source o…
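The request above amounts to two steps: rebuild Open MPI with debug symbols, then reproduce the crash under gdb and print a backtrace. A minimal sketch of the commands involved, built up as strings for illustration; the install prefix, application name, and core-file name are assumptions, not taken from the thread:

```shell
# Rebuild Open MPI with debugging symbols (prefix is an assumption):
configure_cmd='./configure --prefix=$HOME/ompi-debug --enable-debug'

# Post-mortem backtrace from a core file (app and core names are assumptions);
# "-ex bt" runs the backtrace command, "-ex quit" exits gdb afterwards:
backtrace_cmd='gdb ./my_app core.12345 -ex bt -ex quit'

echo "$configure_cmd"
echo "$backtrace_cmd"
```

With --enable-debug, the backtrace resolves Open MPI's internal frames to source files and line numbers instead of bare addresses, which is what makes it useful for the developers here.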

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-01 Thread Thomas Klimpel
> You might double-check by running with "--mca btl ^openib" to see if that is the source of the warning

The warning appears always, independent of the interconnect, and even when running with "--mca btl ^openib".

> Does it only crash when you pause it? Or does it crash while normally running?
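For readers following along: in the "--mca btl ^openib" syntax, the leading "^" excludes a component, i.e. run with every available BTL except openib. A sketch of the invocation, again as an illustrative string; the application name and rank count are assumptions:

```shell
# "^openib" means: use all BTL components except openib.
# To instead allow only an explicit list, name the components without "^",
# e.g. --mca btl self,sm,tcp  (mixing "^" and a positive list is not allowed).
run_cmd='mpirun --mca btl ^openib -np 4 ./my_app'
echo "$run_cmd"
```

This is a quick way to bisect which transport is producing a warning or crash: if the symptom survives with openib excluded, as Thomas reports above, the openib BTL is ruled out as the cause.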

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-01 Thread Ralph Castain
I know 1.8.4 is better than 1.6.5 in some regards, but I obviously can't say if we fixed the specific bug you're referring to in your software. As you know, thread bugs are really hard to nail down. That event_base_loop warning could be flagging a known problem in the openib module during inter-pr…

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-01 Thread Thomas Klimpel
> 2. Unable to resolve: can you be more specific on this?

This was my mistake. I used "xxx.yyy.zzz" instead of "localhost" in the startup options for orterun. (More precisely, the GUI did it, but I knew that code.) No idea how 1.6.5 managed to get around the fact that not even "dig xxx.yyy.zzz" can…

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-03-31 Thread Jeff Squyres (jsquyres)
You might want to check out these blog entries about the tree-based launcher in Open MPI for a little background:

http://blogs.cisco.com/performance/tree-based-launch-in-open-mpi
http://blogs.cisco.com/performance/tree-based-launch-in-open-mpi-part-2

Your mail describes several issues; l…

Re: [OMPI users] 1.8.4

2014-11-12 Thread Ralph Castain
I was going to send something out to the list today anyway - will do so now.

> On Nov 12, 2014, at 6:58 AM, Jeff Squyres (jsquyres) wrote:
> On Nov 12, 2014, at 9:53 AM, Ray Sheppard wrote:
>> Thanks, and sorry to blast my little note out to the list. I guess your mail address is…

Re: [OMPI users] 1.8.4

2014-11-12 Thread Jeff Squyres (jsquyres)
On Nov 12, 2014, at 9:53 AM, Ray Sheppard wrote:
> Thanks, and sorry to blast my little note out to the list. I guess your mail address is now aliased to the mailing list in my mail client. :-)

No worries; I'm sure this is a question on other people's minds, too.

--
Jeff Squyres
jsquy...@

Re: [OMPI users] 1.8.4

2014-11-12 Thread Ray Sheppard
Thanks, and sorry to blast my little note out to the list. I guess your mail address is now aliased to the mailing list in my mail client.
Ray

On 11/12/2014 9:41 AM, Jeff Squyres (jsquyres) wrote:
> We have 2 critical issues left that need fixing (a THREAD_MULTIPLE/locking issue and a shmem is…

Re: [OMPI users] 1.8.4

2014-11-12 Thread Jeff Squyres (jsquyres)
We have 2 critical issues left that need fixing (a THREAD_MULTIPLE/locking issue and a shmem issue). There's active work progressing on both. I think we'd love to say it would be ready by SC, but I know that a lot of us -- myself included -- are fighting to meet our own SC deadlines. Ralph Cas…