Re: [OMPI users] Processes stuck after MPI_Waitall() in 1.4.1

2010-07-28 Thread Brian Smith
I've used the TCP btl and it works fine. It's only with the openib btl that I have issues. I also have a set of nodes that uses qib and the psm MTL; that works fine as well. I'll try adjusting the rendezvous limit and message-size settings as well as the collective algorithm options and see if that helps. Many thanks
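For reference, the knobs being discussed here are Open MPI MCA parameters passed to mpirun; the parameter values and the application name below are illustrative assumptions, not settings recommended anywhere in this thread:

  # pin the tuned alltoall to a fixed algorithm (value 1 chosen arbitrarily)
  mpirun --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_alltoall_algorithm 1 \
         -np 16 ./my_app

  # adjust the openib eager/rendezvous message-size limits (example values only)
  mpirun --mca btl_openib_eager_limit 32768 \
         --mca btl_openib_max_send_size 65536 \
         -np 16 ./my_app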

Re: [OMPI users] Processes stuck after MPI_Waitall() in 1.4.1

2010-07-28 Thread Terry Dontje
Here are a couple of other suggestions: 1. Have you tried your code using the TCP btl, just to make sure this isn't a general algorithm issue with the collective? 2. While using the openib btl, you may want to try things with RDMA turned off by passing the following parameters to mpirun …
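Concretely, those two suggestions translate into mpirun invocations along these lines (the application name and process count are placeholders):

  # 1. bypass InfiniBand entirely and run over TCP + shared memory
  mpirun --mca btl tcp,sm,self -np 16 ./my_app

  # 2. keep the openib btl but restrict it to send/recv, i.e. no RDMA
  mpirun --mca btl openib,sm,self --mca btl_openib_flags 1 -np 16 ./my_app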

Re: [OMPI users] Processes stuck after MPI_Waitall() in 1.4.1

2010-07-27 Thread Terry Dontje
With this earlier failure, do you know how many messages may have been transferred between the two processes? Is there a way to narrow this down to a small piece of code? Do you have TotalView or DDT at your disposal? --td Brian Smith wrote: Also, the application I'm having trouble with appears …

Re: [OMPI users] Processes stuck after MPI_Waitall() in 1.4.1

2010-07-27 Thread Brian Smith
Also, the application I'm having trouble with appears to work fine with MVAPICH2 1.4.1, if that is any help. -Brian On Tue, 2010-07-27 at 10:48 -0400, Terry Dontje wrote: > Can you try a simple point-to-point program? > > --td > > Brian Smith wrote: > > After running on two processors across two nodes …

Re: [OMPI users] Processes stuck after MPI_Waitall() in 1.4.1

2010-07-27 Thread Brian Smith
Hi, Terry, I just ran through the entire gamut of OSU/OMB tests -- osu_bibw, osu_latency, osu_multi_lat, osu_bw, osu_alltoall, osu_mbw_mr, osu_bcast -- on various nodes on one of our clusters (at least two nodes per job) w/ version 1.4.2 and OFED 1.5 (executables and MPI compiled w/ gcc 4.4.2) and haven't …
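For anyone trying to reproduce these runs, the OSU micro-benchmarks are typically launched across a node pair roughly like this (hostnames are placeholders):

  mpirun -np 2 --host node01,node02 --mca btl openib,self ./osu_latency
  mpirun -np 2 --host node01,node02 --mca btl openib,self ./osu_bw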

Re: [OMPI users] Processes stuck after MPI_Waitall() in 1.4.1

2010-07-27 Thread Terry Dontje
Can you try a simple point-to-point program? --td Brian Smith wrote: After running on two processors across two nodes, the problem occurs much earlier during execution:
  (gdb) bt
  #0 opal_sys_timer_get_cycles () at ../opal/include/opal/sys/amd64/timer.h:46
  #1 opal_timer_base_get_cycles () at ../opal/mca/timer/linux/timer_linux.h:31
  #2 opal_progress () at runtime/o…
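A minimal point-to-point test of the kind being requested could look like the sketch below (message size and iteration count are arbitrary choices):

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      const int count = 1 << 20;   /* number of ints per message */
      const int iters = 100;
      int rank, i;
      int *buf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      buf = malloc(count * sizeof(int));

      for (i = 0; i < iters; i++) {
          if (rank == 0) {
              MPI_Send(buf, count, MPI_INT, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, count, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              printf("ping-pong iteration %d done\n", i);
          } else if (rank == 1) {
              MPI_Recv(buf, count, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              MPI_Send(buf, count, MPI_INT, 0, 0, MPI_COMM_WORLD);
          }
      }

      free(buf);
      MPI_Finalize();
      return 0;
  }

Run with two ranks across the two nodes in question, e.g. mpirun -np 2 --host node01,node02 --mca btl openib,self ./pingpong (hostnames are placeholders).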

Re: [OMPI users] Processes stuck after MPI_Waitall() in 1.4.1

2010-07-27 Thread Brian Smith
After running on two processors across two nodes, the problem occurs much earlier during execution:
  (gdb) bt
  #0 opal_sys_timer_get_cycles () at ../opal/include/opal/sys/amd64/timer.h:46
  #1 opal_timer_base_get_cycles () at ../opal/mca/timer/linux/timer_linux.h:31
  #2 opal_progress () at runtime/o…
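For anyone wanting to capture the same kind of backtrace from a stuck rank, attaching gdb to the running process on the affected node is enough (the PID is whatever ps reports for the hung MPI process):

  gdb -p <pid-of-stuck-rank>
  (gdb) bt                    # backtrace of the current thread
  (gdb) thread apply all bt   # backtraces of every thread, in case others are blocked elsewhere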

Re: [OMPI users] Processes stuck after MPI_Waitall() in 1.4.1

2010-07-27 Thread Brian Smith
Both 1.4.1 and 1.4.2 exhibit the same behavior w/ OFED 1.5. It wasn't OFED 1.4 after all (after some more digging through our update logs). All of the ibv_*_pingpong tests appear to work correctly. I'll try running a few more tests (np=2 over two nodes, some of the OSU benchmarks, etc.)
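The ibv_*_pingpong utilities mentioned here ship with libibverbs/OFED; a typical two-node check looks like this (hostnames are placeholders):

  # on node01 (server side, started first)
  ibv_rc_pingpong
  # on node02 (client side, pointing at the server)
  ibv_rc_pingpong node01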

Re: [OMPI users] Processes stuck after MPI_Waitall() in 1.4.1

2010-07-27 Thread Terry Dontje
A clarification on your previous email: you had your code working with OMPI 1.4.1 but an older version of OFED? Then you upgraded to OFED 1.4 and things stopped working? Sounds like your current system is set up with OMPI 1.4.2 and OFED 1.5. Anyway, I am a little confused as to when things …

Re: [OMPI users] Processes stuck after MPI_Waitall() in 1.4.1

2010-07-26 Thread Brian Smith
In case my previous e-mail is too vague for anyone to address, here's a backtrace from my application. This version, compiled with Intel 11.1.064 (OpenMPI 1.4.2 w/ gcc 4.4.2), hangs during MPI_Alltoall instead. Running on 16 CPUs, Opteron 2427, Mellanox Technologies MT25418 w/ OFED 1.5. strace on …
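For reference, strace is usually attached to an already-hung rank like this (the PID is a placeholder):

  strace -f -p <pid-of-hung-rank>
  # a rank busy-polling inside opal_progress typically shows either a tight loop of
  # poll()/epoll_wait() calls or no system calls at all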

[OMPI users] Processes stuck after MPI_Waitall() in 1.4.1

2010-07-21 Thread Brian Smith
Hi, All, A couple of applications that I'm using -- VASP and Charmm -- end up "stuck" (for lack of a better word) during an MPI_Waitall call after some non-blocking send/recv activity. This only happens when utilizing the openib btl. I've followed a couple of bugs where this seemed to happen in some pr…
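For context, the communication pattern being described -- non-blocking sends and receives followed by a wait -- is roughly the following sketch (not the actual VASP/Charmm code; the neighbor exchange and message size are made up for illustration):

  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      const int count = 262144;   /* arbitrary message size */
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      double *sendbuf = malloc(count * sizeof(double));
      double *recvbuf = malloc(count * sizeof(double));
      int right = (rank + 1) % size;
      int left  = (rank + size - 1) % size;
      MPI_Request req[2];

      /* post a non-blocking receive and send to neighboring ranks ... */
      MPI_Irecv(recvbuf, count, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
      MPI_Isend(sendbuf, count, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

      /* ... and wait for both -- this is the call the applications get stuck in */
      MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

      free(sendbuf);
      free(recvbuf);
      MPI_Finalize();
      return 0;
  }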