I've used the TCP btl and it works fine. It's only with the openib btl
that I have issues. I also have a set of nodes that use qib and PSM, and
that MTL works fine as well. I'll try adjusting the rendezvous limit and
message-size settings, as well as the collective algorithm options, and
see if that helps.
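For reference, the kind of thing I have in mind is roughly the following
(a sketch only -- the parameter names assume the 1.4.x openib BTL and the
tuned collectives module, the values are just starting points, and
./my_app stands in for the real binary):

mpirun -np 16 \
    --mca btl_openib_eager_limit 4096 \
    --mca btl_openib_rndv_eager_limit 4096 \
    --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_alltoall_algorithm 1 \
    ./my_app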
Many thanks
Here are a couple other suggestions:
1. Have you tried running your code with the TCP btl, just to make sure
this isn't a more general algorithm issue with the collective?
2. While using the openib btl, you may want to try things with RDMA
turned off by passing parameters along the lines sketched below to mpirun.
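For example, something like this (a sketch; the parameter names assume
the 1.4.x openib BTL, ./my_app is a placeholder, and if I remember the
flag bits right a btl_openib_flags value of 1 restricts the BTL to
send/recv, i.e. no RDMA put/get):

mpirun -np 16 --mca btl openib,self,sm \
    --mca btl_openib_use_eager_rdma 0 \
    --mca btl_openib_flags 1 \
    ./my_app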
With this earlier failure, do you know how many messages may have been
transferred between the two processes? Is there a way to narrow this
down to a small piece of code? Do you have TotalView or DDT at your
disposal?
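If it helps, a stripped-down reproducer along the lines of what you've
described could look something like this (a sketch only -- NMSG and COUNT
are arbitrary, and the even/odd rank pairing is just for illustration):

/* waitall_test.c -- minimal nonblocking send/recv + MPI_Waitall exerciser.
 * Pairs rank 2i with rank 2i+1; each pair exchanges NMSG messages of
 * COUNT doubles and then waits on all requests at once. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NMSG  64
#define COUNT 65536   /* large enough to hit the rendezvous path */

int main(int argc, char **argv)
{
    int rank, size, peer, i;
    double *sbuf, *rbuf;
    MPI_Request reqs[2 * NMSG];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    peer = (rank % 2 == 0) ? rank + 1 : rank - 1;
    if (peer >= size) {          /* odd process count: last rank sits out */
        MPI_Finalize();
        return 0;
    }

    sbuf = malloc((size_t)NMSG * COUNT * sizeof(double));
    rbuf = malloc((size_t)NMSG * COUNT * sizeof(double));
    for (i = 0; i < NMSG * COUNT; i++)
        sbuf[i] = (double)rank;

    for (i = 0; i < NMSG; i++) {
        MPI_Irecv(rbuf + (size_t)i * COUNT, COUNT, MPI_DOUBLE, peer, i,
                  MPI_COMM_WORLD, &reqs[i]);
        MPI_Isend(sbuf + (size_t)i * COUNT, COUNT, MPI_DOUBLE, peer, i,
                  MPI_COMM_WORLD, &reqs[NMSG + i]);
    }
    MPI_Waitall(2 * NMSG, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0)
        printf("waitall completed\n");

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}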
--td
Brian Smith wrote:
Also, the application I'm having trouble with appears to work fine with
MVAPICH2 1.4.1, if that is any help.
-Brian
On Tue, 2010-07-27 at 10:48 -0400, Terry Dontje wrote:
> Can you try a simple point-to-point program?
>
> --td
>
> Brian Smith wrote:
> > After running on two processors across two nodes, the problem occurs
> > much earlier during execution.
Hi, Terry,
I just ran through the entire gamut of OSU/OMB tests -- osu_bibw,
osu_latency, osu_multi_lat, osu_bw, osu_alltoall, osu_mbw_mr, osu_bcast --
on various nodes on one of our clusters (at least two nodes per job) w/
version 1.4.2 and OFED 1.5 (executables and MPI compiled w/ gcc 4.4.2)
and haven't been able to reproduce the problem.
Can you try a simple point-to-point program?
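Something as small as this would do (a sketch; the buffer size and
iteration count are arbitrary):

/* pingpong.c -- simplest possible two-rank blocking ping-pong. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size, i, n = 1000;
    char buf[65536];
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    memset(buf, 0, sizeof(buf));

    if (size >= 2) {
        for (i = 0; i < n; i++) {
            if (rank == 0) {
                MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("pingpong done\n");
    }
    MPI_Finalize();
    return 0;
}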
--td
Brian Smith wrote:
After running on two processors across two nodes, the problem occurs
much earlier during execution:
(gdb) bt
#0 opal_sys_timer_get_cycles ()
at ../opal/include/opal/sys/amd64/timer.h:46
#1 opal_timer_base_get_cycles ()
at ../opal/mca/timer/linux/timer_linux.h:31
#2 opal_progress () at runtime/o
Both 1.4.1 and 1.4.2 exhibit the same behavior w/ OFED 1.5. It wasn't
OFED 1.4 after all (after some more digging through our update logs).
All of the ibv_*_pingpong tests appear to work correctly. I'll try
running a few more tests (np=2 over two nodes, some of the OSU
benchmarks, etc.).
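For the two-node runs I'm planning on something along these lines (the
host names are placeholders; the btl list assumes openib plus self):

mpirun -np 2 --host node01,node02 --mca btl openib,self ./osu_latency
mpirun -np 2 --host node01,node02 --mca btl openib,self ./osu_bw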
A clarification on your previous email: you had your code working with
OMPI 1.4.1 but an older version of OFED? Then you upgraded to OFED 1.4
and things stopped working? It sounds like your current system is set up
with OMPI 1.4.2 and OFED 1.5. Anyway, I am a little confused as to
when things actually stopped working.
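If it's easier, the output of something like the following would pin the
versions down (assuming ompi_info and ofed_info are on your path):

ompi_info | grep "Open MPI:"    # Open MPI version actually being picked up
ofed_info | head -1             # installed OFED release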
In case my previous e-mail is too vague for anyone to address, here's a
backtrace from my application. This version, compiled with Intel
11.1.064 (OpenMPI 1.4.2 w/ gcc 4.4.2), hangs during MPI_Alltoall
instead. Running on 16 CPUs, Opteron 2427, Mellanox Technologies
MT25418 w/ OFED 1.5
strace on
Hi, All,
A couple of applications that I'm using -- VASP and CHARMM -- end up
"stuck" (for lack of a better word) in a waitall call after some
non-blocking send/recv activity. This only happens when using the
openib btl. I've followed a couple of bugs where this seemed to happen
in some pr