Ok, I finally was able to get on and run some OFED tests - it looks to
me like I must have something configured wrong with the QLogic cards,
but I have no idea what.
Mellanox to QLogic:
ibv_rc_pingpong n15
local address: LID 0x0006, QPN 0x240049, PSN 0x87f83a, GID ::
remote address: LID
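(For reference, the full pairing looks roughly like this -- the device
names here are assumptions for this setup:

    # on n15 (QLogic side) -- start the listener
    ibv_rc_pingpong -d qib0

    # on the Mellanox node -- connect to n15
    ibv_rc_pingpong -d mlx4_0 n15

If both ends print their local/remote address lines followed by the
bandwidth and latency summary, the raw RC path between the two HCA
types is at least functional.)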
Interesting.
Try with the native OFED benchmarks -- i.e., get MPI out of the way and see if
the raw/native performance of the network between the devices reflects the same
dichotomy.
(e.g., ibv_rc_pingpong)
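The perftest tools that ship with OFED give raw latency and bandwidth
numbers directly as well; a minimal sketch, with hostnames assumed:

    # latency: start the server side first, then the client pointing at it
    ib_send_lat              # on n15
    ib_send_lat n15          # on the Mellanox node

    # bandwidth
    ib_write_bw              # on n15
    ib_write_bw n15          # on the Mellanox node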
On Jul 15, 2011, at 7:58 PM, David Warren wrote:
All OFED 1.4 and 2.6.32 (that's what I can get to today)
qib to qib:
# OSU MPI Latency Test v3.3
# Size          Latency (us)
0               0.29
1               0.32
2               0.31
4               0.32
8               0.32
16
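Running the same test across the two node types is just a matter of
host selection with Open MPI; a sketch with placeholder hostnames:

    # qib to qib (as above)
    mpirun -np 2 --host qib1,qib2 ./osu_latency

    # mlx4 to mlx4
    mpirun -np 2 --host mlx1,mlx2 ./osu_latency

    # mixed mlx4 / qib
    mpirun -np 2 --host mlx1,qib1 ./osu_latency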
I don't think too many people have done combined QLogic + Mellanox runs, so
this probably isn't a well-explored space.
Can you run some microbenchmarks to see what kind of latency / bandwidth you're
getting between nodes of the same type and nodes of different types?
On Jul 14, 2011, at 8:21 PM, David Warren wrote:
On my test runs (WRF run just long enough to go beyond the spinup influence):
On just 6 of the old mlx4 machines I get about 00:05:30 runtime.
On 3 mlx4 and 3 qib nodes I get an avg of 00:06:20.
So the slowdown is about 11+%.
When this is a full run, 11% becomes a very long time. This has held for
On Jul 13, 2011, at 7:46 PM, David Warren wrote:
I finally got access to the systems again (the original ones are part of
our real time system). I thought I would try one other test I had set up
first. I went to OFED 1.6 and it started running with no errors. It
must have been an OFED bug. Now I just have the speed problem. Anyone
have a way
Huh; wonky.
Can you set the MCA parameter "mpi_abort_delay" to -1 and run your job again?
This will prevent all the processes from dying when MPI_ABORT is invoked. Then
attach a debugger to one of the still-live processes after the error message is
printed. Can you send the stack trace? It
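Roughly, the sequence is (the job command line and PID are placeholders):

    # keep the ranks alive after MPI_ABORT fires
    mpirun --mca mpi_abort_delay -1 -np 48 ./wrf.exe

    # after the error message appears, attach to a surviving rank on its node
    gdb -p <pid-of-a-wrf.exe-rank>
    (gdb) thread apply all bt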