so this isn't really an OpenMPI question (I don't think), but if anyone
has hit this problem, it'll be you guys...

basically I'm seeing wildly different bandwidths over InfiniBand 4x DDR
when I use different kernels.
I'm testing with netpipe-3.6.2's NPmpi, but a home-grown pingpong sees
the same thing.
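
for reference, the home-grown pingpong is basically the usual shape of
thing below -- a minimal sketch, not the exact code I ran (message sizes,
rep counts and the bandwidth arithmetic here are just illustrative):

  /* minimal MPI pingpong bandwidth sketch: rank 0 sends, rank 1 echoes,
   * and we time round trips over a range of message sizes. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, reps = 100;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      for (long n = 1; n <= 4L * 1024 * 1024; n *= 2) {
          char *buf = malloc(n);
          MPI_Barrier(MPI_COMM_WORLD);
          double t0 = MPI_Wtime();
          for (int i = 0; i < reps; i++) {
              if (rank == 0) {
                  MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                  MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE);
              } else if (rank == 1) {
                  MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE);
                  MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
              }
          }
          double t = MPI_Wtime() - t0;
          if (rank == 0)   /* 2*n bytes move per round trip, reps trips */
              printf("%8ld bytes  %8.2f Mbit/s\n",
                     n, 2.0 * n * reps * 8 / t / 1e6);
          free(buf);
      }
      MPI_Finalize();
      return 0;
  }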

the default 2.6.9-42.0.3.ELsmp (and also SLES10's kernel) gives ok
bandwidth (50% of peak I guess is good?) at ~10 Gbit/s, but a pile of
newer kernels (2.6.19.2, 2.6.20-rc4, 2.6.18-1.2732.4.2.el5.OFED_1_1(*))
all max out at ~5.3 Gbit/s.

half the bandwidth! :-(
latency is the same.

the same OpenMPI (1.1.1 from OSCAR, rebuilt for openib support) and
NPmpi were used with all kernels.
I see an intermediate bandwidth if one end runs the 'fast' 2.6.9 and
the other a 'slow' kernel, so they don't appear to be using completely
different protocols.
it doesn't make any difference if I try to make extra-sure it's using
openib with:
  mpirun --mca btl openib --mca btl_tcp_if_exclude lo,eth0 ...

OS is CentOS 4.4 x86_64, which AFAICT includes packages based on OFED 1.0.
lspci says the PCIe card is:
  InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)
and dmesg says that all kernels are using
  ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
but also whinges that 'HCA FW version 1.0.700 is old'.

any ideas?
it's very odd that all the newer kernels (including RHEL5's) are slow.

will OFED 1.1 make any difference? it didn't build cleanly when I
tried, but I can try and try again...

thanks for any hints.

cheers,
robin

(*) rhel5 + OFED 1.1 test kernel, rebuilt for centos4.4 from the src.rpm at
  http://people.redhat.com/dledford/Infiniband/kernel/2.6.18/1.2732.4.2.el5.OFED_1_1/x86_64/
