On Thursday, March 10, 2011 08:30:19 pm Thierry LAMOUREUX wrote:
> Hello,
> 
> We have recently enhanced our network with Infiniband modules on a six-node
> cluster.
> 
> We have installed all OFED drivers related to our hardware.
> 
> We have set the network IPs as follows :
> - eth : 192.168.1.0 / 255.255.255.0
> - ib : 192.168.70.0 / 255.255.255.0
> 
> After the first tests all seems good. IB interfaces ping each other, ssh and
> other kinds of exchanges over IB work well.

A very important thing to realise is that TCP/IP on Infiniband, while quite 
possible and sometimes useful, has very little to do with running MPI/OpenMPI 
"using" Infiniband.

MPI data transport can run on either TCP/IP (btl: tcp) or natively on IB (for 
Mellanox btl: openib, for Qlogic mtl: psm).
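
As a rough illustration of that difference, the transport is chosen with MCA 
parameters on the mpirun command line. The sketch below only prints the 
command it would run. The helper name mpirun_cmd, the ./a.out program and the 
ib0 interface name are placeholders of mine, not taken from the original post. 
Note that the parameter name is spelled "btl".

```shell
#!/bin/sh
# Print (do not run) an mpirun command line for a chosen MPI data transport.
# mpirun_cmd is a hypothetical helper; hostfile.list, ./a.out and ib0 are
# placeholders to adapt to the local setup.
mpirun_cmd() {
    transport="$1"
    shift
    case "$transport" in
        # MPI data over TCP/IP on whatever interfaces Open MPI picks
        tcp)    mca="-mca btl tcp,sm,self" ;;
        # MPI data over TCP/IP, restricted to the IP-over-IB interface
        ipoib)  mca="-mca btl tcp,sm,self -mca btl_tcp_if_include ib0" ;;
        # MPI data natively over IB verbs, no TCP/IP involved
        openib) mca="-mca btl openib,sm,self" ;;
        *)      echo "unknown transport: $transport" >&2; return 1 ;;
    esac
    echo "mpirun -hostfile hostfile.list $mca $*"
}

mpirun_cmd tcp ./a.out
mpirun_cmd ipoib ./a.out
mpirun_cmd openib ./a.out
```

Only the last form takes TCP/IP out of the MPI data path. The first two still 
use the tcp btl, whichever network carries the packets.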

On top of this, job startup uses TCP/IP.
 
> Then we started to run our job through openmpi (built with the --with-openib
> option) and our first results were very bad.

This builds the openib btl, but it won't be used at runtime if there's no 
active IB interface (I'm _NOT_ talking about interfaces as listed by ifconfig). 
Check your IB with ibstat or similar.
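
A minimal sketch of such a check, assuming ibstat prints a "State:" line 
followed by a "Physical state:" line for each port (verify this against the 
output on your own nodes). The helper name count_active_ports is made up for 
this example, and it reads from stdin so it can be tried without IB hardware.

```shell
#!/bin/sh
# Count ports reported as logically "Active" and physically "LinkUp".
# Expects ibstat-style text on stdin; count_active_ports is a hypothetical
# helper, and the field layout is an assumption about ibstat's output.
count_active_ports() {
    awk '
        # Remember whether the most recent port is logically Active.
        $1 == "State:" { active = ($2 == "Active") }
        # Count the port only if its physical link is up as well.
        $1 == "Physical" && $2 == "state:" {
            if (active && $3 == "LinkUp") n++
            active = 0
        }
        END { print n + 0 }
    '
}

# On a cluster node:
#   ibstat | count_active_ports
```

A result of 0 on any node would mean the openib btl has nothing to run on 
there, even if the IP-over-IB interface answers ping.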

Also, while it's possible to run MPI traffic on the openib btl (verbs) on 
Qlogic cards, you'll have to use the psm mtl for good performance.
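
The vendor split can be sketched as below. transport_flags is a hypothetical 
helper and the vendor keywords are labels chosen for this example. The 
"-mca pml cm" part is an addition based on my understanding that MTLs are 
driven through the cm pml, so check it against your Open MPI version.

```shell
#!/bin/sh
# Map an HCA vendor to a matching Open MPI transport selection.
# transport_flags is a hypothetical helper; the vendor keywords are labels
# chosen for this example only.
transport_flags() {
    case "$1" in
        # Mellanox: verbs through the openib btl
        mellanox) echo "-mca btl openib,sm,self" ;;
        # Qlogic: PSM through the psm mtl (selected via the cm pml)
        qlogic)   echo "-mca pml cm -mca mtl psm" ;;
        *)        echo "unknown vendor: $1" >&2; return 1 ;;
    esac
}

# Print the resulting command line rather than running it.
echo "mpirun -hostfile hostfile.list $(transport_flags qlogic) ./a.out"
```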

/Peter

> After investigation, our system shows the following behaviour :
> - job starts over the ib network (a few packets are sent)
> - job switches to the eth network (all following packets are sent to these interfaces)
> 
> We never specified the IP Address of our eth interfaces.
> 
> We tried to launch our jobs with the following options :
> - mpirun -hostfile hostfile.list -mca blt openib,self
> /common_gfs2/script-test.sh
> - mpirun -hostfile hostfile.list -mca blt openib,sm,self
> /common_gfs2/script-test.sh
> - mpirun -hostfile hostfile.list -mca blt openib,self -mca
> btl_tcp_if_exclude lo,eth0,eth1,eth2 /common_gfs2/script-test.sh
> 
> The final behaviour remains the same : the job is initiated over ib and runs
> over eth.
> 
> We grabbed the performance test files (osu_bw and osu_latency) and got not so
> bad results (see attached files).
> 
> We have tried plenty of different things but we are stuck : we don't get
> any error message...
> 
> Thanks in advance for your help.
> 
> Thierry.
