Hi Yevgeny and list,


----- Original Message -----

> From: Yevgeny Kliteynik <klit...@dev.mellanox.co.il>

> I'll check the MCA_BTL_OPENIB_TRANSPORT_UNKNOWN thing and get back to you.

Thank you.

> One question though, just to make sure we're on the same page: so the jobs
> do run OK on the older HCAs, as long as they run *only* on the older HCAs,
> right?

Yes, correct.  They run on the newer hosts using the newer (ConnectX) HCAs as
long as the jobs stay on the same (newer) HCA type, and they run on the older
HCAs (mthca) so long as the jobs stay on the same HCA type as well.  IOW, as
long as the jobs run on homogeneous IB hardware, they run successfully to
completion.  We've successfully done things like Checkpoint/Restart using the
BLCR functionality, and it all seems to work well and robustly.

> Please make sure that the jobs are using only IB with "--mca btl 
> openib,self" parameters.

The system is in use right now, so I will have to test this and get back to you,
but I can also say with certainty that we don't specify --mca parameters unless
a user needs to run on Ethernet only (to avoid the IB errors we're discussing).
Otherwise, jobs run with the Open MPI 1.5.3 default behavior.  The users are also
all using the systemwide Open MPI installation, so this isn't an issue of an
erroneous local configuration lying around from multiple parallel installs, or
interfering copies of different builds, etc.
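
For concreteness, here is a rough sketch of the two kinds of invocation in question.  The process count and binary name are placeholders, and "tcp,self" is only an assumed form for an Ethernet-only run, not necessarily what our users actually type.  The script just prints the command lines instead of launching anything.

```shell
#!/bin/sh
# Placeholders (assumptions, not taken from the real jobs): process count and binary.
NP=16
APP=./my_mpi_app

# IB only, as requested: the openib BTL plus "self" (send-to-self loopback).
IB_CMD="mpirun --mca btl openib,self -np $NP $APP"

# Ethernet only (assumed form): the TCP BTL plus "self", leaving openib out.
ETH_CMD="mpirun --mca btl tcp,self -np $NP $APP"

# Print rather than execute, so nothing is launched on a busy system.
echo "$IB_CMD"
echo "$ETH_CMD"
```

With no --mca btl setting at all, Open MPI picks the BTLs itself, which is the default case described above.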

Other than the mandatory iw_cm kernel module, we are not building/using any 
iWarp or DAPL/uDAPL functionality.  We are also not running IP on the IB 
network.
