We often get the following errors when more than one job runs on the
same compute node. We are using Slurm with Open MPI. The InfiniBand
cards are QLogic, using PSM:
10698ipath_userinit: assign_context command failed: Network is down
node01.10698can't open /dev/ipath, network down (err=26)
node01.10703ipath_userinit: assign_context command failed: Network is down
node01.10703can't open /dev/ipath, network down (err=26)
node01.10701ipath_userinit: assign_context command failed: Network is down
node01.10701can't open /dev/ipath, network down (err=26)
node01.10700ipath_userinit: assign_context command failed: Network is down
node01.10700can't open /dev/ipath, network down (err=26)
node01.10697ipath_userinit: assign_context command failed: Network is down
node01.10697can't open /dev/ipath, network down (err=26)
--------------------------------------------------------------------------
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.
Error: Could not detect network connectivity
--------------------------------------------------------------------------
Any ideas on how to fix this?
--
Prentice