Thanks. I'm going to give this solution a try.
On 5/9/20 9:51 AM, Patrick Bégou via users wrote:
On 08/05/2020 at 21:56, Prentice Bisbal via users wrote:
We often get the following errors when more than one job runs on the
same compute node. We are using Slurm with OpenMPI. The IB cards are
QLogic using PSM:
node01.10698ipath_userinit: assign_context command failed: Network is down
node01.10698can't open /dev/ipath, network down (err=26)
node01.10703ipath_userinit: assign_context command failed: Network is down
node01.10703can't open /dev/ipath, network down (err=26)
node01.10701ipath_userinit: assign_context command failed: Network is down
node01.10701can't open /dev/ipath, network down (err=26)
node01.10700ipath_userinit: assign_context command failed: Network is down
node01.10700can't open /dev/ipath, network down (err=26)
node01.10697ipath_userinit: assign_context command failed: Network is down
node01.10697can't open /dev/ipath, network down (err=26)
--------------------------------------------------------------------------
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.
Error: Could not detect network connectivity
--------------------------------------------------------------------------
Any ideas on how to fix this?
--
Prentice
Hi Prentice,
This is not an Open MPI problem but a limitation of your hardware. I don't
have many details, but I think this occurs when several jobs share the same
node and the nodes have a large number of cores (> 14). If this is the case:
On QLogic hardware (which I am still using) each HBA has 16 channels
(hardware contexts) for communication and, if I remember correctly what I
read many years ago, 2 of them are dedicated to the system. When an MPI
application launches, each process of the job requests its own dedicated
channel if one is available; otherwise the processes share ALL of the
available channels. So when a second job starts on the same node, no free
channel remains for it.
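To make the numbers concrete for my 20-core nodes (a rough sketch; the figure
of 2 system-reserved contexts is as I remember it, not something I have
re-checked):

   16 contexts per HBA - 2 reserved for the system  = 14 usable contexts
   20-rank job, one context per rank wanted         -> takes all 14, nothing left for a second job
   20-rank job, 2 ranks sharing each context        -> uses 10 contexts, 4 remain free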
To avoid this situation I force each channel to be shared by 2 MPI processes
(my nodes have 20 cores). You can set this with a simple environment
variable. On all my cluster nodes I create the file:
*/etc/profile.d/ibsetcontext.sh*
And it contains:
# allow 2 processes to share a hardware MPI context
# in InfiniBand with PSM
*export PSM_RANKS_PER_CONTEXT=2*
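If you want to check that the variable really reaches the MPI processes, you
can print the environment from inside a job, for example (just a quick
sketch; whether /etc/profile.d is sourced depends on how Slurm builds the job
environment on your site):

   srun -N1 -n1 env | grep PSM_RANKS_PER_CONTEXT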
Of course, if some users manage to oversubscribe the cores (more than one
process per core) the problem could come back, but we do not oversubscribe.
Hope this helps.
Patrick
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov