Prentice,
Avoiding the obvious question of whether your FM is running and the fabric is
in an active state, it sounds like you're exhausting a resource on the cards.
Ralph is correct about support for QLogic cards being long past, but I'll see
what I can dig up in the archives on Monday to see if
On 08/05/2020 at 21:56, Prentice Bisbal via users wrote:
>
> We often get the following errors when more than one job runs on the
> same compute node. We are using Slurm with OpenMPI. The IB cards are
> QLogic using PSM:
>
> 10698ipath_userinit: assign_context command failed: Network is down
> no
That's it! I was trying to remember what the setting was, but I haven't worked on
those HCAs since around 2012, so my memory of it was faint.
That said, I found the Intel TrueScale manual online at
https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/OFED_Host_Software_UserG
This hardware has been working for nearly 10 years with several generations of
nodes and OpenMPI without any problem. Today it is possible to find
refurbished parts at low prices on the web, which can help with building small
clusters. It is really more efficient than 10Gb Ethernet for parallel
codes due to t
How can I run OpenMPI's Memchecker on a process created by MPI_Comm_spawn()?
I've configured OpenMPI 4.0.3 for Memchecker, along with Valgrind 3.15.0, and it
works quite well on processes created directly by mpiexec.
I tried to do something analogous by prepending "valgrind" onto the command
Kurt,
the error is that "valgrind myApp" is not an executable (though it is a
command a shell can interpret),
so you have several options:
- use a wrapper (e.g. myApp.valgrind) that forks and execs "valgrind myApp"
- call MPI_Comm_spawn("valgrind", argv, ...) after inserting "myApp" at the
beginning of argv (a sketch of this option follows the list)
-
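
For the second option, here is a minimal C sketch of what the spawn call could
look like. The application name "./myApp" and the single spawned rank are
placeholder assumptions for illustration, not details from the original post:

/* Hedged sketch: spawn valgrind itself, with the real application inserted as
 * the first element of the spawned argv (so valgrind runs "./myApp"). */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* The argv passed to MPI_Comm_spawn does not include the command name;
     * here valgrind will receive "./myApp" as its first argument. */
    char *spawn_argv[] = { "./myApp", NULL };

    MPI_Comm intercomm;
    MPI_Comm_spawn("valgrind", spawn_argv, 1, MPI_INFO_NULL, 0,
                   MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    /* ... exchange messages with the valgrind-wrapped child over intercomm ... */

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}

The wrapper option works the same way in the end, except the fork and exec of
"valgrind myApp" is hidden inside the myApp.valgrind script instead of being
expressed directly in the spawn call.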