We've encountered OpenMPI crashing in handle_wc(), with the following error message:

[.../opal/mca/btl/openib/btl_openib_component.c:3610:handle_wc]
Unhandled work completion opcode is 136

Our setup is admittedly a little tricky, but I'm still worried that it
may be a genuine problem, so please bear with me.
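For reference, a handler of this kind typically polls the completion queue and switches on the opcode reported in each work completion; any opcode without a matching case falls through to an error path like the one above. The following is only a minimal sketch against plain libibverbs, not the actual btl_openib code (the drain_cq name and the abort() on the default branch are assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

/* Sketch of a completion-queue drain loop: poll work completions and
 * dispatch on wc.opcode; an opcode with no case ends up in the default
 * branch, analogous to the "Unhandled work completion opcode" abort. */
static void drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status != IBV_WC_SUCCESS) {
            fprintf(stderr, "work completion failed: %s\n",
                    ibv_wc_status_str(wc.status));
            continue;
        }
        switch (wc.opcode) {
        case IBV_WC_SEND:
            /* handle completed send */
            break;
        case IBV_WC_RECV:
        case IBV_WC_RECV_RDMA_WITH_IMM:
            /* handle completed receive */
            break;
        case IBV_WC_RDMA_READ:
        case IBV_WC_RDMA_WRITE:
            /* handle completed RDMA operation */
            break;
        default:
            fprintf(stderr, "Unhandled work completion opcode is %d\n",
                    wc.opcode);
            abort();
        }
    }
}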
> i.e., "-host node1:5" assigns 5 slots
> to node1.
>
> If tm support is included, then we read the PBS allocation and see one slot
> on each node - and launch accordingly.
>
>
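To check where the launcher actually put the ranks (two procs on the first node vs. one per node), a small diagnostic along these lines can be run inside the PBS job. This is only a sketch, not code from the thread; it prints each rank's host name together with the OMPI_COMM_WORLD_LOCAL_* variables:

/* placement_check.c - print where each rank runs and what Open MPI's
 * local-size/local-rank environment variables claim.
 * Build: mpicc placement_check.c -o placement_check
 * Run:   mpirun -np 2 ./placement_check   (inside the PBS job)
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];
    const char *lsize, *lrank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &namelen);

    lsize = getenv("OMPI_COMM_WORLD_LOCAL_SIZE");
    lrank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");

    printf("rank %d/%d on %s: OMPI_COMM_WORLD_LOCAL_SIZE=%s "
           "OMPI_COMM_WORLD_LOCAL_RANK=%s\n",
           rank, size, host,
           lsize ? lsize : "(unset)", lrank ? lrank : "(unset)");

    MPI_Finalize();
    return 0;
}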
> > On Jan 18, 2022, at 2:44 PM, Crni Gorac via users
> > wrote:
> >
>
> cmd line. My guess is that you are using the ssh launcher - what is odd is
> that you should wind up with two procs on the first node, in which case those
> envars are correct. If you are seeing one proc on each node, then something
> is wrong.
>
>
> > On Jan 18, 2022, at 1:33 PM, ... wrote:
>
> Afraid I can't understand your scenario - when you say you "submit a job" to
> run on two nodes, how many processes are you running on each node??
>
>
> > On Jan 18, 2022, at 1:07 PM, Crni Gorac via users
> > wrote:
> >
> > Using OpenMPI 4.1.2 from MLNX_OFED_LINUX-5.5-1.0.3.2 distribution, and
> > have PBS 18.1.4 installed on my cluster (cluster nodes are running
> > CentOS 7.9). When I try to submit a job that will run on two nodes in
> > the cluster, both ranks get OMPI_COMM_WORLD_LOCAL_SIZE set to 2,
> > instead of 1, and OMPI_CO
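One way to tell whether the environment variable or the actual placement is off would be to compare OMPI_COMM_WORLD_LOCAL_SIZE with the size of the node-local communicator obtained from MPI_Comm_split_type. This is a hedged sketch using standard MPI-3 calls, not code from the original report:

/* local_size_check.c - compare OMPI_COMM_WORLD_LOCAL_SIZE with the size
 * of the node-local communicator from MPI_Comm_split_type. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, node_size;
    MPI_Comm node_comm;
    const char *env;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Group ranks that share a node into one communicator. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_size(node_comm, &node_size);

    env = getenv("OMPI_COMM_WORLD_LOCAL_SIZE");
    printf("rank %d: actual local size = %d, OMPI_COMM_WORLD_LOCAL_SIZE = %s\n",
           rank, node_size, env ? env : "(unset)");

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

If the printed values disagree (e.g. actual local size 1 but the variable says 2), that points at the slot count the launcher used rather than at the application.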