[OMPI users] "MCW rank 0 is not bound (or bound to all available processors)" when running multiple jobs concurrently

2024-04-15 Thread Greg Samonds via users
Hello,

We're running into issues with jobs failing in a non-deterministic way when 
running multiple jobs concurrently within a "make test" framework.

Make test is launched from within a shell script running inside a Podman 
container, and we're typically running with "-j 20" and "-np 4" (20 jobs 
concurrently with 4 procs each).  We've also tried reducing the number of jobs 
to no avail.  Each time the battery of test cases is run, about 2 to 4 
different jobs out of around 200 fail with the following errors:

[podman-ci-rocky-8.8:03528] MCW rank 1 is not bound (or bound to all available 
processors)
[podman-ci-rocky-8.8:03540] MCW rank 3 is not bound (or bound to all available 
processors)
[podman-ci-rocky-8.8:03519] MCW rank 0 is not bound (or bound to all available 
processors)
[podman-ci-rocky-8.8:03533] MCW rank 2 is not bound (or bound to all available 
processors)

Program received signal SIGILL: Illegal instruction.

Some info about our setup:

  *   Ampere Altra 80 core ARM machine
  *   Open MPI 4.1.7a1 from HPC-X v2.18
  *   Rocky Linux 8.6 host, Rocky Linux 8.8 container
  *   Podman 4.4.1
  *   This machine has a Mellanox ConnectX-6 Lx NIC; however, we're avoiding 
the Mellanox software stack by running in a container, and these are 
single-node jobs only

We tried passing "--bind-to none" to the running jobs, and while this seemed to 
reduce the number of failing jobs on average, it didn't eliminate the issue.
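
For reference, each test case is launched with something like the following 
(the binary name and arguments here are placeholders for the real ones):

mpirun -np 4 ./test_case

...and the variant with binding disabled:

mpirun --bind-to none -np 4 ./test_case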

We also encounter the following warning:

[1712927028.412063] [podman-ci-rocky-8:3519 :0]sock.c:514  UCX  
WARN  unable to read somaxconn value from /proc/sys/net/core/somaxconn file

...however, as far as I can tell this is probably unrelated: it occurs because 
the associated file isn't accessible inside the container, and after checking 
the UCX source I can see that SOMAXCONN is picked up from the system headers 
anyway.

If anyone has hints about how to work around this issue we'd greatly appreciate 
it!

Thanks,
Greg


Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available processors)" when running multiple jobs concurrently

2024-04-15 Thread Gilles Gouaillardet via users
Greg,

If Open MPI was built with UCX, your jobs will likely use UCX (and its shared
memory provider) even when running on a single node.
If you want to avoid using UCX, you can run:

mpirun --mca pml ob1 --mca btl self,sm ...
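
For example, assuming a test is normally started with something like
"mpirun -np 4 ./test_case" (the binary name is just an illustration), that
becomes:

mpirun --mca pml ob1 --mca btl self,sm -np 4 ./test_case

You can also export the equivalent MCA environment variables, so the test
harness itself does not need to change:

export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=self,sm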

What is a typical mpirun command line used under the hood by your "make
test"?
Though the warning can probably be ignored, the SIGILL is definitely an issue.
I encourage you to have your app dump a core in order to figure out where
this is coming from.
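
Something along these lines should work (the core file location depends on the
kernel's core_pattern setting, and the binary name is just an illustration):

# allow core dumps inside the container, then rerun the failing test
ulimit -c unlimited
mpirun -np 4 ./test_case

# inspect the resulting core file
gdb ./test_case core    # then type "bt" at the gdb prompt to get a backtrace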


Cheers,

Gilles
