Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available processors)" when running multiple jobs concurrently

2024-04-16 Thread Greg Samonds via users
Hi Gilles,

Thanks for your assistance.

I tried the recommended settings but got an error saying “sm” is no longer 
available in Open MPI 3.0+, and to use “vader” instead.  I then tried with 
“--mca pml ob1 --mca btl self,vader” but ended up with the original error:

[podman-ci-rocky-8.8:09900] MCW rank 3 is not bound (or bound to all available 
processors)
[podman-ci-rocky-8.8:09899] MCW rank 2 is not bound (or bound to all available 
processors)
[podman-ci-rocky-8.8:09898] MCW rank 1 is not bound (or bound to all available 
processors)
[podman-ci-rocky-8.8:09897] MCW rank 0 is not bound (or bound to all available 
processors)

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:
#0  0xa202a917 in ???
#1  0xa20299a7 in ???
#2  0xa520079f in ???
#3  0xa1d0380c in ???
#4  0xa1d56fe7 in ???
#5  0xa1d57be7 in ???
#6  0xa1d5a5f7 in ???
#7  0xa1d5b35b in ???
#8  0xa17b8db7 in get_print_name_buffer
at util/name_fns.c:106
#9  0xa17b8e1b in orte_util_print_jobids
at util/name_fns.c:171
#10  0xa17b91eb in orte_util_print_name_args
at util/name_fns.c:143
#11  0xa1822e93 in _process_name_print_for_opal
at runtime/orte_init.c:68
#12  0x9ebe5e6f in process_event
at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/pmix/pmix3x/pmix3x.c:255
#13  0xa16ec3cf in event_process_active_single_queue
at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1370
#14  0xa16ec3cf in event_process_active
at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1440
#15  0xa16ec3cf in opal_libevent2022_event_base_loop
at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1644
#16  0xa16a9d93 in progress_engine
at runtime/opal_progress_threads.c:105
#17  0xa1e678b7 in ???
#18  0xa1d03afb in ???
#19  0x in ???

The typical mpiexec options for each job include “-np 4 --allow-run-as-root 
--bind-to none --report-bindings” and a “-x LD_LIBRARY_PATH=…” which passes the 
HPC-X and application environment.
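
For reference, the fully assembled command for one job looks roughly like
this (the solver binary name below is just a placeholder, and the
LD_LIBRARY_PATH value is trimmed):

  mpiexec -np 4 --allow-run-as-root --bind-to none --report-bindings \
      --mca pml ob1 --mca btl self,vader \
      -x LD_LIBRARY_PATH=… ./solver_case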

I will get back to you with a core dump once I figure out the best way to 
generate and retrieve it from within our CI infrastructure.
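
In case it helps, this is roughly what I plan to try inside the container
(assuming gdb is available in the image; note that kernel.core_pattern is a
host-wide setting, so it may need to be set from outside the container, and
the paths below are just examples):

  # allow core files in the shell that launches the tests
  ulimit -c unlimited

  # write cores to a predictable location (requires root, host-wide effect)
  echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

  # after a failure, pull a backtrace out of the core
  gdb ./the_failing_binary /tmp/core.the_failing_binary.12345 -ex bt -ex quit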

Thanks again!

Regards,
Greg

From: users  On Behalf Of Gilles Gouaillardet 
via users
Sent: Tuesday, April 16, 2024 12:59 AM
To: Open MPI Users 
Cc: Gilles Gouaillardet 
Subject: Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available 
processors)" when running multiple jobs concurrently

Greg,

If Open MPI was built with UCX, your jobs will likely use UCX (and the shared 
memory provider) even if running on a single node.
You can run
mpirun --mca pml ob1 --mca btl self,sm ...
if you want to avoid using UCX.
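
For example, something along these lines (substitute your usual arguments
for the trailing part; ./your_app is a placeholder); the two verbose knobs
will also print which pml and btl components were actually selected:

  mpirun --mca pml ob1 --mca btl self,sm \
         --mca pml_base_verbose 10 --mca btl_base_verbose 10 \
         -np 4 ./your_app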

What is a typical mpirun command line used under the hood by your "make test"?
Though the warning might be ignored, SIGILL is definitely an issue.
I encourage you to have your app dump a core in order to figure out where
this is coming from.


Cheers,

Gilles

On Tue, Apr 16, 2024 at 5:20 AM Greg Samonds via users
<users@lists.open-mpi.org> wrote:
Hello,

We’re running into issues with jobs failing in a non-deterministic way when 
running multiple jobs concurrently within a “make test” framework.

Make test is launched from within a shell script running inside a Podman 
container, and we’re typically running with “-j 20” and “-np 4” (20 jobs 
concurrently with 4 procs each).  We’ve also tried reducing the number of jobs 
to no avail.  Each time the battery of test cases is run, about 2 to 4 
different jobs out of around 200 fail with the following errors:

[podman-ci-rocky-8.8:03528] MCW rank 1 is not bound (or bound to all available 
processors)
[podman-ci-rocky-8.8:03540] MCW rank 3 is not bound (or bound to all available 
processors)
[podman-ci-rocky-8.8:03519] MCW rank 0 is not bound (or bound to all available 
processors)
[podman-ci-rocky-8.8:03533] MCW rank 2 is not bound (or bound to all available 
processors)

Program received signal SIGILL: Illegal instruction.

Some info about our setup:

  *   Ampere Altra 80-core ARM machine
  *   Open MPI 4.1.7a1 from HPC-X v2.18
  *   Rocky Linux 8.6 host, Rocky Linux 8.8 container
  *   Podman 4.4.1
  *   This machine has a Mellanox ConnectX-6 Lx NIC, however we’re avoiding
the Mellanox software stack by running in a container, and these are
single-node jobs only

We tried passing “--bind-to none” to the running jobs, and while this seemed
to reduce the number of failing jobs on average
[OMPI users] Helping interpreting error output

2024-04-16 Thread Jeffrey Layton via users
Good afternoon MPI fans of all ages,

Yet again, I'm getting an error that I'm having trouble interpreting. This
time, I'm trying to run ior. I've done it a thousand times but not on an
NVIDIA DGX A100 with multiple NICs.

The ultimate command is the following:


/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun --mca btl '^openib' -np 4
-map-by ppr:4:node --allow-run-as-root --mca
btl_openib_warn_default_gid_prefix 0 --mca btl_openib_if_exclude
mlx5_0,mlx5_5,mlx5_6 --mca plm_base_verbose 0 --mca plm rsh
/home/bcm/bin/bin/ior -w -r -z -e -C -t 1m -b 1g -s 1000 -o /mnt/test


It was suggested to me to use these MPI options. The error I get is the
following.

--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:  dgx-02
Framework: pml
Component: ucx
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_pml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[dgx-02:2399932] *** An error occurred in MPI_Init
[dgx-02:2399932] *** reported by process [2099773441,3]
[dgx-02:2399932] *** on a NULL communicator
[dgx-02:2399932] *** Unknown error
[dgx-02:2399932] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[dgx-02:2399932] ***    and potentially your MPI job)


My first inclination was that it couldn't find UCX, so I loaded that module
and re-ran it. I get the exact same error message. I'm still checking whether
the ucx module gets loaded when I run via Slurm; mdtest ran without issue
there, but I'm verifying that.
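
In case it's useful, here is what I plan to check next (assuming the default
install layout under the prefix above; the component filename is my guess at
the usual one):

  # does this Open MPI build include the UCX PML at all?
  /cm/shared/apps/openmpi4/gcc/4.1.5/bin/ompi_info | grep -i ucx

  # do the UCX component's shared libraries resolve in this environment?
  ldd /cm/shared/apps/openmpi4/gcc/4.1.5/lib/openmpi/mca_pml_ucx.so | grep "not found"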

Any thoughts?

Thanks!

Jeff


Re: [OMPI users] [EXTERNAL] Helping interpreting error output

2024-04-16 Thread Pritchard Jr., Howard via users
Hi Jeffrey,

I would suggest trying to debug what may be going wrong with UCX on your DGX 
box.

There are several things to try from the UCX FAQ:
https://openucx.readthedocs.io/en/master/faq.html

I’d suggest setting the UCX_LOG_LEVEL environment variable to info or debug
and seeing if UCX says something about what’s going wrong.

Also add --mca plm_base_verbose 10 to the mpirun command line.
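
For example (the trailing "..." stands in for your original ior command
line):

  mpirun -x UCX_LOG_LEVEL=debug --mca plm_base_verbose 10 ...

Independently of mpirun, ucx_info will show what UCX sees on the node:

  ucx_info -v   # which UCX build is being picked up
  ucx_info -d   # devices and transports UCX detects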

Have you used DGX boxes with only a single NIC successfully?

Howard

