Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available processors)" when running multiple jobs concurrently
Hi Gilles,

Thanks for your assistance. I tried the recommended settings but got an error saying that "sm" is no longer available in Open MPI 3.0+ and that "vader" should be used instead. I then tried with "--mca pml ob1 --mca btl self,vader" but ended up with the original error:

[podman-ci-rocky-8.8:09900] MCW rank 3 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:09899] MCW rank 2 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:09898] MCW rank 1 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:09897] MCW rank 0 is not bound (or bound to all available processors)

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:
#0 0xa202a917 in ???
#1 0xa20299a7 in ???
#2 0xa520079f in ???
#3 0xa1d0380c in ???
#4 0xa1d56fe7 in ???
#5 0xa1d57be7 in ???
#6 0xa1d5a5f7 in ???
#7 0xa1d5b35b in ???
#8 0xa17b8db7 in get_print_name_buffer at util/name_fns.c:106
#9 0xa17b8e1b in orte_util_print_jobids at util/name_fns.c:171
#10 0xa17b91eb in orte_util_print_name_args at util/name_fns.c:143
#11 0xa1822e93 in _process_name_print_for_opal at runtime/orte_init.c:68
#12 0x9ebe5e6f in process_event at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/pmix/pmix3x/pmix3x.c:255
#13 0xa16ec3cf in event_process_active_single_queue at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1370
#14 0xa16ec3cf in event_process_active at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1440
#15 0xa16ec3cf in opal_libevent2022_event_base_loop at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1644
#16 0xa16a9d93 in progress_engine at runtime/opal_progress_threads.c:105
#17 0xa1e678b7 in ???
#18 0xa1d03afb in ???
#19 0x in ???

The typical mpiexec options for each job include "-np 4 --allow-run-as-root --bind-to none --report-bindings" and a "-x LD_LIBRARY_PATH=…" which passes the HPC-X and application environment.

I will get back to you with a core dump once I figure out the best way to generate and retrieve it from within our CI infrastructure.

Thanks again!

Regards,
Greg
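For reference, a minimal sketch of one way to capture a core file from a containerized run like this; the core_pattern path, image, and CI script names below are placeholders, and kernel.core_pattern is a kernel-wide setting, so it has to be changed on the host rather than from inside an unprivileged container:

# on the host: write cores to a plain file instead of piping them to systemd-coredump/abrt
sudo sysctl -w kernel.core_pattern=/tmp/core.%e.%p

# raise the core-file size limit for the container
# (assumes podman's --ulimit flag can be added to the existing CI invocation)
podman run --ulimit core=-1 -v /tmp:/tmp <image> <ci-script>

# inside the container, before launching the tests as usual
ulimit -c unlimited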
From: users On Behalf Of Gilles Gouaillardet via users
Sent: Tuesday, April 16, 2024 12:59 AM
To: Open MPI Users
Cc: Gilles Gouaillardet
Subject: Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available processors)" when running multiple jobs concurrently

Greg,

If Open MPI was built with UCX, your jobs will likely use UCX (and the shared memory provider) even if running on a single node.

You can mpirun --mca pml ob1 --mca btl self,sm ... if you want to avoid using UCX.

What is a typical mpirun command line used under the hood by your "make test"?

Though the warning might be ignored, SIGILL is definitely an issue. I encourage you to have your app dump a core in order to figure out where this is coming from.

Cheers,

Gilles

On Tue, Apr 16, 2024 at 5:20 AM Greg Samonds via users <users@lists.open-mpi.org> wrote:

Hello,

We're running into issues with jobs failing in a non-deterministic way when running multiple jobs concurrently within a "make test" framework. "make test" is launched from within a shell script running inside a Podman container, and we're typically running with "-j 20" and "-np 4" (20 jobs concurrently with 4 procs each). We've also tried reducing the number of jobs, to no avail.

Each time the battery of test cases is run, about 2 to 4 different jobs out of around 200 fail with the following errors:

[podman-ci-rocky-8.8:03528] MCW rank 1 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:03540] MCW rank 3 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:03519] MCW rank 0 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:03533] MCW rank 2 is not bound (or bound to all available processors)

Program received signal SIGILL: Illegal instruction.

Some info about our setup:

* Ampere Altra 80-core ARM machine
* Open MPI 4.1.7a1 from HPC-X v2.18
* Rocky Linux 8.6 host, Rocky Linux 8.8 container
* Podman 4.4.1
* This machine has a Mellanox ConnectX-6 Lx NIC, however we're avoiding the Mellanox software stack by running in a container, and these are single-node jobs only

We tried passing "--bind-to none" to the running jobs, and while this seemed to reduce the number of failing jobs on average
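For reference, combining the options mentioned in this thread, a single test launch would look roughly like the line below; the test binary name is a placeholder, and on this Open MPI 4.1.x / HPC-X build the shared-memory BTL is "vader" rather than "sm":

mpiexec -np 4 --allow-run-as-root --bind-to none --report-bindings \
        --mca pml ob1 --mca btl self,vader \
        -x LD_LIBRARY_PATH <test-binary>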
[OMPI users] Helping interpreting error output
Good afternoon MPI fans of all ages,

Yet again, I'm getting an error that I'm having trouble interpreting. This time, I'm trying to run ior. I've done it a thousand times, but not on an NVIDIA DGX A100 with multiple NICs.

The ultimate command is the following:

/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun --mca btl '^openib' -np 4 -map-by ppr:4:node --allow-run-as-root --mca btl_openib_warn_default_gid_prefix 0 --mca btl_openib_if_exclude mlx5_0,mlx5_5,mlx5_6 --mca plm_base_verbose 0 --mca plm rsh /home/bcm/bin/bin/ior -w -r -z -e -C -t 1m -b 1g -s 1000 -o /mnt/test

It was suggested to me to use these MPI options. The error I get is the following.

--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded). Note that Open MPI stopped checking at the first component that it did not find.

Host:      dgx-02
Framework: pml
Component: ucx
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  mca_pml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[dgx-02:2399932] *** An error occurred in MPI_Init
[dgx-02:2399932] *** reported by process [2099773441,3]
[dgx-02:2399932] *** on a NULL communicator
[dgx-02:2399932] *** Unknown error
[dgx-02:2399932] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dgx-02:2399932] ***    and potentially your MPI job)

My first inclination was that it couldn't find ucx. So I loaded that module and re-ran it. I get the exact same error message.

I'm still checking whether the ucx module gets loaded when I run via Slurm; mdtest ran without issue, but I'm still confirming that.

Any thoughts? Thanks!

Jeff
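One quick check that may help narrow this down is whether this Open MPI installation actually ships the UCX PML component, and whether that component's UCX libraries can be resolved. A sketch, assuming the MCA components live under lib/openmpi (on some installs the directory is lib64/openmpi instead):

/cm/shared/apps/openmpi4/gcc/4.1.5/bin/ompi_info | grep -i ucx

# if a "pml: ucx" component is listed, check that its shared-library dependencies resolve
ldd /cm/shared/apps/openmpi4/gcc/4.1.5/lib/openmpi/mca_pml_ucx.so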
Re: [OMPI users] [EXTERNAL] Helping interpreting error output
Hi Jeffrey,

I would suggest trying to debug what may be going wrong with UCX on your DGX box. There are several things to try from the UCX FAQ: https://openucx.readthedocs.io/en/master/faq.html

I'd suggest setting the UCX_LOG_LEVEL environment variable to info or debug and seeing whether UCX says something about what's going wrong. Also add --mca plm_base_verbose 10 to the mpirun command line.

Have you used DGX boxes with only a single NIC successfully?

Howard
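Concretely, Howard's two suggestions applied to the original command line amount to something like the following; the openib-related MCA options are omitted here for brevity, and UCX_LOG_LEVEL can be raised from info to debug for more detail:

/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun --mca btl '^openib' \
    -np 4 -map-by ppr:4:node --allow-run-as-root \
    --mca plm rsh --mca plm_base_verbose 10 \
    -x UCX_LOG_LEVEL=info \
    /home/bcm/bin/bin/ior -w -r -z -e -C -t 1m -b 1g -s 1000 -o /mnt/test

Separately, running ucx_info -d on the node shows which devices and transports UCX itself detects, which can help confirm whether UCX is usable there at all.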