Hi Gilles,

Thanks for your assistance.
I tried the recommended settings but got an error saying “sm” is no longer available in Open MPI 3.0+ and to use “vader” instead. I then tried with “--mca pml ob1 --mca btl self,vader” but ended up with the original error:

[podman-ci-rocky-8.8:09900] MCW rank 3 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:09899] MCW rank 2 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:09898] MCW rank 1 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:09897] MCW rank 0 is not bound (or bound to all available processors)

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:
#0  0xffffa202a917 in ???
#1  0xffffa20299a7 in ???
#2  0xffffa520079f in ???
#3  0xffffa1d0380c in ???
#4  0xffffa1d56fe7 in ???
#5  0xffffa1d57be7 in ???
#6  0xffffa1d5a5f7 in ???
#7  0xffffa1d5b35b in ???
#8  0xffffa17b8db7 in get_print_name_buffer at util/name_fns.c:106
#9  0xffffa17b8e1b in orte_util_print_jobids at util/name_fns.c:171
#10  0xffffa17b91eb in orte_util_print_name_args at util/name_fns.c:143
#11  0xffffa1822e93 in _process_name_print_for_opal at runtime/orte_init.c:68
#12  0xffff9ebe5e6f in process_event at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/pmix/pmix3x/pmix3x.c:255
#13  0xffffa16ec3cf in event_process_active_single_queue at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1370
#14  0xffffa16ec3cf in event_process_active at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1440
#15  0xffffa16ec3cf in opal_libevent2022_event_base_loop at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1644
#16  0xffffa16a9d93 in progress_engine at runtime/opal_progress_threads.c:105
#17  0xffffa1e678b7 in ???
#18  0xffffa1d03afb in ???
#19  0xffffffffffffffff in ???

The typical mpiexec options for each job include “-np 4 --allow-run-as-root --bind-to none --report-bindings” plus a “-x LD_LIBRARY_PATH=…” which passes the HPC-X and application environment.

I will get back to you with a core dump once I figure out the best way to generate and retrieve it from within our CI infrastructure.
Thanks again!

Regards,
Greg

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Gilles Gouaillardet via users
Sent: Tuesday, April 16, 2024 12:59 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Subject: Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available processors)" when running multiple jobs concurrently

Greg,

If Open MPI was built with UCX, your jobs will likely use UCX (and the shared memory provider) even if running on a single node. You can mpirun --mca pml ob1 --mca btl self,sm ... if you want to avoid using UCX.

What is a typical mpirun command line used under the hood by your "make test"?

Though the warning might be ignored, SIGILL is definitely an issue. I encourage you to have your app dump a core in order to figure out where this is coming from.

Cheers,

Gilles

On Tue, Apr 16, 2024 at 5:20 AM Greg Samonds via users <users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>> wrote:

Hello,

We’re running into issues with jobs failing in a non-deterministic way when running multiple jobs concurrently within a “make test” framework. “make test” is launched from within a shell script running inside a Podman container, and we’re typically running with “-j 20” and “-np 4” (20 jobs concurrently with 4 procs each). We’ve also tried reducing the number of jobs, to no avail.

Each time the battery of test cases is run, about 2 to 4 different jobs out of around 200 fail with the following errors:

[podman-ci-rocky-8.8:03528] MCW rank 1 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:03540] MCW rank 3 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:03519] MCW rank 0 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:03533] MCW rank 2 is not bound (or bound to all available processors)

Program received signal SIGILL: Illegal instruction.

Some info about our setup:

* Ampere Altra 80-core ARM machine
* Open MPI 4.1.7a1 from HPC-X v2.18
* Rocky Linux 8.6 host, Rocky Linux 8.8 container
* Podman 4.4.1
* This machine has a Mellanox ConnectX-6 Lx NIC, however we’re avoiding the Mellanox software stack by running in a container, and these are single-node jobs only

We tried passing “--bind-to none” to the running jobs, and while this seemed to reduce the number of failing jobs on average, it didn’t eliminate the issue.

We also encounter the following warning:

[1712927028.412063] [podman-ci-rocky-8:3519 :0] sock.c:514 UCX WARN unable to read somaxconn value from /proc/sys/net/core/somaxconn file

…however, as far as I can tell this is probably unrelated and occurs because the associated file isn’t accessible inside the container, and after checking the UCX source I can see that SOMAXCONN is picked up from the system headers anyway (see the P.S. below for the quick check I used).

If anyone has hints about how to work around this issue, we’d greatly appreciate it!

Thanks,
Greg