Hi Mehmet, Gilles,

Thanks for your support on this topic.
* I gave "--mca pml ^ucx" a try but unfortunately the jobs failed with “ MPI_INIT has failed because at least one MPI process is unreachable from another”. * We use a Python-based launcher which launches an mpiexec command through a subprocess, and the SIGILL error occurs immediately after this - before our Fortran application prints out any information or begins the simulation. * The “[podman-ci-rocky-8.8:09900] MCW rank 3 is not bound (or bound to all available processors)” messages can also occur in cases which are successful, and do not crash with a SIGILL (I only just realized this, so the subject of this email is not correct, sorry about that). * We do actually apply the full HPC-X environment. Our Python launcher has a mechanism which launches a shell and runs “source hpcx-init.sh” and “hpcx_load”, and it then copies back this environment so it can be passed to the mpiexec command. * We don’t encounter this issue when running in the same context using the Intel MPI/x86_64 version of our software, which uses the same source code branch and only differs in the libraries its linked with. * The full execution context is: Jenkins (Groovy-based) -> Python script -> Podman container -> Shell script -> Make test -> Python (launcher application) -> MPI Fortran application * We can’t reproduce the issue when omitting the first two steps by running on a similar machine outside of the CI (starting from the “Podman container” step) It’s very bizarre that we can only reproduce this within our Jenkins CI system, but not while running it directly, even with the same container, hardware, OS, kernel, etc. For the lack of a better idea, I wonder if there could possibly be some strange interaction between the JVM (for Jenkins) running on the machine and MPI, but I don’t see how the operating system could allow something like this to happen. Perhaps we can try increasing the verbosity of MPI’s output and comparing what we get from within the CI to what we get locally. 
Would "--mca btl_base_verbose 100" and "--mca pml_base_verbose 100" be the best way to do this, or would you recommend something more specific for this situation?

Regards,
Greg

From: Mehmet Oren <mehmet...@hotmail.com>
Sent: Wednesday, April 17, 2024 5:11 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Greg Samonds <greg.samo...@esi-group.com>; Adnane Khattabi <adnane.khatt...@esi-group.com>; Philippe Rouchon <philippe.rouc...@esi-group.com>
Subject: Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available processors)" when running multiple jobs concurrently

Hi Greg,

I am not an Open MPI expert, but I just wanted to share my experience with HPC-X.

1. Default HPC-X builds which come with the MOFED drivers are built with UCX, and as Gilles stated, specifying ob1 will not change the layer for Open MPI. You can try to discard UCX and let Open MPI decide the layer by adding "--mca pml ^ucx" to your command line.
2. HPC-X comes with two scripts named mpivars.sh and mpivars.csh under the bin folder. It could be a better option to source mpivars.sh before running your job instead of adding LD_LIBRARY_PATH. By sourcing this script, you can set up all required paths and environment variables easily and fix most of the runtime problems.
3. Also, please check hwloc and its dependencies, which usually are not present in default OS installations and container images.
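To illustrate the mpivars.sh suggestion above, a minimal job wrapper might look like this; HPCX_DIR, the exact location of mpivars.sh inside the install, and my_fortran_app are all assumptions to adapt to the real setup:

```shell
#!/bin/sh
# Hypothetical wrapper: set up the HPC-X environment by sourcing mpivars.sh
# instead of exporting LD_LIBRARY_PATH by hand.
HPCX_DIR=/opt/hpcx                  # placeholder for the actual install prefix
. "$HPCX_DIR/bin/mpivars.sh"        # sets PATH, LD_LIBRARY_PATH, etc.
exec mpiexec -np 4 ./my_fortran_app # placeholder application
```

The advantage over a hand-rolled LD_LIBRARY_PATH is that the script also sets the auxiliary variables the runtime expects, which a manual export can easily miss.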
Regards,
Mehmet

________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Greg Samonds via users <users@lists.open-mpi.org>
Sent: Tuesday, April 16, 2024 5:50 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Greg Samonds <greg.samo...@esi-group.com>; Adnane Khattabi <adnane.khatt...@esi-group.com>; Philippe Rouchon <philippe.rouc...@esi-group.com>
Subject: Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available processors)" when running multiple jobs concurrently

Hi Gilles,

Thanks for your assistance. I tried the recommended settings but got an error saying "sm" is no longer available in Open MPI 3.0+, and to use "vader" instead. I then tried with "--mca pml ob1 --mca btl self,vader" but ended up with the original error:

[podman-ci-rocky-8.8:09900] MCW rank 3 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:09899] MCW rank 2 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:09898] MCW rank 1 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:09897] MCW rank 0 is not bound (or bound to all available processors)

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:
#0 0xffffa202a917 in ???
#1 0xffffa20299a7 in ???
#2 0xffffa520079f in ???
#3 0xffffa1d0380c in ???
#4 0xffffa1d56fe7 in ???
#5 0xffffa1d57be7 in ???
#6 0xffffa1d5a5f7 in ???
#7 0xffffa1d5b35b in ???
#8 0xffffa17b8db7 in get_print_name_buffer at util/name_fns.c:106
#9 0xffffa17b8e1b in orte_util_print_jobids at util/name_fns.c:171
#10 0xffffa17b91eb in orte_util_print_name_args at util/name_fns.c:143
#11 0xffffa1822e93 in _process_name_print_for_opal at runtime/orte_init.c:68
#12 0xffff9ebe5e6f in process_event at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/pmix/pmix3x/pmix3x.c:255
#13 0xffffa16ec3cf in event_process_active_single_queue at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1370
#14 0xffffa16ec3cf in event_process_active at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1440
#15 0xffffa16ec3cf in opal_libevent2022_event_base_loop at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1644
#16 0xffffa16a9d93 in progress_engine at runtime/opal_progress_threads.c:105
#17 0xffffa1e678b7 in ???
#18 0xffffa1d03afb in ???
#19 0xffffffffffffffff in ???

The typical mpiexec options for each job include "-np 4 --allow-run-as-root --bind-to none --report-bindings" and a "-x LD_LIBRARY_PATH=…" which passes the HPC-X and application environment.

I will get back to you with a core dump once I figure out the best way to generate and retrieve it from within our CI infrastructure.

Thanks again!
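For reference, something along these lines might be enough to get the core out of the container; the assumption is that /proc/sys/kernel/core_pattern reflects the host's setting (it may point at systemd-coredump or a pipe), and the binary/core names below are placeholders:

```shell
# Allow core files in the shell that launches mpiexec.
ulimit -c unlimited
# See where the kernel will write cores; inside a container this is still
# the host's setting, so it may name a pipe handler rather than a file.
cat /proc/sys/kernel/core_pattern
# After a crash, a symbolic backtrace could be pulled with gdb
# (placeholder names):
#   gdb ./my_fortran_app core.12345 -ex bt -ex quit
```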
Regards,
Greg

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Gilles Gouaillardet via users
Sent: Tuesday, April 16, 2024 12:59 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Subject: Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available processors)" when running multiple jobs concurrently

Greg,

If Open MPI was built with UCX, your jobs will likely use UCX (and the shared memory provider) even if running on a single node.

You can "mpirun --mca pml ob1 --mca btl self,sm ..." if you want to avoid using UCX.

What is a typical mpirun command line used under the hood by your "make test"?

Though the warning might be ignored, SIGILL is definitely an issue. I encourage you to have your app dump a core in order to figure out where this is coming from.

Cheers,
Gilles

On Tue, Apr 16, 2024 at 5:20 AM Greg Samonds via users <users@lists.open-mpi.org> wrote:

Hello,

We're running into issues with jobs failing in a non-deterministic way when running multiple jobs concurrently within a "make test" framework. "make test" is launched from within a shell script running inside a Podman container, and we're typically running with "-j 20" and "-np 4" (20 jobs concurrently with 4 procs each). We've also tried reducing the number of jobs to no avail.
Each time the battery of test cases is run, about 2 to 4 different jobs out of around 200 fail with the following errors:

[podman-ci-rocky-8.8:03528] MCW rank 1 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:03540] MCW rank 3 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:03519] MCW rank 0 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:03533] MCW rank 2 is not bound (or bound to all available processors)

Program received signal SIGILL: Illegal instruction.

Some info about our setup:

* Ampere Altra 80-core ARM machine
* Open MPI 4.1.7a1 from HPC-X v2.18
* Rocky Linux 8.6 host, Rocky Linux 8.8 container
* Podman 4.4.1
* This machine has a Mellanox ConnectX-6 Lx NIC; however, we're avoiding the Mellanox software stack by running in a container, and these are single-node jobs only

We tried passing "--bind-to none" to the running jobs, and while this seemed to reduce the number of failing jobs on average, it didn't eliminate the issue.

We also encounter the following warning:

[1712927028.412063] [podman-ci-rocky-8:3519 :0] sock.c:514 UCX WARN unable to read somaxconn value from /proc/sys/net/core/somaxconn file

…however, as far as I can tell this is probably unrelated and occurs because the associated file isn't accessible inside the container, and after checking the UCX source I can see that SOMAXCONN is picked up from the system headers anyway.

If anyone has hints about how to work around this issue, we'd greatly appreciate it!

Thanks,
Greg
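A quick way to sanity-check the somaxconn theory is to run the read UCX attempts both inside and outside the container and compare (purely illustrative):

```shell
# Print the value UCX tries to read; if this fails only inside the
# container, that would confirm the warning is a namespace/visibility
# issue rather than anything related to the SIGILL.
cat /proc/sys/net/core/somaxconn 2>/dev/null \
  || echo "somaxconn not readable in this namespace"
```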