(sorry this is so long – it's a bunch of explanations followed by 2 suggestions at the bottom)

One additional thing worth mentioning is that your mpirun command line does not seem to explicitly ask for the "ucx" PML component, but the error message you're getting indicates that you specifically asked for the "ucx" PML.

Here's your command line, line-broken and re-ordered for ease of reading:

    /cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun \
        -np 1 \
        -map-by ppr:1:node \
        --allow-run-as-root \
        --mca btl '^openib' \
        --mca btl_openib_warn_default_gid_prefix 0 \
        --mca btl_openib_if_exclude mlx5_0,mlx5_5,mlx5_6 \
        --mca plm_base_verbose 0 \
        --mca plm rsh \
        /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -u -d /raid/bcm/mdtest

A few things of note on your parameters:

* With the "btl" parameter, you're specifically telling Open MPI to skip using the openib component. But then you pass in 2 btl_openib_* parameters anyway (which will just be ignored, because you told Open MPI not to use openib). This is harmless, but worth mentioning.

* You explicitly set plm_base_verbose to 0, but 0 is the default value. Again, this is harmless (i.e., it's unnecessary because you're setting it to the same as the default value), but I thought I'd point it out.

* You're explicitly setting the plm value (Process Lifecycle Management – i.e., how Open MPI launches remote executables), but you're not specifying any remote hosts. In this local-only case, Open MPI will effectively just fork/exec the process locally. So specifying the plm isn't needed. Again, harmless, but I thought I'd point it out.

* We always advise against --allow-run-as-root. If you have a strong need for it, ok – that's what it's there for, after all – but it definitely isn't recommended.

I suspect you have some environment variables and/or a config file that is telling Open MPI to set the pml to ucx (perhaps from your environment modules?). Look in your environment for OMPI_MCA_pml=ucx, or something similar.

That being said, the command line always trumps environment variables and config files in Open MPI.
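
A minimal sketch of how one might look for such settings, assuming a bash shell. The function names are made up for illustration, and the system-wide file path is an assumption built from the mpirun prefix in the command line above. Open MPI reads environment variables of the form OMPI_MCA_<name> and, by default, the files $HOME/.openmpi/mca-params.conf and <prefix>/etc/openmpi-mca-params.conf.

```shell
#!/usr/bin/env bash
# Sketch: look for places other than the mpirun command line that may set the pml.
# The function names below are hypothetical helpers, not part of Open MPI.

# Print any Open MPI MCA parameters set through the environment.
# Open MPI reads variables named OMPI_MCA_<param>, e.g. OMPI_MCA_pml=ucx.
list_mca_env() {
    env | grep '^OMPI_MCA_' || true
}

# Print uncommented "pml = ..." settings from the MCA parameter files
# given as arguments. Files that do not exist or are unreadable are skipped.
list_mca_file_pml() {
    local f
    for f in "$@"; do
        [ -r "$f" ] || continue
        grep -H -E '^[[:space:]]*pml[[:space:]]*=' "$f" || true
    done
}

list_mca_env
# ASSUMPTION: the install prefix is taken from the mpirun path in this thread.
list_mca_file_pml "$HOME/.openmpi/mca-params.conf" \
    /cm/shared/apps/openmpi4/gcc/4.1.5/etc/openmpi-mca-params.conf
```

If either function prints a pml setting, that is a likely source of the "ucx" request; running "module show" on the loaded modules may reveal which one exports it.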
So what Howard said – mpirun --mca pml '^ucx' ... – will effectively override any env variable or config file specifications telling Open MPI to use the UCX PML.

And all that being said, the full error message says that the UCX PML may not have been able to be loaded. That might mean that the UCX PML isn't present (i.e., that plugin literally isn't present in the filesystem), but it may also mean that the plugin was present and Open MPI tried to load it, and failed. This typically means that shared library dependencies of that plugin weren't able to be loaded by the linker, so the linker gave up and simply told Open MPI "sorry, I can't dynamically open that plugin." Open MPI basically just passed on the error to you.

To figure out which is the case, you might want to run with:

    mpirun --mca mca_component_show_load_errors 1 ...

This will tell Open MPI to display errors when it tries to load a plugin but fails (e.g., due to the linker not being able to find dependent libraries). This is probably what I would do first – you might find that the dgx-14 node either is missing some libraries, or your LD_LIBRARY_PATH is not set correctly to find dependent libraries, or somesuch.

Hope that helps!

________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Pritchard Jr., Howard via users <users@lists.open-mpi.org>
Sent: Thursday, March 7, 2024 3:01 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Pritchard Jr., Howard <howa...@lanl.gov>
Subject: Re: [OMPI users] [EXTERNAL] Help deciphering error message

Hello Jeffrey,

A couple of things to try first.

Try running without UCX. Add --mca pml ^ucx to the mpirun command line. If the app functions without ucx, then the next thing is to see what may be going wrong with UCX and the Open MPI components that use it.

You may want to set the UCX_LOG_LEVEL environment variable to see if Open MPI’s UCX PML component is actually able to initialize UCX and start trying to use it.
See https://openucx.readthedocs.io/en/master/faq.html for an example of how to do this using mpirun and the type of output you should be getting.

Another simple thing to try is

    mpirun -np 1 ucx_info -v

and see if you get something like this back on stdout:

    Library version: 1.14.0
    # Library path: /usr/lib64/libucs.so.0
    # API headers version: 1.14.0
    # Git branch '', revision f8877c5
    # Configured with: --build=aarch64-redhat-linux-gnu --host=aarch64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-go --without-java --enable-cma --with-cuda --with-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --with-cuda=/usr/local/cuda-11.7

Are you running the mpirun command on dgx-14? If that’s a different host, a likely problem is that, for some reason, the information in your ucx/1.10.1 module is not getting picked up on dgx-14.

One other thing: if the UCX module name indicates the version of UCX, it’s rather old. I’d suggest, if possible, updating to a newer version, like 1.14.1 or newer. There are many enhancements in more recent versions of UCX for GPU support, and I would bet you’d want that for your DGX boxes.
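
As a rough sketch of the version check suggested above, assuming bash and a sort that supports -V (as GNU coreutils does): the two function names are hypothetical helpers, and the parsing only assumes the "Library version: X.Y.Z" line shown in the sample output.

```shell
#!/usr/bin/env bash
# Sketch: check whether the UCX version reported by `ucx_info -v` meets a minimum.
# Both functions are hypothetical helpers, not part of UCX or Open MPI.

# Extract the version number from `ucx_info -v` output read on stdin.
# Matches the "Library version: X.Y.Z" line, with or without a leading "#".
ucx_library_version() {
    sed -n 's/^[#[:space:]]*Library version:[[:space:]]*\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Succeed if version $1 is at least version $2.
# ASSUMPTION: relies on `sort -V` (version sort), as provided by GNU coreutils.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example use on a node (commented out because it needs ucx_info and mpirun in PATH):
#   v="$(mpirun -np 1 ucx_info -v | ucx_library_version)"
#   version_at_least "$v" 1.14.1 || echo "UCX $v is older than 1.14.1"
```

Running the commented-out example on both the submission host and dgx-14 would also show whether the two nodes pick up the same UCX library.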
Howard

From: users <users-boun...@lists.open-mpi.org> on behalf of Jeffrey Layton via users <users@lists.open-mpi.org>
Reply-To: Open MPI Users <users@lists.open-mpi.org>
Date: Thursday, March 7, 2024 at 11:53 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Jeffrey Layton <layto...@gmail.com>
Subject: [EXTERNAL] [OMPI users] Help deciphering error message

Good afternoon,

I'm getting an error message I'm not sure how to use to debug an issue. I'll try to give you all of the pertinent information about the setup, but I didn't build the system nor install the software. It's an NVIDIA SuperPod system with Base Command Manager 10.0. I'm building IOR but I'm really interested in mdtest.

"module list" says I'm using the following modules:

    gcc/64/4.1.5a1
    ucx/1.10.1
    openmpi4/gcc/4.1.5

There are no problems building the code. I'm using Slurm to run mdtest using a script. The output from the script and Slurm is the following (the command to run it is included).

/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun --mca btl '^openib' -np 1 -map-by ppr:1:node --allow-run-as-root --mca btl_openib_warn_default_gid_prefix 0 --mca btl_openib_if_exclude mlx5_0,mlx5_5,mlx5_6 --mca plm_base_verbose 0 --mca plm rsh /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -u -d /raid/bcm/mdtest
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.
This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded).
Note that Open MPI stopped checking at the first component that it did not find.

Host: dgx-14
Framework: pml
Component: ucx
--------------------------------------------------------------------------
[dgx-14:4055623] [[42340,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[dgx-14:4055632] *** An error occurred in MPI_Init
[dgx-14:4055632] *** reported by process [2774794241,0]
[dgx-14:4055632] *** on a NULL communicator
[dgx-14:4055632] *** Unknown error
[dgx-14:4055632] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dgx-14:4055632] *** and potentially your MPI job)

Any pointers/help is greatly appreciated.

Thanks!

Jeff