> We always advise against --allow-run-as-root

Just saying that, in my experience, it is common to run I/O tests as root. I agree: run the thing as a normal user, and if that means a few seconds of extra work to set up permissions on the target filesystem, it is time well spent.
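For example, a minimal sketch of that setup (the "bcm" user and the /raid/bcm/mdtest path are assumed from the mpirun command quoted later in this thread; adjust both for your system):

    # One-time setup as root: give the benchmark user its own target directory.
    mkdir -p /raid/bcm/mdtest
    chown -R bcm:bcm /raid/bcm/mdtest

    # Then run the benchmark as that user; no --allow-run-as-root needed.
    su - bcm -c '/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun -np 1 -map-by ppr:1:node \
        /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -u -d /raid/bcm/mdtest'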
On Fri, 8 Mar 2024 at 14:55, Jeff Squyres (jsquyres) via users <users@lists.open-mpi.org> wrote:

> (Sorry this is so long – it's a bunch of explanations followed by 2 suggestions at the bottom.)
>
> One additional thing worth mentioning is that your mpirun command line does not seem to explicitly ask for the "ucx" PML component, but the error message you're getting indicates that you specifically asked for the "ucx" PML. Here's your command line, line-broken and re-ordered for ease of reading:
>
>     /cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun \
>         -np 1 \
>         -map-by ppr:1:node \
>         --allow-run-as-root \
>         --mca btl '^openib' \
>         --mca btl_openib_warn_default_gid_prefix 0 \
>         --mca btl_openib_if_exclude mlx5_0,mlx5_5,mlx5_6 \
>         --mca plm_base_verbose 0 \
>         --mca plm rsh \
>         /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -u -d /raid/bcm/mdtest
>
> A few things of note on your parameters:
>
> - With the "btl" parameter, you're specifically telling Open MPI to skip using the openib component. But then you pass in 2 btl_openib_* parameters anyway (which will just be ignored, because you told Open MPI not to use openib). This is harmless, but worth mentioning.
> - You explicitly set plm_base_verbose to 0, but 0 is the default value. Again, this is harmless (i.e., it's unnecessary because you're setting it to the same as the default value), but I thought I'd point it out.
> - You're explicitly setting the plm value (Program Launch Module – i.e., how Open MPI launches remote executables), but you're not specifying any remote hosts. In this local-only case, Open MPI will effectively just fork/exec the process locally, so specifying the plm isn't needed. Again, harmless, but I thought I'd point it out.
> - We always advise against --allow-run-as-root. If you have a strong need for it, ok – that's what it's there for, after all – but it definitely isn't recommended.
>
> I suspect you have some environment variables and/or a config file that is telling Open MPI to set the pml to ucx (perhaps from your environment modules?). Look in your environment for OMPI_MCA_pml=ucx, or something similar.
>
> That being said, the command line always trumps environment variables and config files in Open MPI. So what Howard said – mpirun --mca pml '^ucx' ... – will effectively override any env variable or config file telling Open MPI to use the UCX PML.
>
> And all *that* being said, the full error message says that the UCX PML may not have been able to be loaded. That might mean that the UCX PML isn't present (i.e., that plugin literally isn't present in the filesystem), but it may also mean that the plugin was present, and Open MPI tried to load it and failed. This typically means that shared library dependencies of that plugin weren't able to be loaded by the linker, so the linker gave up and simply told Open MPI "sorry, I can't dynamically open that plugin." Open MPI basically just passed the error on to you.
>
> To figure out which is the case, you might want to run with mpirun --mca mca_component_show_load_errors 1 ... This will tell Open MPI to display errors when it tries to load a plugin but fails (e.g., due to the linker not being able to find dependent libraries). This is probably what I would do first – you might find that the dgx-14 node is missing some libraries, or your LD_LIBRARY_PATH is not set correctly to find dependent libraries, or some such.
>
> Hope that helps!
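To make the two suggestions above concrete, here is a minimal sketch (the install prefix is taken from the mpirun path in this thread; "your_app" is a placeholder for whatever MPI program you run; the plugin path assumes the standard Open MPI layout of $prefix/lib/openmpi and may differ on this install; run the checks on dgx-14):

    # Where is pml=ucx coming from? Check the environment (e.g., set by a
    # module) and the system-wide and per-user Open MPI config files.
    env | grep '^OMPI_MCA_'
    grep -H pml /cm/shared/apps/openmpi4/gcc/4.1.5/etc/openmpi-mca-params.conf \
                ~/.openmpi/mca-params.conf 2>/dev/null

    # Ask Open MPI to report why a plugin failed to load, instead of
    # silently skipping it.
    mpirun --mca mca_component_show_load_errors 1 -np 1 ./your_app

    # Independently: is the UCX PML plugin present at all, and do its
    # shared-library dependencies resolve on this node?
    ls  /cm/shared/apps/openmpi4/gcc/4.1.5/lib/openmpi/mca_pml_ucx.so
    ldd /cm/shared/apps/openmpi4/gcc/4.1.5/lib/openmpi/mca_pml_ucx.so | grep 'not found'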
> ------------------------------
> *From:* users <users-boun...@lists.open-mpi.org> on behalf of Pritchard Jr., Howard via users <users@lists.open-mpi.org>
> *Sent:* Thursday, March 7, 2024 3:01 PM
> *To:* Open MPI Users <users@lists.open-mpi.org>
> *Cc:* Pritchard Jr., Howard <howa...@lanl.gov>
> *Subject:* Re: [OMPI users] [EXTERNAL] Help deciphering error message
>
> Hello Jeffrey,
>
> A couple of things to try first.
>
> Try running without UCX. Add --mca pml ^ucx to the mpirun command line. If the app functions without ucx, then the next thing is to see what may be going wrong with UCX and the Open MPI components that use it.
>
> You may want to set the UCX_LOG_LEVEL environment variable to see if Open MPI's UCX PML component is actually able to initialize UCX and start trying to use it.
>
> See https://openucx.readthedocs.io/en/master/faq.html for an example of how to do this using mpirun, and the type of output you should be getting.
>
> Another simple thing to try is
>
>     mpirun -np 1 ucx_info -v
>
> and see if you get something like this back on stdout:
>
>     Library version: 1.14.0
>     # Library path: /usr/lib64/libucs.so.0
>     # API headers version: 1.14.0
>     # Git branch '', revision f8877c5
>     # Configured with: --build=aarch64-redhat-linux-gnu --host=aarch64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-go --without-java --enable-cma --with-cuda --with-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --with-cuda=/usr/local/cuda-11.7
>
> Are you running the mpirun command on dgx-14? If that's a different host, a likely problem is that, for some reason, the information in your ucx/1.10.1 module is not getting picked up on dgx-14.
>
> One other thing: if the UCX module name is indicating the version of UCX, it's rather old. I'd suggest, if possible, updating to a newer version, like 1.14.1 or newer. There are many enhancements in more recent versions of UCX for GPU support, and I would bet you'd want that for your DGX boxes.
>
> Howard
>
> *From: *users <users-boun...@lists.open-mpi.org> on behalf of Jeffrey Layton via users <users@lists.open-mpi.org>
> *Reply-To: *Open MPI Users <users@lists.open-mpi.org>
> *Date: *Thursday, March 7, 2024 at 11:53 AM
> *To: *Open MPI Users <users@lists.open-mpi.org>
> *Cc: *Jeffrey Layton <layto...@gmail.com>
> *Subject: *[EXTERNAL] [OMPI users] Help deciphering error message
>
> Good afternoon,
>
> I'm getting an error message I'm not sure how to use to debug an issue. I'll try to give you all of the pertinent details about the setup, but I didn't build the system nor install the software. It's an NVIDIA SuperPod system with Base Command Manager 10.0.
>
> I'm building IOR, but I'm really interested in mdtest. "module list" says I'm using the following modules:
>
>     gcc/64/4.1.5a1
>     ucx/1.10.1
>     openmpi4/gcc/4.1.5
>
> There are no problems building the code.
>
> I'm using Slurm to run mdtest using a script.
> The output from the script and Slurm is the following (the command to run it is included):
>
>     /cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun --mca btl '^openib' -np 1 \
>         -map-by ppr:1:node --allow-run-as-root \
>         --mca btl_openib_warn_default_gid_prefix 0 \
>         --mca btl_openib_if_exclude mlx5_0,mlx5_5,mlx5_6 \
>         --mca plm_base_verbose 0 --mca plm rsh \
>         /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -u -d /raid/bcm/mdtest
>
>     --------------------------------------------------------------------------
>     A requested component was not found, or was unable to be opened. This
>     means that this component is either not installed or is unable to be
>     used on your system (e.g., sometimes this means that shared libraries
>     that the component requires are unable to be found/loaded). Note that
>     Open MPI stopped checking at the first component that it did not find.
>
>     Host:      dgx-14
>     Framework: pml
>     Component: ucx
>     --------------------------------------------------------------------------
>     [dgx-14:4055623] [[42340,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
>     [dgx-14:4055632] *** An error occurred in MPI_Init
>     [dgx-14:4055632] *** reported by process [2774794241,0]
>     [dgx-14:4055632] *** on a NULL communicator
>     [dgx-14:4055632] *** Unknown error
>     [dgx-14:4055632] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>     [dgx-14:4055632] *** and potentially your MPI job)
>
> Any pointers/help is greatly appreciated.
>
> Thanks!
>
> Jeff
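Putting Howard's and Jeff's suggestions from earlier in the thread into runnable form, a hedged sketch (mpirun's -x flag exports an environment variable to the launched processes; "info" is one of UCX's standard log levels; "your_app" is a placeholder for the program being run):

    # Does the job work at all when the UCX PML is excluded?
    mpirun --mca pml '^ucx' -np 1 -map-by ppr:1:node \
        /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -u -d /raid/bcm/mdtest

    # If not, turn up UCX logging to see whether the UCX PML gets far
    # enough to initialize UCX.
    mpirun -x UCX_LOG_LEVEL=info -np 1 ./your_app

    # And confirm which UCX library mpirun's environment actually picks
    # up on dgx-14.
    mpirun -np 1 ucx_info -v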