"We always advise against --allow-run-as-root"

Just saying that, in my experience, it is common to run IO tests as root.
I agree - run the thing as a normal user, and if that means a few seconds
of extra work to set up permissions on the target filesystem, it is time
well spent.
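
For example, something like this would do it (a sketch, using the target
directory from the mdtest command quoted below; adjust the owner and path
as needed):

    sudo mkdir -p /raid/bcm/mdtest
    sudo chown -R "$USER" /raid/bcm/mdtest

After that, mdtest can run as a normal user without --allow-run-as-root.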

On Fri, 8 Mar 2024 at 14:55, Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> (sorry this is so long – it's a bunch of explanations followed by 2
> suggestions at the bottom)
>
> One additional thing worth mentioning is that your mpirun command line
> does not seem to be explicitly asking for the "ucx" PML component, but the
> error message you're getting indicates that you specifically asked for the
> "ucx" PML.  Here's your command line, line-broken and re-ordered for ease
> of reading:
>
> /cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun \
>     -np 1 \
>     -map-by ppr:1:node \
>     --allow-run-as-root \
>     --mca btl '^openib' \
>     --mca btl_openib_warn_default_gid_prefix 0 \
>     --mca btl_openib_if_exclude mlx5_0,mlx5_5,mlx5_6 \
>     --mca plm_base_verbose 0 \
>     --mca plm rsh \
>     /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -u -d /raid/bcm/mdtest
>
> A few things of note on your parameters:
>
>
>    - With the "btl" parameter, you're specifically telling Open MPI to
>    skip using the openib component.  But then you pass in 2 btl_openib_*
>    parameters anyway (which will just be ignored, because you told Open MPI
>    not to use openib).  This is harmless, but worth mentioning; a trimmed
>    command is shown after this list.
>    - You explicitly set plm_base_verbose to 0, but 0 is the default
>    value.  Again, this is harmless (i.e., it's unnecessary because you're
>    setting it to the same as the default value), but I thought I'd point it
>    out.
>    - You're explicitly setting the plm value (Process Lifecycle Management
>    – i.e., how Open MPI launches remote executables), but you're not
>    specifying any remote hosts.  In this local-only case, Open MPI will
>    effectively just fork/exec the process locally.  So specifying the plm
>    isn't needed.  Again, harmless, but I thought I'd point it out.
>    - We always advise against --allow-run-as-root.  If you have a strong
>    need for it, ok – that's what it's there for, after all – but it
>    definitely isn't recommended.
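>
> Putting those observations together, a trimmed version of your command
> with the no-op parameters removed would look something like this (keep
> --allow-run-as-root only if you truly must run as root):
>
> /cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun \
>     -np 1 \
>     -map-by ppr:1:node \
>     --allow-run-as-root \
>     --mca btl '^openib' \
>     /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -u -d /raid/bcm/mdtest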
>
>
> I suspect some environment variable and/or config file is telling Open
> MPI to set the pml to ucx (perhaps from your environment modules?).  Look
> in your environment for OMPI_MCA_pml=ucx, or something similar.
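>
> A quick way to check (a sketch – the first config-file path assumes the
> standard Open MPI layout under your install prefix, and the second is the
> per-user config file, which may not exist):
>
> env | grep -i ompi_mca
> grep -i pml /cm/shared/apps/openmpi4/gcc/4.1.5/etc/openmpi-mca-params.conf
> grep -i pml ~/.openmpi/mca-params.conf
> module show openmpi4/gcc/4.1.5 2>&1 | grep -i pml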
>
> That being said, the command line always trumps environment variables and
> config files in Open MPI.  So what Howard said – mpirun --mca pml '^ucx'
> ... – will effectively override any env variable or config file
> specification telling Open MPI to use the UCX PML.
>
> And all *that* being said, the full error message says that the UCX PML
> may not have been able to be loaded.  That might mean that the UCX PML
> isn't present (i.e., that plugin literally isn't present in the
> filesystem), but it may also mean that the plugin was present and Open MPI
> tried to load it, and failed.  This typically means that shared library
> dependencies of that plugin weren't able to be loaded by the linker, so the
> linker gave up and simply told Open MPI "sorry, I can't dynamically open
> that plugin."  Open MPI basically just passed the error on to you.
>
> To figure out which is the case, you might want to run with mpirun --mca
> mca_component_show_load_errors 1 ...  This will tell Open MPI to display
> errors when it tries to load a plugin but fails (e.g., due to the linker
> not being able to find dependent libraries).  This is probably what I would
> do first – you might find that the dgx-14 node is either missing some
> libraries, or your LD_LIBRARY_PATH is not set correctly to find dependent
> libraries, or some such.
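>
> As a concrete sketch (the plugin filename follows Open MPI's standard
> mca_<framework>_<component>.so naming, and the lib directory assumes your
> install prefix – run both on dgx-14):
>
> mpirun --mca mca_component_show_load_errors 1 -np 1 hostname
> ldd /cm/shared/apps/openmpi4/gcc/4.1.5/lib/openmpi/mca_pml_ucx.so | grep 'not found'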
>
> Hope that helps!
>
>
> ------------------------------
> *From:* users <users-boun...@lists.open-mpi.org> on behalf of Pritchard
> Jr., Howard via users <users@lists.open-mpi.org>
> *Sent:* Thursday, March 7, 2024 3:01 PM
> *To:* Open MPI Users <users@lists.open-mpi.org>
> *Cc:* Pritchard Jr., Howard <howa...@lanl.gov>
> *Subject:* Re: [OMPI users] [EXTERNAL] Help deciphering error message
>
>
> Hello Jeffrey,
>
>
>
> A couple of things to try first.
>
>
>
> Try running without UCX.  Add --mca pml ^ucx to the mpirun command line.
> If the app functions without UCX, then the next thing is to see what may be
> going wrong with UCX and the Open MPI components that use it.
>
>
>
> You may want to set the UCX_LOG_LEVEL environment variable to see if Open
> MPI’s UCX PML component is actually able to initialize UCX and start trying
> to use it.
>
>
>
> See https://openucx.readthedocs.io/en/master/faq.html for an example of
> how to do this using mpirun, and the type of output you should expect.
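>
> For instance, something like this (a sketch – mpirun's -x flag exports an
> environment variable to the launched processes; substitute your actual
> binary for ./your_app):
>
> mpirun -np 1 -x UCX_LOG_LEVEL=debug ./your_app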
>
>
>
> Another simple thing to try is
>
>
>
> mpirun -np 1 ucx_info -v
>
>
>
> and see if you get something like this back on stdout:
>
> # Library version: 1.14.0
>
> # Library path: /usr/lib64/libucs.so.0
>
> # API headers version: 1.14.0
>
> # Git branch '', revision f8877c5
>
> # Configured with: --build=aarch64-redhat-linux-gnu
> --host=aarch64-redhat-linux-gnu --program-prefix=
> --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr
> --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc
> --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64
> --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib
> --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations
> --disable-logging --disable-debug --disable-assertions --enable-mt
> --disable-params-check --without-go --without-java --enable-cma --with-cuda
> --with-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm
> --with-xpmem --without-fuse3 --without-ugni --with-cuda=/usr/local/cuda-11.7
>
> Are you running the mpirun command on dgx-14?  If that's a different host,
> a likely problem is that, for some reason, the environment from your
> ucx/1.10.1 module is not getting picked up on dgx-14.
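>
> One quick sanity check (a sketch – this prints the environment that a
> process launched on dgx-14 actually sees):
>
> mpirun -np 1 -host dgx-14 env | grep -iE 'ucx|ld_library_path'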
>
>
>
> One other thing: if the UCX module name indicates the version of UCX, it's
> rather old.  I'd suggest, if possible, updating to a newer version, like
> 1.14.1 or newer.  There are many enhancements in more recent versions of
> UCX for GPU support, and I would bet you'd want those for your DGX boxes.
>
>
>
> Howard
>
>
>
> *From: *users <users-boun...@lists.open-mpi.org> on behalf of Jeffrey
> Layton via users <users@lists.open-mpi.org>
> *Reply-To: *Open MPI Users <users@lists.open-mpi.org>
> *Date: *Thursday, March 7, 2024 at 11:53 AM
> *To: *Open MPI Users <users@lists.open-mpi.org>
> *Cc: *Jeffrey Layton <layto...@gmail.com>
> *Subject: *[EXTERNAL] [OMPI users] Help deciphering error message
>
>
>
> Good afternoon,
>
>
>
> I'm getting an error message I'm not sure how to use to debug an issue.
> I'll try to give you all of the pertinent information about the setup, but
> I didn't build the system or install the software. It's an NVIDIA SuperPod
> system with Base Command Manager 10.0.
>
>
>
> I'm building IOR but I'm really interested in mdtest. "module list" says
> I'm using the following modules:
>
>
>
> gcc/64/4.1.5a1
>
> ucx/1.10.1
>
> openmpi4/gcc/4.1.5
>
>
>
> There are no problems building the code.
>
>
>
> I'm using Slurm to run mdtest using a script. The output from the script
> and Slurm is the following (the command to run it is included).
>
>
>
>
>
> /cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun --mca btl '^openib' -np 1
> -map-by ppr:1:node --allow-run-as-root --mca
> btl_openib_warn_default_gid_prefix 0 --mca btl_openib_if_exclude
> mlx5_0,mlx5_5,mlx5_6 --mca plm_base_verbose 0 --mca plm rsh
> /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -u -d /raid/bcm/mdtest
>
> --------------------------------------------------------------------------
>
> A requested component was not found, or was unable to be opened.  This
>
> means that this component is either not installed or is unable to be
>
> used on your system (e.g., sometimes this means that shared libraries
>
> that the component requires are unable to be found/loaded).  Note that
>
> Open MPI stopped checking at the first component that it did not find.
>
>
>
> Host:      dgx-14
>
> Framework: pml
>
> Component: ucx
>
> --------------------------------------------------------------------------
>
> [dgx-14:4055623] [[42340,0],0] ORTE_ERROR_LOG: Data unpack would read past
> end of buffer in file util/show_help.c at line 501
>
> [dgx-14:4055632] *** An error occurred in MPI_Init
>
> [dgx-14:4055632] *** reported by process [2774794241,0]
>
> [dgx-14:4055632] *** on a NULL communicator
>
> [dgx-14:4055632] *** Unknown error
>
> [dgx-14:4055632] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,
>
> [dgx-14:4055632] ***    and potentially your MPI job)
>
>
>
>
>
> Any pointers/help is greatly appreciated.
>
>
>
> Thanks!
>
>
>
> Jeff
>