Re: [OMPI users] [EXTERNAL] Help deciphering error message

2024-03-08 Thread Jeff Squyres (jsquyres) via users
(sorry this is so long – it's a bunch of explanations followed by 2 suggestions 
at the bottom)

One additional thing worth mentioning is that your mpirun command line does not 
seem to explicitly be asking for the "ucx" PML component, but the error message 
you're getting indicates that you specifically asked for the "ucx" PML.  Here's 
your command line, line-broken and re-ordered for ease of reading:


/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun \
    -np 1 \
    -map-by ppr:1:node \
    --allow-run-as-root \
    --mca btl '^openib' \
    --mca btl_openib_warn_default_gid_prefix 0 \
    --mca btl_openib_if_exclude mlx5_0,mlx5_5,mlx5_6 \
    --mca plm_base_verbose 0 \
    --mca plm rsh \
    /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -u -d /raid/bcm/mdtest

A few things of note on your parameters:


  * With the "btl" parameter, you're specifically telling Open MPI to skip
    using the openib component.  But then you pass in 2 btl_openib_*
    parameters, anyway (which will just be ignored, because you told Open MPI
    to not use openib).  This is harmless, but worth mentioning.
  * You explicitly set plm_base_verbose to 0, but 0 is the default value.
    Again, this is harmless (i.e., it's unnecessary because you're setting it
    to the same as the default value), but I thought I'd point it out.
  * You're explicitly setting the plm value (Program Launch Module – i.e., how
    Open MPI launches remote executables), but you're not specifying any remote
    hosts.  In this local-only case, Open MPI will effectively just fork/exec
    the process locally.  So specifying the plm isn't needed.  Again, harmless,
    but I thought I'd point it out.
  * We always advise against --allow-run-as-root.  If you have a strong need
    for it, ok – that's what it's there for, after all – but it definitely
    isn't recommended.
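
As a rough sketch of where those notes lead, the command line could be trimmed as shown here. This assumes the job can run as a normal user (so --allow-run-as-root is dropped) and that nothing else depends on the removed parameters. The run_mdtest wrapper and the DRY_RUN variable are illustrative names, not part of Open MPI.

```shell
# Hypothetical wrapper; run_mdtest and DRY_RUN are illustrative names only.
run_mdtest() {
    # With DRY_RUN set to a non-empty value, print the command instead of
    # running it, so it can be reviewed first.
    ${DRY_RUN:+echo} /cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun \
        -np 1 \
        -map-by ppr:1:node \
        --mca btl '^openib' \
        /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -u -d /raid/bcm/mdtest
}
```

Setting DRY_RUN=1 before calling run_mdtest prints the command; leaving it unset launches the job.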

I suspect you have some environment variables and/or a config file that is 
telling Open MPI to set the pml to ucx (perhaps from your environment 
modules?).  Look in your environment for OMPI_MCA_pml=ucx, or something 
similar.
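
A minimal sketch of that check, assuming a POSIX shell. It lists OMPI_MCA_* environment variables and looks for a pml setting in the per-user and system-wide MCA parameter files. OMPI_PREFIX and find_ompi_mca_settings are hypothetical names; the default prefix is simply taken from the mpirun path above and may not match your install.

```shell
# Sketch: list places where an MCA parameter such as pml might be set.
# OMPI_PREFIX is a hypothetical variable; its default is the install prefix
# taken from the mpirun path in this thread.
find_ompi_mca_settings() {
    # 1. MCA parameters passed through the environment use the OMPI_MCA_ prefix.
    env | grep '^OMPI_MCA_' || echo "no OMPI_MCA_* variables set"

    # 2. Per-user and system-wide MCA parameter files, if they are readable.
    for f in "$HOME/.openmpi/mca-params.conf" \
             "${OMPI_PREFIX:-/cm/shared/apps/openmpi4/gcc/4.1.5}/etc/openmpi-mca-params.conf"
    do
        if [ -r "$f" ]; then
            echo "== $f"
            grep '^[[:space:]]*pml' "$f" || true
        fi
    done
}
```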

That being said, the command line always trumps environment variables and 
config files in Open MPI.  So what Howard said – mpirun --mca pml '^ucx' ... – 
will effectively override any env variable or config file specifications 
telling Open MPI to use the UCX PML.

And all *that* being said, the full error message says that the UCX PML may not 
have been able to be loaded.  That might mean that the UCX PML isn't present 
(i.e., that plugin literally isn't present in the filesystem), but it may also 
mean that the plugin was present and Open MPI tried to load it, and failed.  
This typically means that shared library dependencies of that plugin weren't 
able to be loaded by the linker, so the linker gave up and simply told Open MPI 
"sorry, I can't dynamically open that plugin."  Open MPI basically just passed 
on the error to you.

To figure out which is the case, you might want to run with mpirun --mca 
mca_component_show_load_errors 1 ...  This will tell Open MPI to display 
errors when it tries to load a plugin, but fails (e.g., due to the linker not 
being able to find dependent libraries).  This is probably what I would do 
first – you might find that the dgx-14 node either is missing some libraries, 
or your LD_LIBRARY_PATH is not set correctly to find dependent libraries, or 
somesuch.
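
The two cases (plugin missing vs. plugin present but unloadable) can also be checked by hand. A minimal sketch, assuming a Linux node with ldd available. The check_plugin name is illustrative, and the plugin path in the usage comment assumes the common <prefix>/lib/openmpi layout for dynamically loaded components, which may differ on your system.

```shell
# Sketch: report whether a plugin file exists and whether the dynamic
# linker can resolve all of its shared library dependencies.
check_plugin() {
    plugin=$1
    if [ ! -e "$plugin" ]; then
        echo "missing: $plugin"
        return 1
    fi
    # ldd prints "not found" next to each dependency it cannot resolve.
    if ldd "$plugin" 2>/dev/null | grep 'not found'; then
        echo "unresolved dependencies: $plugin"
        return 2
    fi
    echo "ok: $plugin"
}

# Usage (path is an assumption based on the mpirun prefix in this thread):
# check_plugin /cm/shared/apps/openmpi4/gcc/4.1.5/lib/openmpi/mca_pml_ucx.so
```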

Hope that helps!



From: users  on behalf of Pritchard Jr., 
Howard via users 
Sent: Thursday, March 7, 2024 3:01 PM
To: Open MPI Users 
Cc: Pritchard Jr., Howard 
Subject: Re: [OMPI users] [EXTERNAL] Help deciphering error message


Hello Jeffrey,

A couple of things to try first.

Try running without UCX.  Add --mca pml '^ucx' to the mpirun command line.  If 
the app functions without UCX, then the next thing is to see what may be going 
wrong with UCX and the Open MPI components that use it.

You may want to set the UCX_LOG_LEVEL environment variable to see if Open MPI's 
UCX PML component is actually able to initialize UCX and start trying to use it.

See https://openucx.readthedocs.io/en/master/faq.html for an example of how to 
do this using mpirun and the type of output you should be getting.
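
As a sketch of that suggestion: mpirun's -x option exports an environment variable to the launched processes, so the log level can be set per run. The info default and the run_with_ucx_log and DRY_RUN names are assumptions made for illustration; the UCX FAQ lists the levels your version supports.

```shell
# Hypothetical helper; run_with_ucx_log and DRY_RUN are illustrative names only.
run_with_ucx_log() {
    # $1 is the UCX log level (assumed default: info); the rest is the
    # program to run and its arguments.
    level=${1:-info}
    [ "$#" -gt 0 ] && shift
    # -x exports UCX_LOG_LEVEL to the launched processes.
    # With DRY_RUN set to a non-empty value, the command is only printed.
    ${DRY_RUN:+echo} mpirun -np 1 -x UCX_LOG_LEVEL="$level" "$@"
}
```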



Another simple thing to try is

mpirun -np 1 ucx_info -v

and see if you get something like this back on stdout:

# Library version: 1.14.0
# Library path: /usr/lib64/libucs.so.0
# API headers version: 1.14.0
# Git branch '', revision f8877c5
# Configured with: --build=aarch64-redhat-linux-gnu 
--host=aarch64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking 
--prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin 
--sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include 
--libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var 
--sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info 
--disable-optimizations --disable-logging --disable-debug --disable-assertions 
--enable-mt -

Re: [OMPI users] [EXTERNAL] Help deciphering error message

2024-03-08 Thread Jeffrey Layton via users
Howard, Jeff,

Thanks for the replies and the pointers. I started debugging this morning
and discovered I wasn't specifically using ucx but I had the module loaded.
So I removed the ucx module, added Howard's suggestion about removing ucx
on the CL, and it worked. UCX removal for the win.

Now, going back to the full command line... This is actually the result
of a script calling a script, which uses sbatch to submit the job. Too
convoluted, I know, but it's my starting point from some old previous work
by others. I will be cleaning up the scripts, of course, so I appreciate the
explanations, Jeff!

My OMPI env variables are really simple:

OMPI_MCA_btl=^openib,smcuda
OMPI_MCA_pml=UCX
OMPI_MCA_coll_hcoll_enable=0

I'm checking which of these get set by which module. I suspect that the pml
env variable is set when I load the ucx module. (Not quite sure why it
wasn't "unset" after the module is removed - need to check that.)
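
A rough sketch of that check, assuming Environment Modules or Lmod, where LOADEDMODULES holds a colon-separated list of loaded modules and "module show" prints what a module sets. The function name is illustrative only.

```shell
# Sketch: print each loaded module whose definition mentions OMPI_MCA_.
which_module_sets_ompi_mca() {
    old_ifs=$IFS
    IFS=:
    # LOADEDMODULES is split on ":" once, when the loop starts.
    for m in ${LOADEDMODULES:-}; do
        IFS=$old_ifs
        # "module show" may write to stderr, so merge both streams.
        if module show "$m" 2>&1 | grep 'OMPI_MCA_' >/dev/null; then
            echo "$m"
        fi
    done
    IFS=$old_ifs
}
```

If a module shows up here but the variable survives "module unload", the variable was probably also set somewhere else (a login script, for example).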

One mea culpa - I sometimes set parameters explicitly to their default
values in case the defaults change. I'm not saying this is the case in
Open MPI, but I've seen other packages in HPC where the defaults like to
change periodically. Also, sometimes it helps me debug.

Thanks Howard and Jeff!!! I really appreciate it. Now I need to improve and
keep up my Open MPI skills as poor as they are.

Jeff







Re: [OMPI users] [EXTERNAL] Help deciphering error message

2024-03-08 Thread John Hearns via users
> We always advise against --allow-run-as-root

Just saying that in my experience it is common to run IO tests as root.
I agree, though - run the thing as a normal user, and if that means a few
seconds of extra work to set up permissions on the target filesystem, it is
time well spent.
