Hello Jeffrey,

A couple of things to try first.

Try running without UCX.  Add --mca pml ^ucx to the mpirun command line.  If 
the app functions without UCX, then the next step is to see what may be going 
wrong with UCX and the Open MPI components that use it.
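For example, adapting the command from your script (a sketch using the same 
paths and mdtest arguments you posted; keep whatever other MCA options you need):

/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun --mca pml '^ucx' --mca btl '^openib' -np 1 -map-by ppr:1:node /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -d /raid/bcm/mdtest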

You may want to set the UCX_LOG_LEVEL environment variable to see if Open MPI’s 
UCX PML component is actually able to initialize UCX and start trying to use it.

See https://openucx.readthedocs.io/en/master/faq.html for an example of how to 
do this with mpirun and the type of output you should expect.
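
For instance, something along these lines (the "info" log level and the 
./your_app placeholder are just illustrative; substitute your real mdtest 
invocation):

mpirun -np 1 -x UCX_LOG_LEVEL=info ./your_app

The -x option tells mpirun to export the environment variable to the launched 
processes; you can also set it in your Slurm script before calling mpirun.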

Another simple thing to try is

mpirun -np 1 ucx_info -v


and see if you get something like this back on stdout:

# Library version: 1.14.0
# Library path: /usr/lib64/libucs.so.0
# API headers version: 1.14.0
# Git branch '', revision f8877c5
# Configured with: --build=aarch64-redhat-linux-gnu 
--host=aarch64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking 
--prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin 
--sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include 
--libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var 
--sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info 
--disable-optimizations --disable-logging --disable-debug --disable-assertions 
--enable-mt --disable-params-check --without-go --without-java --enable-cma 
--with-cuda --with-gdrcopy --with-verbs --with-knem --with-rdmacm 
--without-rocm --with-xpmem --without-fuse3 --without-ugni 
--with-cuda=/usr/local/cuda-11.7

Are you running the mpirun command on dgx-14?  If that's a different host, a 
likely problem is that, for some reason, the information from your ucx/1.10.1 
module is not getting picked up on dgx-14.
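One quick way to check that (the component path below is my guess based on your 
Open MPI prefix; adjust it if the install layout differs):

mpirun -np 1 --host dgx-14 ldd /cm/shared/apps/openmpi4/gcc/4.1.5/lib/openmpi/mca_pml_ucx.so | grep -i ucx

If the UCX libraries show up as "not found" when run on dgx-14, that would 
explain the pml/ucx component failing to open there.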

One other thing: if the UCX module name is indicating the version of UCX, it's 
rather old.  I'd suggest, if possible, updating to a newer version, like 1.14.1 
or newer.  There are many enhancements in more recent versions of UCX for GPU 
support, and I would bet you'd want that for your DGX boxes.

Howard

From: users <users-boun...@lists.open-mpi.org> on behalf of Jeffrey Layton via 
users <users@lists.open-mpi.org>
Reply-To: Open MPI Users <users@lists.open-mpi.org>
Date: Thursday, March 7, 2024 at 11:53 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Jeffrey Layton <layto...@gmail.com>
Subject: [EXTERNAL] [OMPI users] Help deciphering error message

Good afternoon,

I'm getting an error message I'm not sure how to use to debug an issue. I'll 
try to give you all of the pertinent information about the setup, but I didn't 
build the system or install the software. It's an NVIDIA SuperPod system with 
Base Command Manager 10.0.

I'm building IOR but I'm really interested in mdtest. "module list" says I'm 
using the following modules:

gcc/64/4.1.5a1
ucx/1.10.1
openmpi4/gcc/4.1.5

There are no problems building the code.

I'm using Slurm to run mdtest using a script. The output from the script and 
Slurm is the following (the command to run it is included).


/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun --mca btl '^openib' -np 1 -map-by 
ppr:1:node --allow-run-as-root --mca btl_openib_warn_default_gid_prefix 0 --mca 
btl_openib_if_exclude mlx5_0,mlx5_5,mlx5_6 --mca plm_base_verbose 0
 --mca plm rsh /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -u -d 
/raid/bcm/mdtest
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      dgx-14
Framework: pml
Component: ucx
--------------------------------------------------------------------------
[dgx-14:4055623] [[42340,0],0] ORTE_ERROR_LOG: Data unpack would read past end 
of buffer in file util/show_help.c at line 501
[dgx-14:4055632] *** An error occurred in MPI_Init
[dgx-14:4055632] *** reported by process [2774794241,0]
[dgx-14:4055632] *** on a NULL communicator
[dgx-14:4055632] *** Unknown error
[dgx-14:4055632] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,
[dgx-14:4055632] ***    and potentially your MPI job)


Any pointers/help is greatly appreciated.

Thanks!

Jeff



