Hi Akshay,

I'm building both UCX and OpenMPI as you mention. The portions of the script read (UCX first, then OpenMPI):

  ./configure --prefix=/usr/local/ucx-cuda-install --with-cuda=/usr/local/cuda-10.1 --with-gdrcopy=/home/odyhpc/gdrcopy --disable-numa
  sudo make install

  ./configure --with-cuda=/usr/local/cuda-10.1 --with-cuda-libdir=/usr/local/cuda-10.1/lib64 --with-ucx=/usr/local/ucx-cuda-install --prefix=/opt/openmpi
  sudo make all install
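As a quick sanity check on those builds, something like the following should confirm that CUDA and gdrcopy support were actually picked up (a sketch, assuming the install prefixes above; the grep patterns are only illustrative):

  # UCX: version/build info, then available transports (look for cuda_copy, cuda_ipc, gdr_copy)
  /usr/local/ucx-cuda-install/bin/ucx_info -v
  /usr/local/ucx-cuda-install/bin/ucx_info -d | grep -i -e cuda -e gdr

  # Open MPI: the CUDA-support MCA parameter should report true
  /opt/openmpi/bin/ompi_info --parsable --all | grep mpi_built_with_cuda_support:value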
As far as the job submission, I have tried several combinations with different MCAs (yesterday I forgot to include the '--mca pml ucx' flag, as it had made no difference in the past). I just tried your suggested syntax (mpirun -np 2 --mca pml ucx --mca btl ^smcuda,openib ./osu_latency D H) with the same results. The latency times are of the same order no matter which flags I include.

As far as checking GPU usage, I'm not familiar with 'nvprof' and am simply using the basic continuous output (nvidia-smi -l). I'm trying all of this in a cloud environment, and my suspicion is that there might be some interference (maybe because of some virtualization component), but I cannot pinpoint the cause.

Thanks,
Arturo

From: Akshay Venkatesh <akshay.v.3...@gmail.com>
Sent: Friday, September 06, 2019 11:14 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Joshua Ladd <jladd.m...@gmail.com>; Arturo Fernandez <afernan...@odyhpc.com>
Subject: Re: [OMPI users] CUDA-aware codes not using GPU

Hi, Arturo.

Usually, for OpenMPI+UCX we use the following recipe. For UCX:

  ./configure --prefix=/path/to/ucx-cuda-install --with-cuda=/usr/local/cuda --with-gdrcopy=/usr
  make -j install

then OpenMPI:

  ./configure --with-cuda=/usr/local/cuda --with-ucx=/path/to/ucx-cuda-install
  make -j install

Can you run with the following to see if it helps:

  mpirun -np 2 --mca pml ucx --mca btl ^smcuda,openib ./osu_latency D H

There are details here that may be useful: https://www.open-mpi.org/faq/?category=runcuda#run-ompi-cuda-ucx

Also, note that for short messages the inter-node D->H path may not involve calling the CUDA API at all (relevant if you're using nvprof to detect CUDA activity), because the GPUDirect RDMA path and gdrcopy are used.
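One way to check for CUDA activity per rank is to wrap each rank in nvprof (a sketch, assuming nvprof is on the PATH; the output-file pattern is arbitrary, and %q{OMPI_COMM_WORLD_RANK} should expand to each rank's number):

  # One profile file per rank; open later with nvvp or nvprof -i
  mpirun -np 2 --mca pml ucx --mca btl ^smcuda,openib \
      nvprof -o osu_latency.%q{OMPI_COMM_WORLD_RANK}.nvprof ./osu_latency D H

Dropping the -o option instead prints a per-rank API/kernel summary straight to the terminal, which is usually enough to tell whether any CUDA calls happened at all; keep in mind the GPUDirect RDMA/gdrcopy caveat above for short messages.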
On Fri, Sep 6, 2019 at 7:36 AM Arturo Fernandez via users <users@lists.open-mpi.org> wrote:

Josh,
Thank you. Yes, I built UCX with CUDA and gdrcopy support. I also had to disable NUMA (--disable-numa) as requested during the installation.
AFernandez

Joshua Ladd wrote:
Did you build UCX with CUDA support (--with-cuda)?
Josh

On Thu, Sep 5, 2019 at 8:45 PM AFernandez via users <users@lists.open-mpi.org> wrote:

Hello OpenMPI Team,

I'm trying to use CUDA-aware OpenMPI, but the system simply ignores the GPU and the code runs on the CPUs. I've tried different software but will focus on the OSU benchmarks (collective and pt2pt communications). Let me provide some data about the configuration of the system:

- OFED v4.17-1-rc2 (the NIC is virtualized, but I also tried a Mellanox card with MOFED a few days ago and found the same issue)
- CUDA v10.1
- gdrcopy v1.3
- UCX 1.6.0
- OpenMPI 4.0.1

Everything looks good (CUDA programs work fine, MPI programs run on the CPUs without any problem), and ompi_info outputs what I was expecting (but maybe I'm missing something):

  mca:opal:base:param:opal_built_with_cuda_support:synonym:name:mpi_built_with_cuda_support
  mca:mpi:base:param:mpi_built_with_cuda_support:value:true
  mca:mpi:base:param:mpi_built_with_cuda_support:source:default
  mca:mpi:base:param:mpi_built_with_cuda_support:status:read-only
  mca:mpi:base:param:mpi_built_with_cuda_support:level:4
  mca:mpi:base:param:mpi_built_with_cuda_support:help:Whether CUDA GPU buffer support is built into library or not
  mca:mpi:base:param:mpi_built_with_cuda_support:enumerator:value:0:false
  mca:mpi:base:param:mpi_built_with_cuda_support:enumerator:value:1:true
  mca:mpi:base:param:mpi_built_with_cuda_support:deprecated:no
  mca:mpi:base:param:mpi_built_with_cuda_support:type:bool
  mca:mpi:base:param:mpi_built_with_cuda_support:synonym_of:name:opal_built_with_cuda_support
  mca:mpi:base:param:mpi_built_with_cuda_support:disabled:false

The available btls are the usual self, openib, tcp & vader, plus smcuda, uct & usnic. The full output from ompi_info is attached. If I try the flag '--mca opal_cuda_verbose 10', it doesn't output anything, which seems to agree with the lack of GPU use. If I try '--mca btl smcuda', it makes no difference. I have also tried to tell the program to use host and device (e.g., mpirun -np 2 ./osu_latency D H), but I get the same result. I am probably missing something but am not sure where else to look or what else to try.

Thank you,
AFernandez

--
-Akshay
NVIDIA
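If it remains unclear which path the transfers actually take, a more verbose run might help (a sketch; UCX_LOG_LEVEL=debug is extremely chatty, and the grep is only there to pull out the CUDA/gdrcopy-related lines):

  mpirun -np 2 --mca pml ucx --mca opal_cuda_verbose 10 \
      -x UCX_LOG_LEVEL=debug ./osu_latency D H 2>&1 | tee run.log
  grep -i -e cuda -e gdr run.log

If nothing CUDA-related appears even at that verbosity, that would point toward the build or the (virtualized) environment rather than the run-time flags.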