Hi Junchao, Victor,

I fixed the issue! The problem was with the CPU bindings: a Python instance 
effectively runs on only one core, so I modified the MPI launch script to make 
sure that each Python instance is bound to exactly one physical core.
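For reference, the change amounts to something like the following sketch. The core numbers here are hypothetical (they mirror the layout in Victor's script further down in this thread), and MV2_COMM_WORLD_LOCAL_RANK is the mvapich2-specific local-rank variable; adjust both for your node.

```shell
#!/bin/bash
# Hypothetical per-rank wrapper: bind each Python instance to ONE physical core.
# Core numbers follow the layout in Victor's script; adjust for your machine.
pick_core () {
    case $1 in
        0) echo 0  ;;
        1) echo 64 ;;
        2) echo 72 ;;
    esac
}

core=$(pick_core "${MV2_COMM_WORLD_LOCAL_RANK:-0}")
numactl --physcpubind="$core" "$@"
```

Used the same way as Victor's script, e.g. `mpirun -n 3 MV2_ENABLE_AFFINITY=0 ./launch python myscript.py`.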

Thank you both very much for your patience and help!

Best,
Anna
________________________________
From: Yesypenko, Anna <[email protected]>
Sent: Friday, February 2, 2024 2:12 PM
To: Junchao Zhang <[email protected]>
Cc: Victor Eijkhout <[email protected]>; [email protected] 
<[email protected]>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a 
node

Hi Junchao,

Unfortunately I don't have access to other CUDA machines with multiple GPUs.
I'm pretty stuck, and I think running on a different machine would help isolate 
the issue.

I'm sharing the python script and the launch script that Victor wrote.
There is a comment in the launch script with the MPI command I was using to run 
the Python script.
I configured hypre without unified memory. In case it's useful, I also attached 
the configure.log.

If the issue is with petsc/hypre, it may be related to the environment 
variables described here (e.g. HYPRE_MEMORY_DEVICE):
https://github.com/hypre-space/hypre/wiki/GPUs

Thank you for helping me troubleshoot this issue!
Best,
Anna

________________________________
From: Junchao Zhang <[email protected]>
Sent: Thursday, February 1, 2024 9:07 PM
To: Yesypenko, Anna <[email protected]>
Cc: Victor Eijkhout <[email protected]>; [email protected] 
<[email protected]>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a 
node

Hi, Anna,
  Do you have other CUDA machines to try? If you can share your test, then I 
will run it on Polaris@Argonne to see whether it is a petsc/hypre issue. If not, 
then it must be a GPU-MPI binding problem on TACC.

  Thanks
--Junchao Zhang


On Thu, Feb 1, 2024 at 5:31 PM Yesypenko, Anna 
<[email protected]<mailto:[email protected]>> wrote:
Hi Victor, Junchao,

Thank you for providing the script, it is very useful!
There are still issues with hypre not binding correctly, and I'm getting the 
error message occasionally (but much less often).
I added some additional environment variables to the script that seem to make 
the behavior more consistent.

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK    ## as Victor suggested
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK

The last environment variable is from hypre's documentation on GPUs.
In 30 runs for a small problem size, 4 fail with a hypre-related error. Do you 
have any other thoughts or suggestions?

Best,
Anna

________________________________
From: Victor Eijkhout 
<[email protected]<mailto:[email protected]>>
Sent: Thursday, February 1, 2024 11:26 AM
To: Junchao Zhang <[email protected]<mailto:[email protected]>>; 
Yesypenko, Anna <[email protected]<mailto:[email protected]>>
Cc: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a 
node


Only for mvapich2-gdr:



#!/bin/bash

# Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin

export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK

case $MV2_COMM_WORLD_LOCAL_RANK in
        [0]) cpus=0-3 ;;
        [1]) cpus=64-67 ;;
        [2]) cpus=72-75 ;;
esac

numactl --physcpubind=$cpus "$@"

