Hi Junchao, Victor,

I fixed the issue! The problem was with the CPU bindings. A single Python process effectively executes on one core at a time (the GIL prevents thread-level parallelism within a process), so I had to modify the MPI launch script to make sure that each Python instance is bound to its own physical core.
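For reference, the core-binding part of such a launch wrapper might look like the following. This is a minimal sketch, not the exact script from the thread; it assumes mvapich2's MV2_COMM_WORLD_LOCAL_RANK variable, and the core numbers are illustrative and machine-specific.

```shell
#!/bin/bash
# Sketch: bind each MPI rank (one Python instance) to a single physical core.
# Rank index comes from mvapich2's MV2_COMM_WORLD_LOCAL_RANK (default 0).
rank=${MV2_COMM_WORLD_LOCAL_RANK:-0}
case $rank in
  0) core=0  ;;
  1) core=64 ;;
  2) core=72 ;;
  *) core=0  ;;
esac
echo "rank $rank -> physical core $core"
# In a real launcher, replace the echo above with:
#   exec numactl --physcpubind=$core "$@"
```

The core numbers would be chosen to match the node's NUMA topology (e.g. one core per socket near each GPU).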
Thank you both very much for your patience and help!

Best,
Anna

________________________________
From: Yesypenko, Anna <[email protected]>
Sent: Friday, February 2, 2024 2:12 PM
To: Junchao Zhang <[email protected]>
Cc: Victor Eijkhout <[email protected]>; [email protected]
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node

Hi Junchao,

Unfortunately I don't have access to other CUDA machines with multiple GPUs. I'm pretty stuck, and I think running on a different machine would help isolate the issue. I'm sharing the Python script and the launch script that Victor wrote. There is a comment in the launch script with the mpi command I was using to run the Python script. I configured hypre without unified memory. In case it's useful, I also attached the configure.log. If the issue is with petsc/hypre, it may be in the environment variables described here (e.g. HYPRE_MEMORY_DEVICE): https://github.com/hypre-space/hypre/wiki/GPUs

Thank you for helping me troubleshoot this issue!

Best,
Anna

________________________________
From: Junchao Zhang <[email protected]>
Sent: Thursday, February 1, 2024 9:07 PM
To: Yesypenko, Anna <[email protected]>
Cc: Victor Eijkhout <[email protected]>; [email protected]
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node

Hi, Anna,

Do you have other CUDA machines to try? If you can share your test, I will run it on Polaris@Argonne to see if it is a petsc/hypre issue. If not, then it must be a GPU-MPI binding problem on TACC.

Thanks,
--Junchao Zhang

On Thu, Feb 1, 2024 at 5:31 PM Yesypenko, Anna <[email protected]> wrote:

Hi Victor, Junchao,

Thank you for providing the script, it is very useful! There are still issues with hypre not binding correctly, and I'm getting the error message occasionally (but much less often).
I added some additional environment variables to the script that seem to make the behavior more consistent:

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK   ## as Victor suggested
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK

The last environment variable is from hypre's documentation on GPUs. In 30 runs for a small problem size, 4 fail with a hypre-related error. Do you have any other thoughts or suggestions?

Best,
Anna

________________________________
From: Victor Eijkhout <[email protected]>
Sent: Thursday, February 1, 2024 11:26 AM
To: Junchao Zhang <[email protected]>; Yesypenko, Anna <[email protected]>
Cc: [email protected]
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node

Only for mvapich2-gdr:

#!/bin/bash
# Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
case $MV2_COMM_WORLD_LOCAL_RANK in
  [0]) cpus=0-3   ;;
  [1]) cpus=64-67 ;;
  [2]) cpus=72-75 ;;
esac
numactl --physcpubind=$cpus "$@"
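The same per-rank GPU selection can also be done from inside Python, as long as CUDA_VISIBLE_DEVICES is set before any CUDA context is created (i.e. before importing petsc4py). A minimal sketch, assuming mvapich2's MV2_COMM_WORLD_LOCAL_RANK (other MPIs export different names, e.g. OMPI_COMM_WORLD_LOCAL_RANK for Open MPI):

```python
import os

# Pin this rank to one GPU: read the local rank from the MPI launcher's
# environment and expose only that device to CUDA. This must run before
# petsc4py (or anything CUDA-backed) is imported.
local_rank = os.environ.get(
    "MV2_COMM_WORLD_LOCAL_RANK",
    os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"),
)
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = local_rank

# Only now initialize PETSc, e.g.:
# import petsc4py; petsc4py.init()
```

Within the Python process each rank then sees a single device as device 0, which avoids relying on every library honoring an external binding script.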
