Hi Junchao,
Thank you for your suggestion. You're right that binding MPI ranks to GPUs
seems to be the issue.
I looked at the TACC documentation, and I'm not sure it describes a way to bind
GPUs to MPI ranks.
Instead, I'm trying to set the CUDA_VISIBLE_DEVICES environment variable
according to the MPI rank.
This now works some of the time! The environment variables are set properly,
but the run still fails with the same error about half the time.
How do I know that hypre is binding MPI ranks to GPUs properly? The error
originates from a call to hypre.
I also tried to set the environment variable (using mpi4py) before importing
PETSc, but this doesn't seem to work either.
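For reference, here is a minimal sketch of what setting the variable before any PETSc import could look like. It assumes the launcher exports a node-local rank in one of the variables listed below (MVAPICH2, Open MPI, and Slurm each use their own name); whether TACC's launcher sets any of them is an assumption, not something confirmed in this thread. The helper names `local_rank_from_env` and `bind_gpu` are hypothetical.

```python
import os

# Variables that may hold the node-local rank. Which one (if any) is set
# depends on the MPI launcher, so this list is an assumption.
LOCAL_RANK_VARS = (
    "MV2_COMM_WORLD_LOCAL_RANK",   # MVAPICH2
    "OMPI_COMM_WORLD_LOCAL_RANK",  # Open MPI
    "SLURM_LOCALID",               # Slurm (srun)
)

def local_rank_from_env(env):
    """Return the node-local rank found in env, or None if no known variable is set."""
    for name in LOCAL_RANK_VARS:
        value = env.get(name)
        if value is not None:
            return int(value)
    return None

def bind_gpu(env, gpus_per_node):
    """Set CUDA_VISIBLE_DEVICES in env from the local rank (round-robin).

    Returns the chosen device id, or None if the local rank is unknown,
    in which case env is left untouched.
    """
    rank = local_rank_from_env(env)
    if rank is None:
        return None
    device = rank % gpus_per_node
    env["CUDA_VISIBLE_DEVICES"] = str(device)
    return device

# Intended use, at the very top of the script and before petsc4py is imported:
#   bind_gpu(os.environ, gpus_per_node=3)
#   import petsc4py; petsc4py.init(sys.argv)
```

The point of reading the rank from the environment rather than from the communicator is that no MPI or CUDA initialization has to happen before the variable is set.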
Here is the preamble I added to the top of the script. I'm running on a single
node with 3 GPUs.
``
import numpy,petsc4py,sys,os,time
from time import time
petsc4py.init(sys.argv)
from petsc4py import PETSc
comm = PETSc.COMM_WORLD
os.environ['CUDA_VISIBLE_DEVICES'] = "%d" % comm.Get_rank()
PETSc.Sys.syncPrint("\t Processor %d of %d gets GPU %d"%\
(comm.Get_rank(),comm.Get_size(),comm.Get_rank()),comm=comm,flush=True)
comm.Barrier()
### Petsc Matrix initialization here
### I confirm that the matrix is partitioned into indices as I expect
PETSc.Sys.syncPrint("\t Processor %d with GPU %s gets indices %d:%d"\
%(comm.Get_rank(),os.environ['CUDA_VISIBLE_DEVICES'],rstart,rend),flush=True,comm=comm)
``
When the script fails, I get the following stack trace.
``
TACC: Starting up job 1491828
TACC: Setting up parallel environment for MVAPICH2+mpispawn.
TACC: Starting parallel tasks...
Processor 0 of 3 gets GPU 0
Processor 1 of 3 gets GPU 1
Processor 2 of 3 gets GPU 2
Processor 0 with GPU 0 gets indices 0:166667
Processor 1 with GPU 1 gets indices 166667:333334
Processor 2 with GPU 2 gets indices 333334:500000
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
[0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
[0]PETSC ERROR: The line numbers in the error traceback are not always exact.
[0]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
[0]PETSC ERROR: #2 MatBindToCPU_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1394
[0]PETSC ERROR: #3 MatAssemblyEnd_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1471
[0]PETSC ERROR: #4 MatAssemblyEnd() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:5773
[0]PETSC ERROR: #5 MatConvert_AIJ_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:660
[0]PETSC ERROR: #6 MatConvert() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:4421
[0]PETSC ERROR: #7 PCSetUp_HYPRE() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/impls/hypre/hypre.c:245
[0]PETSC ERROR: #8 PCSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/interface/precon.c:1080
[0]PETSC ERROR: #9 KSPSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:415
[0]PETSC ERROR: #10 KSPSolve_Private() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:833
[0]PETSC ERROR: #11 KSPSolve() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:1080
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
``
________________________________
From: Junchao Zhang <[email protected]>
Sent: Wednesday, January 31, 2024 5:36 PM
To: Yesypenko, Anna <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a
node
Hi Anna,
Since you said "The code works with pc-type hypre on a single GPU.", I was
wondering whether this is a problem with binding CUDA devices to MPI ranks.
You can search the TACC documentation to find out how its job scheduler binds
GPUs to MPI ranks (usually by manipulating the CUDA_VISIBLE_DEVICES environment
variable).
Please follow up if you cannot solve it.
Thanks.
--Junchao Zhang
On Wed, Jan 31, 2024 at 4:07 PM Yesypenko, Anna <[email protected]> wrote:
Dear PETSc devs,
I'm encountering an error running hypre on a single node with multiple GPUs.
The issue is in the setup phase. I'm trying to troubleshoot, but don't know
where to start.
Are the system routines PetscCUDAInitialize and PetscCUDAInitializeCheck
available in Python?
How do I verify that GPUs are assigned properly to each MPI process? In this
case, I have 3 tasks and 3 GPUs.
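One way to check the assignment from the script itself, sketched under assumptions: each rank reports its hostname and its CUDA_VISIBLE_DEVICES value, and the gathered records are scanned for devices claimed by more than one rank on the same host. The helper names `describe_self` and `find_gpu_conflicts` are hypothetical, and the check only inspects the environment variable; it does not prove which device hypre actually uses.

```python
import os
import socket
from collections import defaultdict

def describe_self(rank, env=os.environ):
    """Return (rank, hostname, CUDA_VISIBLE_DEVICES string) for the calling process."""
    return (rank, socket.gethostname(), env.get("CUDA_VISIBLE_DEVICES", ""))

def find_gpu_conflicts(records):
    """Find devices that more than one rank on the same host can see.

    records is an iterable of (rank, host, visible) tuples, where visible is
    a comma-separated device list. Returns {(host, device): [ranks]} holding
    only the entries shared by two or more ranks.
    """
    users = defaultdict(list)
    for rank, host, visible in records:
        for device in visible.split(","):
            device = device.strip()
            if device:  # skip empty entries from an unset or blank variable
                users[(host, device)].append(rank)
    return {key: ranks for key, ranks in users.items() if len(ranks) > 1}
```

With mpi4py, the records could be collected with `MPI.COMM_WORLD.allgather(describe_self(MPI.COMM_WORLD.Get_rank()))` and passed to `find_gpu_conflicts`; an empty result means no two ranks on a node were given the same device.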
The code works with pc-type hypre on a single GPU.
Any suggestions are appreciated!
Below is the error trace:
``
TACC: Starting up job 1490124
TACC: Setting up parallel environment for MVAPICH2+mpispawn.
TACC: Starting parallel tasks...
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
[0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
[0]PETSC ERROR: The line numbers in the error traceback are not always exact.
[0]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
[0]PETSC ERROR: #2 MatBindToCPU_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1394
[0]PETSC ERROR: #3 MatAssemblyEnd_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1471
[0]PETSC ERROR: #4 MatAssemblyEnd() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:5773
[0]PETSC ERROR: #5 MatConvert_AIJ_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:660
[0]PETSC ERROR: #6 MatConvert() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:4421
[0]PETSC ERROR: #7 PCSetUp_HYPRE() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/impls/hypre/hypre.c:245
[0]PETSC ERROR: #8 PCSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/interface/precon.c:1080
[0]PETSC ERROR: #9 KSPSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:415
[0]PETSC ERROR: #10 KSPSolve_Private() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:833
[0]PETSC ERROR: #11 KSPSolve() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:1080
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
``
Below is a minimal working example:
``
import numpy,petsc4py,sys,time
petsc4py.init(sys.argv)
from petsc4py import PETSc
from time import time
n = int(5e5);
comm = PETSc.COMM_WORLD
pA = PETSc.Mat(comm=comm)
pA.create(comm=comm)
pA.setSizes((n,n))
pA.setType(PETSc.Mat.Type.AIJ)
pA.setPreallocationNNZ(3)
rstart,rend=pA.getOwnershipRange()
print("\t Processor %d of %d gets indices %d:%d"%(comm.Get_rank(),comm.Get_size(),rstart,rend))
if (rstart == 0):
    pA.setValue(0,0,2); pA.setValue(0,1,-1)
if (rend == n):
    pA.setValue(n-1,n-2,-1); pA.setValue(n-1,n-1,2)
for index in range(rstart,rend):
    if (rstart > 0):
        pA.setValue(index,index-1,-1)
    pA.setValue(index,index,2)
    if (rend < n):
        pA.setValue(index,index+1,-1)
pA.assemble()
pA = pA.convert(mat_type='aijcusparse')
px,pb = pA.createVecs()
pb.set(1.0); px.set(1.0)
ksp = PETSc.KSP().create()
ksp.setOperators(pA)
ksp.setConvergenceHistory()
ksp.setType('cg')
ksp.getPC().setType('hypre')
ksp.setTolerances(rtol=1e-10)
ksp.solve(pb, px) # error is generated here
``
Best,
Anna