Pierre, thanks for the additional information. This is scary. If this is really an OpenMPI bug, we don't have a PETSc test that catches it, and we currently have no clue what went wrong.
--Junchao Zhang

On Wed, Sep 18, 2024 at 5:09 AM LEDAC Pierre <[email protected]> wrote:

> Junchao, I just tried the latest PETSc from the main branch with OpenMPI
> 4.1.5-cuda. I still get KSP_DIVERGED on some parallel calculations. It is
> solved when moving to OpenMPI 5.0.5-cuda.
>
> Pierre LEDAC
> Commissariat à l’énergie atomique et aux énergies alternatives
> Centre de SACLAY
> DES/ISAS/DM2S/SGLS/LCAN
> Bâtiment 451 – point courrier n°43
> F-91191 Gif-sur-Yvette
> +33 1 69 08 04 03
> +33 6 83 42 05 79
> ------------------------------
> *From:* LEDAC Pierre
> *Sent:* Tuesday, September 17, 2024 19:43:32
> *To:* Junchao Zhang
> *Cc:* petsc-users; ROUMET Elie
> *Subject:* RE: [petsc-users] [MPI GPU Aware] KSP_DIVERGED
>
> Yes. Only OpenMPI 5.0.5 with PETSc 3.20.
>
> ------------------------------
> *From:* Junchao Zhang <[email protected]>
> *Sent:* Tuesday, September 17, 2024 18:09:44
> *To:* LEDAC Pierre
> *Cc:* petsc-users; ROUMET Elie
> *Subject:* Re: [petsc-users] [MPI GPU Aware] KSP_DIVERGED
>
> Did you "fix" the problem with OpenMPI 5 but keep PETSc unchanged (i.e.,
> still 3.20)?
>
> --Junchao Zhang
>
> On Tue, Sep 17, 2024 at 9:47 AM LEDAC Pierre <[email protected]> wrote:
>
>> Thanks Satish, and nice guess for OpenMPI 5!
>>
>> It seems to solve the issue (at least on my GPU box, where I reproduced
>> the issue with 8 MPI ranks with OpenMPI 4.x).
>>
>> Unfortunately, none of the clusters we currently use has a module with
>> OpenMPI 5.x, so it seems I need to build it myself to really confirm.
>>
>> We will probably prevent users from configuring our code with
>> OpenMPI-cuda 4.x, because it is a really weird bug.
>>
>> ------------------------------
>> *From:* Satish Balay <[email protected]>
>> *Sent:* Tuesday, September 17, 2024 15:39:22
>> *To:* LEDAC Pierre
>> *Cc:* Junchao Zhang; petsc-users; ROUMET Elie
>> *Subject:* Re: [petsc-users] [MPI GPU Aware] KSP_DIVERGED
>>
>> On Tue, 17 Sep 2024, LEDAC Pierre wrote:
>>
>> > Thanks all, I will try and report.
>> >
>> > Last question: if I use the "-use_gpu_aware_mpi 0" flag with a
>> > GPU-aware MPI library, does PETSc disable GPU intra/inter-node
>> > communication and send MPI buffers as usual (with extra
>> > device<->host copies)?
>>
>> Yes.
>>
>> Note: wrt using MPI that is not GPU-aware - we are changing the default
>> behavior, to not require the "-use_gpu_aware_mpi 0" flag.
>>
>> https://gitlab.com/petsc/petsc/-/merge_requests/7813
>>
>> Satish
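A minimal way to see what that "Yes" means in practice - a sketch only, assuming a CUDA-enabled PETSc build; the file name vecdot.c is just for illustration - is a parallel dot product on device vectors. The VecDot below ends in an MPI_Allreduce; with a GPU-aware MPI, PETSc can hand the device buffers directly to MPI, while with -use_gpu_aware_mpi 0 it copies them to the host first:

    /* vecdot.c - run e.g.: mpirun -n 8 ./vecdot -use_gpu_aware_mpi 0 */
    #include <petscvec.h>

    int main(int argc, char **argv)
    {
      Vec         x, y;
      PetscScalar dot;

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
      PetscCall(VecCreate(PETSC_COMM_WORLD, &x));
      PetscCall(VecSetSizes(x, PETSC_DECIDE, 64000));
      PetscCall(VecSetType(x, VECCUDA));  /* vector data lives on the GPU */
      PetscCall(VecDuplicate(x, &y));
      PetscCall(VecSet(x, 1.0));
      PetscCall(VecSet(y, 2.0));
      PetscCall(VecDot(x, y, &dot));      /* MPI_Allreduce under the hood */
      PetscCall(PetscPrintf(PETSC_COMM_WORLD, "dot = %g\n",
                            (double)PetscRealPart(dot)));
      PetscCall(VecDestroy(&x));
      PetscCall(VecDestroy(&y));
      PetscCall(PetscFinalize());
      return 0;
    }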
>> > Thanks,
>> >
>> > Pierre LEDAC
>> > ________________________________
>> > From: Satish Balay <[email protected]>
>> > Sent: Monday, September 16, 2024 18:57:02
>> > To: Junchao Zhang
>> > Cc: LEDAC Pierre; [email protected]; ROUMET Elie
>> > Subject: Re: [petsc-users] [MPI GPU Aware] KSP_DIVERGED
>> >
>> > And/or - try the latest OpenMPI [or MPICH] and see if that makes a
>> > difference.
>> >
>> > --download-mpich or --download-openmpi with the latest PETSc should
>> > build a GPU-aware MPI.
>> >
>> > Satish
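As a concrete sketch of such a build (exact flags depend on the machine; --with-cuda is assumed here to get the CUDA-enabled, GPU-aware MPI that Satish describes):

    ./configure --with-cuda --download-openmpi
    make all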
>> > On Mon, 16 Sep 2024, Junchao Zhang wrote:
>> >
>> > > Could you try petsc/main to see if the problem persists?
>> > >
>> > > --Junchao Zhang
>> > >
>> > > On Mon, Sep 16, 2024 at 10:51 AM LEDAC Pierre <[email protected]> wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > We are using PETSc 3.20 in our code and successfully running several
>> > > > solvers on NVIDIA GPUs with an OpenMPI library that is not GPU-aware
>> > > > (so I need to add the flag -use_gpu_aware_mpi 0).
>> > > >
>> > > > But now, when using a GPU-aware OpenMPI library (OpenMPI 4.0.5 or
>> > > > 4.1.5 from NVHPC), some parallel calculations fail with
>> > > > *KSP_DIVERGED_ITS* or *KSP_DIVERGED_DTOL* in several configurations.
>> > > > A small test case (the matrix is symmetric) may run well with:
>> > > >
>> > > > *-ksp_type cg -pc_type gamg -pc_gamg_type classical*
>> > > >
>> > > > But with a number of devices bigger than, for instance, 4 or 8, it
>> > > > may suddenly fail.
>> > > >
>> > > > If I switch to another solver (BiCGStab), it may converge:
>> > > >
>> > > > *-ksp_type bcgs -pc_type gamg -pc_gamg_type classical*
>> > > >
>> > > > The most sensitive cases, where it diverges, are the following:
>> > > >
>> > > > *-ksp_type cg -pc_type hypre -pc_hypre_type boomeramg*
>> > > > *-ksp_type cg -pc_type gamg -pc_gamg_type classical*
>> > > >
>> > > > And the *bcgs* workaround doesn't work every time...
>> > > >
>> > > > It seems to work without problems with aggregation (at least up to
>> > > > 128 GPUs in my simulation):
>> > > >
>> > > > *-ksp_type cg -pc_type gamg -pc_gamg_type agg*
>> > > >
>> > > > So I guess something weird is happening in my code during the PETSc
>> > > > solve with GPU-aware MPI, as all the previous configurations work
>> > > > with non-GPU-aware MPI.
>> > > >
>> > > > Here is the -ksp_view log during one failure with the first
>> > > > configuration:
>> > > >
>> > > > KSP Object: () 8 MPI processes
>> > > >   type: cg
>> > > >   maximum iterations=10000, nonzero initial guess
>> > > >   tolerances: relative=0., absolute=0.0001, divergence=10000.
>> > > >   left preconditioning
>> > > >   using UNPRECONDITIONED norm type for convergence test
>> > > > PC Object: () 8 MPI processes
>> > > >   type: hypre
>> > > >     HYPRE BoomerAMG preconditioning
>> > > >       Cycle type V
>> > > >       Maximum number of levels 25
>> > > >       Maximum number of iterations PER hypre call 1
>> > > >       Convergence tolerance PER hypre call 0.
>> > > >       Threshold for strong coupling 0.7
>> > > >       Interpolation truncation factor 0.
>> > > >       Interpolation: max elements per row 0
>> > > >       Number of levels of aggressive coarsening 0
>> > > >       Number of paths for aggressive coarsening 1
>> > > >       Maximum row sums 0.9
>> > > >       Sweeps down         1
>> > > >       Sweeps up           1
>> > > >       Sweeps on coarse    1
>> > > >       Relax down          l1scaled-Jacobi
>> > > >       Relax up            l1scaled-Jacobi
>> > > >       Relax on coarse     Gaussian-elimination
>> > > >       Relax weight  (all)      1.
>> > > >       Outer relax weight (all) 1.
>> > > >       Maximum size of coarsest grid 9
>> > > >       Minimum size of coarsest grid 1
>> > > >       Not using CF-relaxation
>> > > >       Not using more complex smoothers.
>> > > >       Measure type        local
>> > > >       Coarsen type        PMIS
>> > > >       Interpolation type  ext+i
>> > > >       SpGEMM type         cusparse
>> > > >   linear system matrix = precond matrix:
>> > > >   Mat Object: () 8 MPI processes
>> > > >     type: mpiaijcusparse
>> > > >     rows=64000, cols=64000
>> > > >     total: nonzeros=311040, allocated nonzeros=311040
>> > > >     total number of mallocs used during MatSetValues calls=0
>> > > >       not using I-node (on process 0) routines
>> > > >
>> > > > I have not yet succeeded in creating a reproducer with the ex*.c
>> > > > examples...
>> > > >
>> > > > Have you seen this kind of behaviour before?
>> > > > Should I update my PETSc version?
>> > > >
>> > > > Thanks for any advice,
>> > > >
>> > > > Pierre LEDAC
>> > > > Commissariat à l’énergie atomique et aux énergies alternatives
>> > > > Centre de SACLAY
>> > > > DES/ISAS/DM2S/SGLS/LCAN
>> > > > Bâtiment 451 – point courrier n°43
>> > > > F-91191 Gif-sur-Yvette
>> > > > +33 1 69 08 04 03
>> > > > +33 6 83 42 05 79
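As a starting point for such a reproducer, here is an untested sketch in the style of the ex*.c tutorials (the 1-D Laplacian is only a stand-in for the real operator; it is SPD, so CG applies). Run with, e.g., mpirun -n 8 ./reproducer -ksp_type cg -pc_type gamg -pc_gamg_type classical -mat_type aijcusparse -ksp_converged_reason:

    /* reproducer.c - assemble an SPD matrix and solve with options
       taken from the command line; -mat_type aijcusparse puts the
       matrix (and the vectors derived from it) on the GPU */
    #include <petscksp.h>

    int main(int argc, char **argv)
    {
      Mat      A;
      Vec      x, b;
      KSP      ksp;
      PetscInt n = 64000, Istart, Iend;

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
      PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
      PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
      PetscCall(MatSetFromOptions(A));  /* honors -mat_type aijcusparse */
      PetscCall(MatSetUp(A));
      PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
      for (PetscInt i = Istart; i < Iend; i++) {  /* 1-D Laplacian stencil */
        if (i > 0)     PetscCall(MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES));
        if (i < n - 1) PetscCall(MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES));
        PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
      }
      PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
      PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
      PetscCall(MatCreateVecs(A, &x, &b));  /* vectors match the matrix type */
      PetscCall(VecSet(b, 1.0));
      PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
      PetscCall(KSPSetOperators(ksp, A, A));
      PetscCall(KSPSetFromOptions(ksp));  /* -ksp_type / -pc_type / -pc_gamg_type */
      PetscCall(KSPSolve(ksp, b, x));
      PetscCall(KSPDestroy(&ksp));
      PetscCall(MatDestroy(&A));
      PetscCall(VecDestroy(&x));
      PetscCall(VecDestroy(&b));
      PetscCall(PetscFinalize());
      return 0;
    }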
