So I put in a temporary hack to get the first Fieldsplit apply to NOT use OMP and it sort of works.
Preonly/lu is fine. GMRES calls vector creates/dups in every solve so that is a big problem. Richardson works except the convergence test gets confused, presumably because MPI reductions with PETSC_COMM_SELF is not threadsafe. One fix for the norms might be to create each subdomain solver with a different communicator. I could live with this for my project but maybe I should take a shot at something better. Any thoughts? On Wed, Jan 20, 2021 at 2:28 PM Mark Adams <[email protected]> wrote: > Barry, Here is a stack trace. [moving this to dev] > > I am thinking that KSPSetUp must be called before the OMP loop. That is > fine with me and I think it is a reasonable general model. > > But will it work? > > To be specific, just looking at LU I see that SetUp does the numerical > factorization and the PCSetUp interface skips this after it is called once. > I would guess that Solve will do the factorization (in the OMP thread/GPU) > if the data changes (MatAssembly has been called). And that could work. > > Doing the factorization on the CPU, in serial, the first time is what I do > with GPU assembly also. Not perfect but if you can amortize then its OK. > Then having all subsequent KSP solves work in OMP would be OK. > > So I'm going to give this model a try and see where I get. Comments > welcome. > > This is a messy abstraction in general, for instance, you can separated > #16 gomp_thread_start (xdata=<optimized out>) at > /sw/summit/gcc/6.4.0/src/gcc-6.4.0/libgomp/team.c:119 (at > 0x0000200020c4a51c) > #15 PCApply_FieldSplit._omp_fn.0 () at > /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/fieldsplit/fieldsplit.c:1285 > (at 0x0000200000fb2584) > #14 PetscFSOMPSolve (y=<optimized out>, x=<optimized out>, > ilink=0x763ee450) at > /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/fieldsplit/fieldsplit.c:1238 > (at 0x0000200000fb2584) > #13 KSPSolve (ksp=<optimized out>, b=<optimized out>, x=<optimized out>) > at > /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:963 > (at 0x000020000105af78) > #12 KSPSolve_Private (ksp=0x763ef010, b=0x763f7fb0, x=<optimized out>) at > /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:726 > (at 0x00002000010586b0) > #11 KSPSetUp (ksp=0x763ef010) at > /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:365 > (at 0x000020000105636c) > #10 KSPSetUp_GMRES (ksp=0x763ef010) at > /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/impls/gmres/gmres.c:85 > (at 0x00002000010e2168) > #9 KSPCreateVecs (ksp=0x763ef010, rightn=<optimized out>, > right=0x200054007bb0, leftn=<optimized out>, left=0x0) at > /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/iterativ.c:971 > (at 0x0000200001063610) > #8 VecDuplicateVecs (v=<optimized out>, m=<optimized out>, V=<optimized > out>) at > /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vector.c:439 > (at 0x00002000003d3f58) > #7 VecDuplicateVecs_Default (w=0x76400200, m=<optimized out>, > V=0x200054007bb0) at > /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vector.c:825 > (at 0x00002000003d4500) > #6 VecDuplicate (v=<optimized out>, newv=<optimized out>) at > /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vector.c:369 > (at 0x00002000003d3d14) > #5 VecDuplicate_SeqCUDA (win=0x76400200, V=0x200054007d70) at > /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:222 > (at 0x00002000003c65b0) > #4 VecCreateSeqCUDA (comm=<optimized out>, n=<optimized out>, > v=0x200054007d70) at > /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:213 > (at 0x00002000003c64a4) > #3 VecSetType (vec=0x200054007db0, method=0x20000132f308 "seqcuda") at > /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vecreg.c:91 > (at 0x00002000003d7f90) > #2 VecCreate_SeqCUDA (V=0x200054007db0) at > /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:237 > (at 0x00002000003c6c2c) > #1 VecCUDAAllocateCheck (v=0x200054007db0) at > /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/ > veccuda2.cu:53 (at 0x000020000072b698) > #0 PetscOptionsEnd_Private (PetscOptionsObject=0x20003c712078) at > /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/aoptions.c:521 (at > 0x000020000026e880) > > On Tue, Jan 19, 2021 at 10:46 AM Mark Adams <[email protected]> wrote: > >> Well, Summit is down for the day so I moved to NERSc and >> OMP/FieldSplit/cuSparse/[ILU,LU] seem to be working there. I added >> PetscInfo lines, below, and the answers with 10 species/threads are perfect. >> >> It looks like OMP threads are serialized (see below) but in random >> order. You can see that field 0 ("e") calls solve 14 times and converged >> in 13 iterations. >> >> I'm not sure what to make of this. I think I'll try to see if I can see >> any difference in total run time with 1 and 10 OMP threads. >> >> 9 SNES Function norm 4.741380472654e-13 >> [0] MatCUSPARSEGetDeviceMatWrite(): Assemble more than once already >> [0] PCSetUp(): Setting up PC with same nonzero pattern >> [0] PCApply_FieldSplit(): thread 2 in field 2 >> [0] PCSetUp(): Setting up PC with same nonzero pattern >> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO) >> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO) >> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO) >> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO) >> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO) >> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO) >> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO) >> [0] PCApply_FieldSplit(): thread 7 in field 7 >> [0] PCSetUp(): Setting up PC with same nonzero pattern >> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO) >> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO) >> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO) >> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO) >> .... >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> *[0] PCApply_FieldSplit(): thread 0 in field 0[0] PCSetUp(): Setting up >> PC with same nonzero pattern[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): >> Cuda solve (NO)[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve >> (NO)[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0] >> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0] >> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0] >> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0] >> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0] >> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0] >> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0] >> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0] >> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0] >> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0] >> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0] >> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO) Linear >> fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 13* >> [0] PCApply_FieldSplit(): thread 5 in field 5 >> [0] PCSetUp(): Setting up PC with same nonzero pattern >> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO) >> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO) >> ... >> >> On Tue, Jan 19, 2021 at 8:07 AM Mark Adams <[email protected]> wrote: >> >>> >>> >>> On Mon, Jan 18, 2021 at 11:06 PM Barry Smith <[email protected]> wrote: >>> >>>> >>>> Can valgrind run and help with OpenMP? >>>> >>> >>> I am pretty sure. There is also cuda-memcheck that has the same >>> semantics that works on GPU code, but I'm not sure how good it is for CPU >>> code. >>> >>> >>>> >>>> You can run in the debugger and find any calls to the options >>>> checking inside your code block and comment them all out to see if that >>>> eliminates the problem. >>>> >>> >>> The stack trace does give me the method that it calls the fatal Free in, >>> so I will try a breakpoint in there. DDT does work with threads but not GPU >>> code. >>> >>> >>>> >>>> Also generically how safe is CUDA inside OpenMP? That is with >>>> multiple threads calling CUDA stuff? >>>> >>> >>> I recall that the XGC code, which has a lot of OMP, Cuda (and Kokkos) >>> does this. Not 100% sure. >>> >>> I know that they recently had to tear out some OMP loops that they >>> Kokkos'ized because they had some problem mixing Kokkos-OMP and Cuda so >>> they reverted back to pure OMP. >>> >>> >>>> >>>> >>>> Barry >>>> >>>> >>>> >>>> On Jan 18, 2021, at 7:04 PM, Mark Adams <[email protected]> wrote: >>>> >>>> >>>> Added this w/o luck: >>>> >>>> #if defined(PETSC_HAVE_CUDA) >>>> ierr = PetscOptionsCheckCUDA(logView);CHKERRQ(ierr); >>>> #if defined(PETSC_HAVE_THREADSAFETY) >>>> ierr = PetscCUPMInitializeCheck();CHKERRQ(ierr); >>>> #endif >>>> #endif >>>> >>>> Do you think I should keep this in or take it out? Seems like a good >>>> idea and when it all works we can see if we can make it lazy. >>>> >>>> 1) Calling PetscOptions inside threads. I looked quickly at the code >>>>> and it seems like it should be ok but perhaps not. This is one reason why >>>>> having stuff like PetscOptionsBegin inside a low-level creation >>>>> VecCreate_SeqCUDA_Private is normally not done in PETSc. Eventually this >>>>> needs to be moved or reworked. >>>>> >>>>> >>>> I will try this next. It is hard to see the stack here. I think I will >>>> put it in ddt and put a breakpoint PetscOptionsEnd_Private. Other ideas >>>> welcome. >>>> >>>> Mark >>>> >>>> >>>>> 2) PetscCUDAInitializeCheck is not thread safe.If it is being call for >>>>> the first timeby multiple threads there can be trouble. So edit init.c and >>>>> under >>>>> >>>>> #if defined(PETSC_HAVE_CUDA) >>>>> ierr = PetscOptionsCheckCUDA(logView);CHKERRQ(ierr); >>>>> #endif >>>>> >>>>> #if defined(PETSC_HAVE_HIP) >>>>> ierr = PetscOptionsCheckHIP(logView);CHKERRQ(ierr); >>>>> #endif >>>>> >>>>> put in >>>>> #if defined thread safety >>>>> PetscCUPMInitializeCheck >>>>> #endif >>>>> >>>>> this will force the initialize to be done before any threads are used >>>>> >>>> >>>>> >>>>
