Re: [petsc-dev] Memory problem with OpenMP and Fieldsplit sub solvers

Mark Adams Wed, 20 Jan 2021 13:09:51 -0800

So I put in a temporary hack to get the first Fieldsplit apply to NOT use
OMP and it sort of works.


Preonly/lu is fine. GMRES calls vector creates/dups in every solve so that
is a big problem. Richardson works except the convergence test gets
confused, presumably because MPI reductions with PETSC_COMM_SELF is not
threadsafe.

One fix for the norms might be to create each subdomain solver with a
different communicator.

I could live with this for my project but maybe I should take a shot at
something better.

Any thoughts?


On Wed, Jan 20, 2021 at 2:28 PM Mark Adams <[email protected]> wrote:

> Barry, Here is a stack trace. [moving this to dev]
>
> I am thinking that KSPSetUp must be called before the OMP loop. That is
> fine with me and I think it is a reasonable general model.
>
> But will it work?
>
> To be specific, just looking at LU I see that SetUp does the numerical
> factorization and the PCSetUp interface skips this after it is called once.
> I would guess that Solve will do the factorization (in the OMP thread/GPU)
> if the data changes (MatAssembly has been called). And that could work.
>
> Doing the factorization on the CPU, in serial, the first time is what I do
> with GPU assembly also. Not perfect but if you can amortize then its OK.
> Then having all subsequent KSP solves work in OMP would be OK.
>
> So I'm going to give this model a try and see where I get. Comments
> welcome.
>
> This is a messy abstraction in general, for instance, you can separated
> #16 gomp_thread_start (xdata=<optimized out>) at
> /sw/summit/gcc/6.4.0/src/gcc-6.4.0/libgomp/team.c:119 (at
> 0x0000200020c4a51c)
> #15 PCApply_FieldSplit._omp_fn.0 () at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/fieldsplit/fieldsplit.c:1285
> (at 0x0000200000fb2584)
> #14 PetscFSOMPSolve (y=<optimized out>, x=<optimized out>,
> ilink=0x763ee450) at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/fieldsplit/fieldsplit.c:1238
> (at 0x0000200000fb2584)
> #13 KSPSolve (ksp=<optimized out>, b=<optimized out>, x=<optimized out>)
> at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:963
> (at 0x000020000105af78)
> #12 KSPSolve_Private (ksp=0x763ef010, b=0x763f7fb0, x=<optimized out>) at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:726
> (at 0x00002000010586b0)
> #11 KSPSetUp (ksp=0x763ef010) at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:365
> (at 0x000020000105636c)
> #10 KSPSetUp_GMRES (ksp=0x763ef010) at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/impls/gmres/gmres.c:85
> (at 0x00002000010e2168)
> #9 KSPCreateVecs (ksp=0x763ef010, rightn=<optimized out>,
> right=0x200054007bb0, leftn=<optimized out>, left=0x0) at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/iterativ.c:971
> (at 0x0000200001063610)
> #8 VecDuplicateVecs (v=<optimized out>, m=<optimized out>, V=<optimized
> out>) at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vector.c:439
> (at 0x00002000003d3f58)
> #7 VecDuplicateVecs_Default (w=0x76400200, m=<optimized out>,
> V=0x200054007bb0) at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vector.c:825
> (at 0x00002000003d4500)
> #6 VecDuplicate (v=<optimized out>, newv=<optimized out>) at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vector.c:369
> (at 0x00002000003d3d14)
> #5 VecDuplicate_SeqCUDA (win=0x76400200, V=0x200054007d70) at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:222
> (at 0x00002000003c65b0)
> #4 VecCreateSeqCUDA (comm=<optimized out>, n=<optimized out>,
> v=0x200054007d70) at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:213
> (at 0x00002000003c64a4)
> #3 VecSetType (vec=0x200054007db0, method=0x20000132f308 "seqcuda") at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vecreg.c:91
> (at 0x00002000003d7f90)
> #2 VecCreate_SeqCUDA (V=0x200054007db0) at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:237
> (at 0x00002000003c6c2c)
> #1 VecCUDAAllocateCheck (v=0x200054007db0) at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/
> veccuda2.cu:53 (at 0x000020000072b698)
> #0 PetscOptionsEnd_Private (PetscOptionsObject=0x20003c712078) at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/aoptions.c:521 (at
> 0x000020000026e880)
>
> On Tue, Jan 19, 2021 at 10:46 AM Mark Adams <[email protected]> wrote:
>
>> Well, Summit is down for the day so I moved to NERSc and
>> OMP/FieldSplit/cuSparse/[ILU,LU] seem to be working there. I added
>> PetscInfo lines, below, and the answers with 10 species/threads are perfect.
>>
>> It looks like OMP threads are serialized (see below) but in random
>> order.  You can see that field 0 ("e") calls solve 14 times and converged
>> in 13 iterations.
>>
>> I'm not sure what to make of this. I think I'll try to see if I can see
>> any difference in total run time with 1 and 10 OMP threads.
>>
>>    9 SNES Function norm 4.741380472654e-13
>> [0] MatCUSPARSEGetDeviceMatWrite(): Assemble more than once already
>> [0] PCSetUp(): Setting up PC with same nonzero pattern
>> [0] PCApply_FieldSplit(): thread 2 in field 2
>> [0] PCSetUp(): Setting up PC with same nonzero pattern
>> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>> [0] PCApply_FieldSplit(): thread 7 in field 7
>> [0] PCSetUp(): Setting up PC with same nonzero pattern
>> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>>   ....
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *[0] PCApply_FieldSplit(): thread 0 in field 0[0] PCSetUp(): Setting up
>> PC with same nonzero pattern[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering():
>> Cuda solve (NO)[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve
>> (NO)[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0]
>> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0]
>> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0]
>> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0]
>> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0]
>> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0]
>> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0]
>> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0]
>> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0]
>> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0]
>> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)[0]
>> MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)      Linear
>> fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 13*
>> [0] PCApply_FieldSplit(): thread 5 in field 5
>> [0] PCSetUp(): Setting up PC with same nonzero pattern
>> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>>  ...
>>
>> On Tue, Jan 19, 2021 at 8:07 AM Mark Adams <[email protected]> wrote:
>>
>>>
>>>
>>> On Mon, Jan 18, 2021 at 11:06 PM Barry Smith <[email protected]> wrote:
>>>
>>>>
>>>>   Can valgrind run and help with OpenMP?
>>>>
>>>
>>> I am pretty sure. There is also cuda-memcheck that has the same
>>> semantics that works on GPU code, but I'm not sure how good it is for CPU
>>> code.
>>>
>>>
>>>>
>>>>   You can run in the debugger and find any calls to the options
>>>> checking inside your code block and comment them all out to see if that
>>>> eliminates the problem.
>>>>
>>>
>>> The stack trace does give me the method that it calls the fatal Free in,
>>> so I will try a breakpoint in there. DDT does work with threads but not GPU
>>> code.
>>>
>>>
>>>>
>>>>   Also generically how safe is CUDA inside OpenMP? That is with
>>>> multiple threads calling CUDA stuff?
>>>>
>>>
>>> I recall that the XGC code, which has a lot of OMP, Cuda (and Kokkos)
>>> does this. Not 100% sure.
>>>
>>> I know that they recently had to tear out some OMP loops that they
>>> Kokkos'ized because they had some problem mixing Kokkos-OMP and Cuda so
>>> they reverted back to pure OMP.
>>>
>>>
>>>>
>>>>
>>>>   Barry
>>>>
>>>>
>>>>
>>>> On Jan 18, 2021, at 7:04 PM, Mark Adams <[email protected]> wrote:
>>>>
>>>>
>>>> Added this w/o luck:
>>>>
>>>> #if defined(PETSC_HAVE_CUDA)
>>>>   ierr = PetscOptionsCheckCUDA(logView);CHKERRQ(ierr);
>>>> #if defined(PETSC_HAVE_THREADSAFETY)
>>>>   ierr = PetscCUPMInitializeCheck();CHKERRQ(ierr);
>>>> #endif
>>>> #endif
>>>>
>>>> Do you think I should keep this in or take it out? Seems like a good
>>>> idea and when it all works we can see if we can make it lazy.
>>>>
>>>> 1)  Calling PetscOptions inside threads. I looked quickly at the code
>>>>> and it seems like it should be ok but perhaps not. This is one reason why
>>>>> having stuff like PetscOptionsBegin inside a low-level creation
>>>>> VecCreate_SeqCUDA_Private is normally not done in PETSc. Eventually this
>>>>> needs to be moved or reworked.
>>>>>
>>>>>
>>>> I will try this next. It is hard to see the stack here. I think I will
>>>> put it in ddt and put a breakpoint PetscOptionsEnd_Private. Other ideas
>>>> welcome.
>>>>
>>>> Mark
>>>>
>>>>
>>>>> 2) PetscCUDAInitializeCheck is not thread safe.If it is being call for
>>>>> the first timeby multiple threads there can be trouble. So edit init.c and
>>>>> under
>>>>>
>>>>> #if defined(PETSC_HAVE_CUDA)
>>>>>   ierr = PetscOptionsCheckCUDA(logView);CHKERRQ(ierr);
>>>>> #endif
>>>>>
>>>>> #if defined(PETSC_HAVE_HIP)
>>>>>   ierr = PetscOptionsCheckHIP(logView);CHKERRQ(ierr);
>>>>> #endif
>>>>>
>>>>> put in
>>>>> #if defined thread safety
>>>>> PetscCUPMInitializeCheck
>>>>> #endif
>>>>>
>>>>> this will force the initialize to be done before any threads are used
>>>>>
>>>>
>>>>>
>>>>

Re: [petsc-dev] Memory problem with OpenMP and Fieldsplit sub solvers

Reply via email to