Re: [petsc-dev] Memory problem with OpenMP and Fieldsplit sub solvers

Pierre Jolivet Sun, 24 Jan 2021 08:43:24 -0800


> On 24 Jan 2021, at 4:54 PM, Mark Adams <[email protected]> wrote:
> 
> Hi Sherry, I have this running with OMP, with cuSparse solves (PETSc CPU 
> factorizations)
> 
> Building SuperLU_dist w/o _OPENMP was not easy for me.


Expanding on Sherry’s answer on Thu Jan 21, it should be as easy as adding to 
your configure script 
'--download-superlu-cmake-arguments=-Denable_openmp=FALSE' 
'--download-superlu_dist-cmake-arguments=-Denable_openmp=FALSE'.
Is that not easy enough, or is it not working?

Thanks,
Pierre

> We need to get a better way to do this. (Satish or Barry?)
> 
> SuperLU works with one thread and two subdomains. With two threads I see this 
> (appended). So this seems to be working in that before it was hanging.
> 
> I set the solver up so that it does not use threads the first time it is 
> called so that solvers can get any lazy allocations done in serial. This is 
> just to be safe in that we do not use a Krylov method here and I don't 
> believe "preonly" allocates any work vectors, and SuperLU does the symbolic 
> factorizations without threads.
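
The serial-first-call approach described above can be sketched in plain C. This is only an illustration, not the PETSc fieldsplit code: `SplitSolver`, `BlockSolver`, `block_solve`, `apply_blocks`, and `free_blocks` are hypothetical names, and the "solve" is a stand-in that just halves its input. The point is the control flow: the first apply runs without threads so every lazy allocation happens serially, and later applies may use one OpenMP thread per block.

```c
#include <stdlib.h>

#define NBLOCKS 4 /* hypothetical number of independent sub-solves */
#define BLOCK_N 8 /* hypothetical length of each block's vectors */

typedef struct {
  double *work;    /* lazily allocated work buffer, NULL until first use */
  int     nsolves; /* how many times this block has been solved */
} BlockSolver;

typedef struct {
  BlockSolver blocks[NBLOCKS];
  int         setup_done; /* 0 until the first (serial) apply has run */
} SplitSolver;

/* Stand-in for one sub-solve: allocates its buffer on first use, then
 * halves x in place. Returns nonzero if the allocation fails. */
static int block_solve(BlockSolver *s, double *x)
{
  if (!s->work) {
    s->work = calloc(BLOCK_N, sizeof(double));
    if (!s->work) return 1;
  }
  for (int i = 0; i < BLOCK_N; i++) {
    s->work[i] = 0.5 * x[i];
    x[i]       = s->work[i];
  }
  s->nsolves++;
  return 0;
}

/* Apply every block solve. The first call is forced to run serially via
 * the OpenMP if() clause; later calls may run one thread per block.
 * Must itself be called from serial code. */
static int apply_blocks(SplitSolver *ss, double x[NBLOCKS][BLOCK_N])
{
  int err    = 0;
  int serial = !ss->setup_done;
  #pragma omp parallel for if(!serial) reduction(|:err)
  for (int b = 0; b < NBLOCKS; b++) err |= block_solve(&ss->blocks[b], x[b]);
  ss->setup_done = 1;
  return err;
}

/* Release the lazily allocated buffers. */
static void free_blocks(SplitSolver *ss)
{
  for (int b = 0; b < NBLOCKS; b++) {
    free(ss->blocks[b].work);
    ss->blocks[b].work = NULL;
  }
}
```

Built with OpenMP enabled (for example `-fopenmp` with GCC), the second and later calls to `apply_blocks` can run threaded; built without it, the pragma is ignored and everything stays serial, with the same results.
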
> 
> Let me know how you want to proceed.
> 
> Thanks,
> Mark
> 
> ijcusparse -dm_vec_type cuda' NC=2 |g energy
>   0) species-0: charge density= -1.6022862392985e+01 z-momentum= 
> -3.4369550192576e-19 energy=  9.6063873494138e+04
>   0) species-1: charge density=  1.6029950760009e+01 z-momentum= 
> -2.7844197929124e-18 energy=  9.6333444502318e+04
>  0) Total: charge density=  7.0883670236874e-03, momentum= 
> -3.1281152948382e-18, energy=  1.9239731799646e+05 (m_i[0]/m_e = 1835.47, 92 
> cells)
> ex2: 
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/externalpackages/git.superlu_dist/SRC/dSchCompUdt-cuda.c:157:
>  pdgstrf: Assertion `jjj-1<nub' failed.
> [h16n13:21073] *** Process received signal ***
> [h16n13:21073] Signal: Aborted (6)
> [h16n13:21073] Signal code: User function (kill, sigsend, abort, etc.) (0)
> [h16n13:21073] [ 0] [0x2000000504d8]
> [h16n13:21073] [ 1] /lib64/libc.so.6(abort+0x2b4)[0x200020ef2094]
> [h16n13:21073] [ 2] /lib64/libc.so.6(+0x356d4)[0x200020ee56d4]
> [h16n13:21073] [ 3] /lib64/libc.so.6(__assert_fail+0x64)[0x200020ee57c4]
> [h16n13:21073] [ 4] 
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libsuperlu_dist.so.6(pdgstrf+0x3848)[0x2000022fe5d8]
> [h16n13:21073] [ 5] 
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libsuperlu_dist.so.6(pdgssvx+0x1220)[0x2000022dc4a8]
> [h16n13:21073] [ 6] 
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0x9aff28)[0x200000a9ff28]
> [h16n13:21073] [ 7] 
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(MatLUFactorNumeric+0x144)[0x2000007d273c]
> [h16n13:21073] [ 8] 
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0xecffc4)[0x200000fbffc4]
> [h16n13:21073] [ 9] 
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(PCSetUp+0x134)[0x20000107dd38]
> [h16n13:21073] [10] 
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(KSPSetUp+0x9f8)[0x2000010b272c]
> [h16n13:21073] [11] 
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0xfc46f0)[0x2000010b46f0]
> [h16n13:21073] [12] 
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(KSPSolve+0x20)[0x2000010b6fb8]
> [h16n13:21073] [13] 
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0xf1e4f8)[0x20000100e4f8]
> [h16n13:21073] [14] 
> /sw/summit/gcc/6.4.0/lib64/libgomp.so.1(+0x1a51c)[0x200020e2a51c]
> [h16n13:21073] [15] /lib64/libpthread.so.0(+0x8b94)[0x200020e78b94]
> [h16n13:21073] [16] /lib64/libc.so.6(clone+0xe4)[0x200020fd85f4]
> [h16n13:21073] *** End of error message ***
> ERROR:  One or more process (first noticed rank 0) terminated with signal 6 
> (core dumped)
> make: [runasm] Error 134 (ignored)
> 
> On Thu, Jan 21, 2021 at 11:57 PM Xiaoye S. Li <[email protected] 
> <mailto:[email protected]>> wrote:
> All the OpenMP calls are surrounded by
> 
> #ifdef _OPENMP
> ...
> #endif
> 
> You can disable openmp during Cmake installation, with the following:
>     -Denable_openmp=FALSE
> (the default is true)
> 
> (I think Satish knows how to do this with PETSc installation)
> 
> -------
> The reason to use mixed MPI & OpenMP is mainly less memory consumption, 
> compared to pure MPI.  Timewise probably it is just slightly faster. (I think 
> that's the case with many codes.)
> 
> 
> Sherry
> 
> On Thu, Jan 21, 2021 at 7:20 PM Mark Adams <[email protected] 
> <mailto:[email protected]>> wrote:
> 
> 
> On Thu, Jan 21, 2021 at 10:16 PM Barry Smith <[email protected] 
> <mailto:[email protected]>> wrote:
> 
> 
>> On Jan 21, 2021, at 9:11 PM, Mark Adams <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> I have tried it and it hangs, but that is expected. This is not something 
>> she has prepared for.
>> 
>> I am working with Sherry on it.
>> 
>> And she is fine with just one thread and suggests it if she is in a thread. 
>> 
>> Now that I think about it, I don't understand why she needs OpenMP if she 
>> can live with OMP_NUM_THREADS=1.
> 
>  It is very possible it was just a coding decision by one of her students and 
> with a few ifdefs in her code she would not need OpenMP, but I don't 
> have the time or energy to check her code and design decisions.
> 
> Oh yeah, there are OMP calls like omp_get_num_threads() that need something. There is 
> probably an omp1.h file somewhere in the world like our serial MPI.
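
A serial stand-in for the OpenMP runtime calls, in the spirit of the "omp1.h" idea, could look roughly like the sketch below. It is an assumption-laden illustration, not anything shipped by SuperLU_DIST or PETSc: the stubbed functions are standard OpenMP runtime routines, but `lib_threads_to_use` is a hypothetical helper added only to show how library code might use them.

```c
/* Header-style sketch: lets code call a few OpenMP runtime routines
 * whether or not the compiler has OpenMP enabled. */
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks: a single thread, always thread 0, never inside a
 * parallel region. */
static inline int omp_get_num_threads(void) { return 1; }
static inline int omp_get_max_threads(void) { return 1; }
static inline int omp_get_thread_num(void)  { return 0; }
static inline int omp_in_parallel(void)     { return 0; }
#endif

/* Hypothetical library helper: stay serial when the caller is already
 * inside a parallel region, otherwise use whatever the runtime allows. */
static int lib_threads_to_use(void)
{
  if (omp_in_parallel()) return 1;
  return omp_get_max_threads();
}
```

With this in place the same source compiles with or without OpenMP; in the non-OpenMP build `lib_threads_to_use()` always returns 1, which is the behaviour wanted when the caller owns the threads.
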
>  
> 
>   Barry
> 
>> 
>> Mark
>> 
>> 
>> 
>> On Thu, Jan 21, 2021 at 9:30 PM Barry Smith <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> 
>>> On Jan 21, 2021, at 5:37 PM, Mark Adams <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> This did not work. I verified that MPI_Init_thread is being called 
>>> correctly and that MPI returns that it supports this highest level of 
>>> thread safety.
>>> 
>>> I am going to ask ORNL. 
>>> 
>>> And if I use:
>>> 
>>> -fieldsplit_i1_ksp_norm_type none
>>> -fieldsplit_i1_ksp_max_it 300
>>> 
>>> for all 9 "i" variables, I can run normal iterations on the 10th variable, 
>>> in a 10 species problem, and it works perfectly with 10 threads.
>>> 
>>> So it is definitely that VecNorm is not thread safe.
>>> 
>>> And, I want to call SuperLU_dist, which uses threads, but I don't want 
>>> SuperLU to start using threads. Is there a way to tell superLU that there 
>>> are no threads but have PETSc use them?
>> 
>>   My interpretation and Satish's for many years is that SuperLU_DIST has to 
>> be built with and use OpenMP in order to work with CUDA. 
>> 
>>   def formCMakeConfigureArgs(self):
>>     args = config.package.CMakePackage.formCMakeConfigureArgs(self)
>>     if self.openmp.found:
>>       self.usesopenmp = 'yes'
>>     else:
>>       args.append('-DCMAKE_DISABLE_FIND_PACKAGE_OpenMP=TRUE')
>>     if self.cuda.found:
>>       if not self.openmp.found:
>>         raise RuntimeError('SuperLU_DIST GPU code currently requires OpenMP. Use --with-openmp=1')
>> 
>> But this could be ok. You use OpenMP and then it uses OpenMP internally, 
>> each doing their own business (what could go wrong :-)).
>> 
>> Have you tried it?
>> 
>>   Barry
>> 
>> 
>>> 
>>> Thanks,
>>> Mark
>>> 
>>> On Thu, Jan 21, 2021 at 5:19 PM Mark Adams <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> OK, the problem is probably:
>>> 
>>> PetscMPIInt PETSC_MPI_THREAD_REQUIRED = MPI_THREAD_FUNNELED;
>>> 
>>> There is an example that sets:
>>> 
>>> PETSC_MPI_THREAD_REQUIRED = MPI_THREAD_MULTIPLE;
>>> 
>>> This is what I need.
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Jan 21, 2021 at 2:26 PM Mark Adams <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> 
>>> On Thu, Jan 21, 2021 at 2:11 PM Matthew Knepley <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> On Thu, Jan 21, 2021 at 2:02 PM Mark Adams <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> On Thu, Jan 21, 2021 at 1:44 PM Matthew Knepley <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> On Thu, Jan 21, 2021 at 11:16 AM Mark Adams <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> Yes, the problem is that each KSP solver is running in an OMP thread (so at 
>>> this point it only works for SELF, and it's Landau, so that is all I need). It 
>>> looks like MPI reductions called with a comm_self are not thread safe (e.g., 
>>> they could say, this is one proc, thus just copy send --> recv, but they 
>>> don't).
>>> 
>>> Instead of using SELF, how about Comm_dup() for each thread?
>>> 
>>> OK, raw MPI_Comm_dup. I tried PetscCommDup. Let me try this.
>>> Thanks, 
>>> 
>>> You would have to dup them all outside the OMP section, since it is not 
>>> threadsafe. Then each thread uses one I think.
>>> 
>>> Yea sure. I do it in SetUp.
>>> 
>>> Well, that worked to get different Comms, finally, but I still get the same 
>>> problem. The number of iterations differs wildly. This is two species and two 
>>> threads (13 SNES its, and it is not deterministic). Way below is one thread (8 
>>> its) with fairly uniform iteration counts.
>>> 
>>> Maybe this MPI is just not thread safe at all. Let me look into it.
>>> Thanks anyway,
>>> 
>>>    0 SNES Function norm 4.974994975313e-03
>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x80017c60. 
>>> Comms pc=0x67ad27c0 ksp=0x7ffe1600 newcomm=0x8014b6e0
>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7ffdabc0. 
>>> Comms pc=0x67ad27c0 ksp=0x7fff70d0 newcomm=0x7ffe9980
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
>>> 282
>>>     1 SNES Function norm 1.836376279964e-05
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 
>>> 19
>>>     2 SNES Function norm 3.059930074740e-07
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 
>>> 15
>>>     3 SNES Function norm 4.744275398121e-08
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 
>>> 4
>>>     4 SNES Function norm 4.014828563316e-08
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
>>> 456
>>>     5 SNES Function norm 5.670836337808e-09
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 
>>> 2
>>>     6 SNES Function norm 2.410421401323e-09
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 
>>> 18
>>>     7 SNES Function norm 6.533948191791e-10
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
>>> 458
>>>     8 SNES Function norm 1.008133815842e-10
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 
>>> 9
>>>     9 SNES Function norm 1.690450876038e-11
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 
>>> 4
>>>    10 SNES Function norm 1.336383986009e-11
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
>>> 463
>>>    11 SNES Function norm 1.873022410774e-12
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
>>> 113
>>>    12 SNES Function norm 1.801834606518e-13
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 
>>> 1
>>>    13 SNES Function norm 1.004397317339e-13
>>>   Nonlinear solve converged due to CONVERGED_SNORM_RELATIVE iterations 13
>>> 
>>> 
>>> 
>>> 
>>>     0 SNES Function norm 4.974994975313e-03
>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x6e265010. 
>>> Comms pc=0x56450340 ksp=0x6e2168d0 newcomm=0x6e265090
>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x6e25bc40. 
>>> Comms pc=0x56450340 ksp=0x6e22c1d0 newcomm=0x6e21e8f0
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
>>> 282
>>>     1 SNES Function norm 1.836376279963e-05
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
>>> 380
>>>     2 SNES Function norm 3.018499983019e-07
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
>>> 387
>>>     3 SNES Function norm 1.826353175637e-08
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
>>> 391
>>>     4 SNES Function norm 1.378600599548e-09
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
>>> 392
>>>     5 SNES Function norm 1.077289085611e-10
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
>>> 394
>>>     6 SNES Function norm 8.571891727748e-12
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
>>> 395
>>>     7 SNES Function norm 6.897647643450e-13
>>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
>>> 395
>>>     8 SNES Function norm 5.606434614114e-14
>>>   Nonlinear solve converged due to CONVERGED_SNORM_RELATIVE iterations 8
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>  
>>> 
>>>    Matt
>>>  
>>>   Matt
>>>  
>>> On Thu, Jan 21, 2021 at 10:46 AM Matthew Knepley <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> On Thu, Jan 21, 2021 at 10:34 AM Mark Adams <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> It looks like PETSc is just too clever for me. I am trying to get a 
>>> different MPI_Comm into each block, but PETSc is thwarting me:
>>> 
>>> It looks like you are using SELF. Is that what you want? Do you want a 
>>> bunch of comms with the same group, but independent somehow? I am confused.
>>> 
>>>    Matt
>>>  
>>>   if (jac->use_openmp) {
>>>     ierr = KSPCreate(MPI_COMM_SELF,&ilink->ksp);CHKERRQ(ierr);
>>>     PetscPrintf(PETSC_COMM_SELF,"In PCFieldSplitSetFields_FieldSplit with -------------- link: %p. Comms %p %p\n",ilink,PetscObjectComm((PetscObject)pc),PetscObjectComm((PetscObject)ilink->ksp));
>>>   } else {
>>>     ierr = KSPCreate(PetscObjectComm((PetscObject)pc),&ilink->ksp);CHKERRQ(ierr);
>>>   }
>>> 
>>> produces:
>>> 
>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7e9cb4f0. 
>>> Comms 0x660c6ad0 0x660c6ad0
>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7e88f7d0. 
>>> Comms 0x660c6ad0 0x660c6ad0
>>> 
>>> How can I work around this?
>>> 
>>> 
>>> On Thu, Jan 21, 2021 at 7:41 AM Mark Adams <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> 
>>> On Wed, Jan 20, 2021 at 6:21 PM Barry Smith <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> 
>>>> On Jan 20, 2021, at 3:09 PM, Mark Adams <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> So I put in a temporary hack to get the first Fieldsplit apply to NOT use 
>>>> OMP and it sort of works. 
>>>> 
>>>> Preonly/lu is fine. GMRES calls vector creates/dups in every solve so that 
>>>> is a big problem.
>>> 
>>>   It should definitely not be creating vectors "in every" solve. But it 
>>> does do lazy allocation of the needed restart vectors, which may make it look 
>>> like it is creating vectors in every solve.  You can use 
>>> -ksp_gmres_preallocate to force it to create all the restart vectors up 
>>> front at KSPSetUp(). 
>>> 
>>> Well, I run the first solve w/o OMP and I see Vec dups in cuSparse Vecs in 
>>> the 2nd solve. 
>>>  
>>> 
>>>   Why is creating vectors "at every solve" a problem? It is not thread safe 
>>> I guess?
>>> 
>>> It dies when it looks at the options database, in a Free in the get-options 
>>> method to be exact (see stacks). 
>>> 
>>> ======= Backtrace: =========
>>> /lib64/libc.so.6(cfree+0x4a0)[0x200021839be0]
>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(PetscFreeAlign+0x4c)[0x2000002a368c]
>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(PetscOptionsEnd_Private+0xf4)[0x2000002e53f0]
>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0x7c6c28)[0x2000008b6c28]
>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecCreate_SeqCUDA+0x11c)[0x20000052c510]
>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecSetType+0x670)[0x200000549664]
>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecCreateSeqCUDA+0x150)[0x20000052c0b0]
>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0x43c198)[0x20000052c198]
>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicate+0x44)[0x200000542168]
>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicateVecs_Default+0x148)[0x200000543820]
>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicateVecs+0x54)[0x2000005425f4]
>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(KSPCreateVecs+0x4b4)[0x2000016f0aec]
>>> 
>>>  
>>> 
>>>> Richardson works except the convergence test gets confused, presumably 
>>>> because MPI reductions with PETSC_COMM_SELF are not threadsafe.
>>> 
>>>> 
>>>> One fix for the norms might be to create each subdomain solver with a 
>>>> different communicator.
>>> 
>>>    Yes, you could do that. It might actually be the correct thing to do 
>>> anyway: if you have multiple threads call MPI reductions on the same 
>>> communicator, that would be a problem. Each KSP should get a new MPI_Comm. 
>>> 
>>> OK. I will only do this.
>>> 
>>> 
>>> 
>>> -- 
>>> What most experimenters take for granted before they begin their 
>>> experiments is infinitely more interesting than any results to which their 
>>> experiments lead.
>>> -- Norbert Wiener
>>> 
>>> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>
>>> 
>>> 
>> 
>
