The library is thread safe and its functions can be called from multiple host 
threads, even with the same handle. When multiple threads share the same 
handle, 



extreme care needs to be taken when the handle configuration is changed because 
that change will affect potentially subsequent cuBLAS calls in all threads. 

                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


It is even more true for the destruction of the handle. So it is not 
recommended that multiple thread share the same cuBLAS handle.


From my reading of this there should be absolutely no issue. The handle 
configuration is never being changed. It should be set in PetscInitialize() and 
destroyed in PetscFinalize(). Since lazy configuration of cuBLAS is turned off 
(right?).

  Barry



> On Jan 22, 2021, at 7:51 AM, Mark Adams <[email protected]> wrote:
> 
> OK, I found the problem. It is in cuBlas. This is the code for VecNorm in 
> VecCuda, with print statement added:
> 
>     cberr = cublasXnrm2(cublasv2handle,bn,xarray,one,z);CHKERRCUBLAS(cberr);
>     PetscScalar h_val; cudaMemcpy(&h_val, &xarray[0], sizeof(PetscScalar), 
> cudaMemcpyDeviceToHost);
>     PetscPrintf(PETSC_COMM_SELF,"VecNorm_SeqCUDA %d) x[0]=%g 
> |z|=%g\n",omp_get_thread_num(),h_val,*z);
> 
> After running a small job several times (this is not deterministic) I got a 
> run with a different result and the first VecNorm in an OMP loops gives:
> 
> VecNorm_SeqCUDA 0) x[0]=-8.38153e-08 |z|=0.
> 
> Clearly wrong. The cuBlas doc says:
> 
> 2.1.3. Thread Safety 
> <https://docs.nvidia.com/cuda/cublas/index.html#thread-safety2>
> The library is thread safe and its functions can be called from multiple host 
> threads, even with the same handle. When multiple threads share the same 
> handle, extreme care needs to be taken when the handle configuration is 
> changed because that change will affect potentially subsequent cuBLAS calls 
> in all threads. It is even more true for the destruction of the handle. So it 
> is not recommended that multiple thread share the same cuBLAS handle.
> 
> There are static handles in src/sys/objects/cuda/handle.c. Do you think I 
> should make these arrays of handles for each OMP thread?
> 
> If so, should I make a global #define PETSC_MAX_THREADS? assuming there is 
> nothing like this already.
> 
> Mark
> 
> 
> On Thu, Jan 21, 2021 at 6:37 PM Mark Adams <[email protected] 
> <mailto:[email protected]>> wrote:
> This did not work. I verified that MPI_Init_thread is being called correctly 
> and that MPI returns that it supports this highest level of thread safety.
> 
> I am going to ask ORNL. 
> 
> And if I use:
> 
> -fieldsplit_i1_ksp_norm_type none
> -fieldsplit_i1_ksp_max_it 300
> 
> for all 9 "i" variables, I can run normal iterations on the 10th variable, in 
> a 10 species problem, and it works perfectly with 10 threads.
> 
> So it is definitely that VecNorm is not thread safe.
> 
> And, I want to call SuperLU_dist, which uses threads, but I don't want 
> SuperLU to start using threads. Is there a way to tell superLU that there are 
> no threads but have PETSc use them?
> 
> Thanks,
> Mark
> 
> On Thu, Jan 21, 2021 at 5:19 PM Mark Adams <[email protected] 
> <mailto:[email protected]>> wrote:
> OK, the problem is probably:
> 
> PetscMPIInt PETSC_MPI_THREAD_REQUIRED = MPI_THREAD_FUNNELED;
> 
> There is an example that sets:
> 
> PETSC_MPI_THREAD_REQUIRED = MPI_THREAD_MULTIPLE;
> 
> This is what I need.
> 
> 
> 
> 
> On Thu, Jan 21, 2021 at 2:26 PM Mark Adams <[email protected] 
> <mailto:[email protected]>> wrote:
> 
> 
> On Thu, Jan 21, 2021 at 2:11 PM Matthew Knepley <[email protected] 
> <mailto:[email protected]>> wrote:
> On Thu, Jan 21, 2021 at 2:02 PM Mark Adams <[email protected] 
> <mailto:[email protected]>> wrote:
> On Thu, Jan 21, 2021 at 1:44 PM Matthew Knepley <[email protected] 
> <mailto:[email protected]>> wrote:
> On Thu, Jan 21, 2021 at 11:16 AM Mark Adams <[email protected] 
> <mailto:[email protected]>> wrote:
> Yes, the problem is that each KSP solver is running in an OMP thread (So at 
> this point it only works for SELF and its Landau so it is all I need). It 
> looks like MPI reductions called with a comm_self are not thread safe (eg, 
> the could say, this is one proc, thus, just copy send --> recv, but they 
> don't)
> 
> Instead of using SELF, how about Comm_dup() for each thread?
> 
> OK, raw MPI_Comm_dup. I tried PetscCommDup. Let me this.
> Thanks, 
> 
> You would have to dup them all outside the OMP section, since it is not 
> threadsafe. Then each thread uses one I think.
> 
> Yea sure. I do it in SetUp.
> 
> Well that worked to get different Comms, finally, I still get the same 
> problem. The number of iterations differ wildly. This two species and two 
> threads (13 SNES its that is not deterministic). Way below is one thread (8 
> its) and fairly uniform iteration counts.
> 
> Maybe this MPI is just not thread safe at all. Let me look into it.
> Thanks anyway,
> 
>    0 SNES Function norm 4.974994975313e-03
> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x80017c60. 
> Comms pc=0x67ad27c0 ksp=0x7ffe1600 newcomm=0x8014b6e0
> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7ffdabc0. 
> Comms pc=0x67ad27c0 ksp=0x7fff70d0 newcomm=0x7ffe9980
>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
> 282
>     1 SNES Function norm 1.836376279964e-05
>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 19
>     2 SNES Function norm 3.059930074740e-07
>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 15
>     3 SNES Function norm 4.744275398121e-08
>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 4
>     4 SNES Function norm 4.014828563316e-08
>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
> 456
>     5 SNES Function norm 5.670836337808e-09
>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 2
>     6 SNES Function norm 2.410421401323e-09
>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 18
>     7 SNES Function norm 6.533948191791e-10
>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
> 458
>     8 SNES Function norm 1.008133815842e-10
>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 9
>     9 SNES Function norm 1.690450876038e-11
>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 4
>    10 SNES Function norm 1.336383986009e-11
>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
> 463
>    11 SNES Function norm 1.873022410774e-12
>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
> 113
>    12 SNES Function norm 1.801834606518e-13
>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 1
>    13 SNES Function norm 1.004397317339e-13
>   Nonlinear solve converged due to CONVERGED_SNORM_RELATIVE iterations 13
> 
> 
> 
> 
>     0 SNES Function norm 4.974994975313e-03
> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x6e265010. 
> Comms pc=0x56450340 ksp=0x6e2168d0 newcomm=0x6e265090
> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x6e25bc40. 
> Comms pc=0x56450340 ksp=0x6e22c1d0 newcomm=0x6e21e8f0
>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
> 282
>     1 SNES Function norm 1.836376279963e-05
>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
> 380
>     2 SNES Function norm 3.018499983019e-07
>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
> 387
>     3 SNES Function norm 1.826353175637e-08
>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
> 391
>     4 SNES Function norm 1.378600599548e-09
>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
> 392
>     5 SNES Function norm 1.077289085611e-10
>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
> 394
>     6 SNES Function norm 8.571891727748e-12
>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
> 395
>     7 SNES Function norm 6.897647643450e-13
>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 
> 395
>     8 SNES Function norm 5.606434614114e-14
>   Nonlinear solve converged due to CONVERGED_SNORM_RELATIVE iterations 8
> 
> 
> 
> 
> 
> 
> 
>  
> 
>    Matt
>  
>   Matt
>  
> On Thu, Jan 21, 2021 at 10:46 AM Matthew Knepley <[email protected] 
> <mailto:[email protected]>> wrote:
> On Thu, Jan 21, 2021 at 10:34 AM Mark Adams <[email protected] 
> <mailto:[email protected]>> wrote:
> It looks like PETSc is just too clever for me. I am trying to get a different 
> MPI_Comm into each block, but PETSc is thwarting me:
> 
> It looks like you are using SELF. Is that what you want? Do you want a bunch 
> of comms with the same group, but independent somehow? I am confused.
> 
>    Matt
>  
>   if (jac->use_openmp) {
>     ierr          = KSPCreate(MPI_COMM_SELF,&ilink->ksp);CHKERRQ(ierr);
> PetscPrintf(PETSC_COMM_SELF,"In PCFieldSplitSetFields_FieldSplit with 
> -------------- link: %p. Comms %p 
> %p\n",ilink,PetscObjectComm((PetscObject)pc),PetscObjectComm((PetscObject)ilink->ksp));
>   } else {
>     ierr          = 
> KSPCreate(PetscObjectComm((PetscObject)pc),&ilink->ksp);CHKERRQ(ierr);
>   }
> 
> produces:
> 
> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7e9cb4f0. 
> Comms 0x660c6ad0 0x660c6ad0
> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7e88f7d0. 
> Comms 0x660c6ad0 0x660c6ad0
> 
> How can I work around this?
> 
> 
> On Thu, Jan 21, 2021 at 7:41 AM Mark Adams <[email protected] 
> <mailto:[email protected]>> wrote:
> 
> 
> On Wed, Jan 20, 2021 at 6:21 PM Barry Smith <[email protected] 
> <mailto:[email protected]>> wrote:
> 
> 
>> On Jan 20, 2021, at 3:09 PM, Mark Adams <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> So I put in a temporary hack to get the first Fieldsplit apply to NOT use 
>> OMP and it sort of works. 
>> 
>> Preonly/lu is fine. GMRES calls vector creates/dups in every solve so that 
>> is a big problem.
> 
>   It should definitely not be creating vectors "in every" solve. But it does 
> do lazy allocation of needed restarted vectors which may make it look like it 
> is creating "every" vectors in every solve.  You can use 
> -ksp_gmres_preallocate to force it to create all the restart vectors up front 
> at KSPSetUp(). 
> 
> Well, I run the first solve w/o OMP and I see Vec dups in cuSparse Vecs in 
> the 2nd solve. 
>  
> 
>   Why is creating vectors "at every solve" a problem? It is not thread safe I 
> guess?
> 
> It dies when it looks at the options database, in a Free in the get-options 
> method to be exact (see stacks). 
> 
> ======= Backtrace: =========
> /lib64/libc.so.6(cfree+0x4a0)[0x200021839be0]
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(PetscFreeAlign+0x4c)[0x2000002a368c]
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(PetscOptionsEnd_Private+0xf4)[0x2000002e53f0]
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0x7c6c28)[0x2000008b6c28]
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecCreate_SeqCUDA+0x11c)[0x20000052c510]
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecSetType+0x670)[0x200000549664]
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecCreateSeqCUDA+0x150)[0x20000052c0b0]
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0x43c198)[0x20000052c198]
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicate+0x44)[0x200000542168]
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicateVecs_Default+0x148)[0x200000543820]
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicateVecs+0x54)[0x2000005425f4]
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(KSPCreateVecs+0x4b4)[0x2000016f0aec]
> 
>  
> 
>> Richardson works except the convergence test gets confused, presumably 
>> because MPI reductions with PETSC_COMM_SELF is not threadsafe.
> 
>> 
>> One fix for the norms might be to create each subdomain solver with a 
>> different communicator.
> 
>    Yes you could do that. It might actually be the correct thing to do also, 
> if you have multiple threads call MPI reductions on the same communicator 
> that would be a problem. Each KSP should get a new MPI_Comm. 
> 
> OK. I will only do this.
> 
> 
> 
> -- 
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>
> 
> 
> -- 
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>
> 
> 
> -- 
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>

Reply via email to