I have tried it and it hangs, but that is expected. This is not something she has prepared for.
I am working with Sherry on it. And she is fine with just one thread and
suggests it if she is in a thread. Now that I think about it, I don't
understand why she needs OpenMP if she can live with OMP_NUM_THREADS=1.

Mark

On Thu, Jan 21, 2021 at 9:30 PM Barry Smith <[email protected]> wrote:

>
> On Jan 21, 2021, at 5:37 PM, Mark Adams <[email protected]> wrote:
>
> This did not work. I verified that MPI_Init_thread is being called
> correctly and that MPI returns that it supports this highest level of
> thread safety.
>
> I am going to ask ORNL.
>
> And if I use:
>
> -fieldsplit_i1_ksp_norm_type none
> -fieldsplit_i1_ksp_max_it 300
>
> for all 9 "i" variables, I can run normal iterations on the 10th variable,
> in a 10 species problem, and it works perfectly with 10 threads.
>
> So it is definitely that VecNorm is not thread safe.
>
> And, I want to call SuperLU_dist, which uses threads, but I don't want
> SuperLU to start using threads. Is there a way to tell superLU that there
> are no threads but have PETSc use them?
>
>
> My interpretation and Satish's for many years is that SuperLU_DIST has
> to be built with and use OpenMP in order to work with CUDA.
>
>   def formCMakeConfigureArgs(self):
>     args = config.package.CMakePackage.formCMakeConfigureArgs(self)
>     if self.openmp.found:
>       self.usesopenmp = 'yes'
>     else:
>       args.append('-DCMAKE_DISABLE_FIND_PACKAGE_OpenMP=TRUE')
>     if self.cuda.found:
>       if not self.openmp.found:
>         raise RuntimeError('SuperLU_DIST GPU code currently requires OpenMP. Use --with-openmp=1')
>
> But this could be ok. You use OpenMP and then it uses OpenMP internally,
> each doing their own business (what could go wrong :-)).
>
> Have you tried it?
>
> Barry
>
>
> Thanks,
> Mark
>
> On Thu, Jan 21, 2021 at 5:19 PM Mark Adams <[email protected]> wrote:
>
>> OK, the problem is probably:
>>
>> PetscMPIInt PETSC_MPI_THREAD_REQUIRED = MPI_THREAD_FUNNELED;
>>
>> There is an example that sets:
>>
>> PETSC_MPI_THREAD_REQUIRED = MPI_THREAD_MULTIPLE;
>>
>> This is what I need.
>>
>>
>> On Thu, Jan 21, 2021 at 2:26 PM Mark Adams <[email protected]> wrote:
>>
>>>
>>> On Thu, Jan 21, 2021 at 2:11 PM Matthew Knepley <[email protected]> wrote:
>>>
>>>> On Thu, Jan 21, 2021 at 2:02 PM Mark Adams <[email protected]> wrote:
>>>>
>>>>> On Thu, Jan 21, 2021 at 1:44 PM Matthew Knepley <[email protected]> wrote:
>>>>>
>>>>>> On Thu, Jan 21, 2021 at 11:16 AM Mark Adams <[email protected]> wrote:
>>>>>>
>>>>>>> Yes, the problem is that each KSP solver is running in an OMP thread
>>>>>>> (so at this point it only works for SELF, and it's Landau, so it is
>>>>>>> all I need). It looks like MPI reductions called with a comm_self are
>>>>>>> not thread safe (e.g., they could say, this is one proc, thus, just
>>>>>>> copy send --> recv, but they don't).
>>>>>>>
>>>>>>
>>>>>> Instead of using SELF, how about Comm_dup() for each thread?
>>>>>>
>>>>>
>>>>> OK, raw MPI_Comm_dup. I tried PetscCommDup. Let me try this.
>>>>> Thanks,
>>>>>
>>>>
>>>> You would have to dup them all outside the OMP section, since it is not
>>>> threadsafe. Then each thread uses one I think.
>>>>
>>>
>>> Yea sure. I do it in SetUp.
>>>
>>> Well that worked to get *different Comms*, finally, but I still get the
>>> same problem. The number of iterations differs wildly. This is two species
>>> and two threads (13 SNES its, which is not deterministic). Way below is
>>> one thread (8 its) and fairly uniform iteration counts.
>>>
>>> Maybe this MPI is just not thread safe at all. Let me look into it.
>>> Thanks anyway,
>>>
>>>   0 SNES Function norm 4.974994975313e-03
>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x80017c60. Comms pc=0x67ad27c0 ksp=*0x7ffe1600* newcomm=0x8014b6e0
>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7ffdabc0. Comms pc=0x67ad27c0 ksp=*0x7fff70d0* newcomm=0x7ffe9980
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 282
>>>   1 SNES Function norm 1.836376279964e-05
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 19
>>>   2 SNES Function norm 3.059930074740e-07
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 15
>>>   3 SNES Function norm 4.744275398121e-08
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 4
>>>   4 SNES Function norm 4.014828563316e-08
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 456
>>>   5 SNES Function norm 5.670836337808e-09
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 2
>>>   6 SNES Function norm 2.410421401323e-09
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 18
>>>   7 SNES Function norm 6.533948191791e-10
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 458
>>>   8 SNES Function norm 1.008133815842e-10
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 9
>>>   9 SNES Function norm 1.690450876038e-11
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 4
>>>  10 SNES Function norm 1.336383986009e-11
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 463
>>>  11 SNES Function norm 1.873022410774e-12
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 113
>>>  12 SNES Function norm 1.801834606518e-13
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 1
>>>  13 SNES Function norm 1.004397317339e-13
>>> Nonlinear solve converged due to CONVERGED_SNORM_RELATIVE iterations 13
>>>
>>>
>>>   0 SNES Function norm 4.974994975313e-03
>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x6e265010. Comms pc=0x56450340 ksp=0x6e2168d0 newcomm=0x6e265090
>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x6e25bc40. Comms pc=0x56450340 ksp=0x6e22c1d0 newcomm=0x6e21e8f0
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 282
>>>   1 SNES Function norm 1.836376279963e-05
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 380
>>>   2 SNES Function norm 3.018499983019e-07
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 387
>>>   3 SNES Function norm 1.826353175637e-08
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 391
>>>   4 SNES Function norm 1.378600599548e-09
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 392
>>>   5 SNES Function norm 1.077289085611e-10
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 394
>>>   6 SNES Function norm 8.571891727748e-12
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 395
>>>   7 SNES Function norm 6.897647643450e-13
>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 395
>>>   8 SNES Function norm 5.606434614114e-14
>>> Nonlinear solve converged due to CONVERGED_SNORM_RELATIVE iterations 8
>>>
>>>
>>>>
>>>> Matt
>>>>
>>>>
>>>>> Matt
>>>>>>
>>>>>>
>>>>>>> On Thu, Jan 21, 2021 at 10:46 AM Matthew Knepley <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Thu, Jan 21, 2021 at 10:34 AM Mark Adams <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> It looks like PETSc is just too clever for me. I am trying to get
>>>>>>>>> a different MPI_Comm into each block, but PETSc is thwarting me:
>>>>>>>>>
>>>>>>>>
>>>>>>>> It looks like you are using SELF. Is that what you want? Do you
>>>>>>>> want a bunch of comms with the same group, but independent somehow?
>>>>>>>> I am confused.
>>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>>
>>>>>>>>>   if (jac->use_openmp) {
>>>>>>>>>     ierr = KSPCreate(MPI_COMM_SELF,&ilink->ksp);CHKERRQ(ierr);
>>>>>>>>>     PetscPrintf(PETSC_COMM_SELF,"In PCFieldSplitSetFields_FieldSplit with -------------- link: %p. Comms %p %p\n",ilink,PetscObjectComm((PetscObject)pc),PetscObjectComm((PetscObject)ilink->ksp));
>>>>>>>>>   } else {
>>>>>>>>>     ierr = KSPCreate(PetscObjectComm((PetscObject)pc),&ilink->ksp);CHKERRQ(ierr);
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>> produces:
>>>>>>>>>
>>>>>>>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7e9cb4f0. Comms 0x660c6ad0 0x660c6ad0
>>>>>>>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7e88f7d0. Comms 0x660c6ad0 0x660c6ad0
>>>>>>>>>
>>>>>>>>> How can I work around this?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jan 21, 2021 at 7:41 AM Mark Adams <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 20, 2021 at 6:21 PM Barry Smith <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Jan 20, 2021, at 3:09 PM, Mark Adams <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> So I put in a temporary hack to get the first Fieldsplit apply
>>>>>>>>>>> to NOT use OMP and it sort of works.
>>>>>>>>>>>
>>>>>>>>>>> Preonly/lu is fine. GMRES calls vector creates/dups in every
>>>>>>>>>>> solve so that is a big problem.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> It should definitely not be creating vectors "in every" solve.
>>>>>>>>>>> But it does do lazy allocation of the restart vectors it needs,
>>>>>>>>>>> which may make it look like it is creating vectors in "every"
>>>>>>>>>>> solve. You can use -ksp_gmres_preallocate to force it to create
>>>>>>>>>>> all the restart vectors up front at KSPSetUp().
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Well, I run the first solve w/o OMP and I see Vec dups in
>>>>>>>>>> cuSparse Vecs in the 2nd solve.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Why is creating vectors "at every solve" a problem? It is not
>>>>>>>>>>> thread safe I guess?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> It dies when it looks at the options database, in a Free in the
>>>>>>>>>> get-options method to be exact (see stacks).
>>>>>>>>>>
>>>>>>>>>> ======= Backtrace: =========
>>>>>>>>>> /lib64/libc.so.6(cfree+0x4a0)[0x200021839be0]
>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(PetscFreeAlign+0x4c)[0x2000002a368c]
>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(PetscOptionsEnd_Private+0xf4)[0x2000002e53f0]
>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0x7c6c28)[0x2000008b6c28]
>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecCreate_SeqCUDA+0x11c)[0x20000052c510]
>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecSetType+0x670)[0x200000549664]
>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecCreateSeqCUDA+0x150)[0x20000052c0b0]
>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0x43c198)[0x20000052c198]
>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicate+0x44)[0x200000542168]
>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicateVecs_Default+0x148)[0x200000543820]
>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicateVecs+0x54)[0x2000005425f4]
>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(KSPCreateVecs+0x4b4)[0x2000016f0aec]
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Richardson works except the convergence test gets confused,
>>>>>>>>>>> presumably because MPI reductions with PETSC_COMM_SELF are not
>>>>>>>>>>> threadsafe.
>>>>>>>>>>>
>>>>>>>>>>> One fix for the norms might be to create each subdomain solver
>>>>>>>>>>> with a different communicator.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Yes you could do that. It might actually be the correct thing
>>>>>>>>>>> to do also. If you have multiple threads call MPI reductions on
>>>>>>>>>>> the same communicator, that would be a problem. Each KSP should
>>>>>>>>>>> get a new MPI_Comm.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> OK. I will only do this.
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>> experiments is infinitely more interesting than any results to which
>>>>>>>> their experiments lead.
>>>>>>>> -- Norbert Wiener
>>>>>>>>
>>>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>>>> <http://www.cse.buffalo.edu/~knepley/>
