Satish, can you tell me how I might configure SuperLU_DIST without _OPENMP defined?
Thanks,
Mark
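
For reference, a minimal sketch of how the configure logic might pass this through. It is not PETSc's actual configure code: the function name superlu_dist_cmake_args and its boolean parameters are hypothetical. Only the two CMake flags come from this thread (-Denable_openmp=FALSE from Sherry's reply, -DCMAKE_DISABLE_FIND_PACKAGE_OpenMP=TRUE from the snippet Barry quoted), and passing both together is an assumption.

```python
def superlu_dist_cmake_args(use_openmp, use_cuda, base_args=()):
    """Return CMake arguments for a SuperLU_DIST build (hypothetical helper)."""
    args = list(base_args)
    if use_cuda and not use_openmp:
        # Mirrors the check quoted in this thread: the GPU code path needs OpenMP.
        raise RuntimeError('SuperLU_DIST GPU code currently requires OpenMP. '
                           'Use --with-openmp=1')
    if not use_openmp:
        # Keep CMake from locating OpenMP at all (flag from the quoted configure code).
        args.append('-DCMAKE_DISABLE_FIND_PACKAGE_OpenMP=TRUE')
        # SuperLU_DIST's own switch, per Sherry's reply (the default is true).
        args.append('-Denable_openmp=FALSE')
    return args


if __name__ == '__main__':
    # CPU-only build without OpenMP.
    print(superlu_dist_cmake_args(use_openmp=False, use_cuda=False))
```

With OpenMP enabled the helper returns the base arguments unchanged, so the existing behavior is untouched.
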
On Thu, Jan 21, 2021 at 11:57 PM Xiaoye S. Li wrote:

> All the OpenMP calls are surrounded by
>
> #ifdef _OPENMP
> ...
> #endif
>
> You can disable openmp during Cmake installation, with the following:
> -Denable_openmp=FALSE
> (the default is true)
>
> (I think Satish knows how to do this with PETSc installation)
>
> -------
> The reason to use mixed MPI & OpenMP is mainly less memory consumption,
> compared to pure MPI. Timewise probably it is just slightly faster. (I
> think that's the case with many codes.)
>
> Sherry
>
> On Thu, Jan 21, 2021 at 7:20 PM Mark Adams wrote:
>
>> On Thu, Jan 21, 2021 at 10:16 PM Barry Smith wrote:
>>
>>> On Jan 21, 2021, at 9:11 PM, Mark Adams wrote:
>>>
>>> I have tried it and it hangs, but that is expected. This is not
>>> something she has prepared for.
>>>
>>> I am working with Sherry on it.
>>>
>>> And she is fine with just one thread and suggests it if she is in a
>>> thread.
>>>
>>> Now that I think about it, I don't understand why she needs OpenMP if
>>> she can live with OMP_NUM_THREADS=1.
>>>
>>> It is very possible it was just a coding decision by one of her
>>> students and with a few ifdef in her code she would not need the OpenMP
>>> but I don't have the time or energy to check her code and design decision.
>>>
>>
>> Oh yea, there are OMP calls like omp_num_threads() that need something. There
>> is probably an omp1.h file somewhere in the world like our serial MPI.
>>
>>>
>>> Barry
>>>
>>> Mark
>>>
>>> On Thu, Jan 21, 2021 at 9:30 PM Barry Smith wrote:
>>>
>>>> On Jan 21, 2021, at 5:37 PM, Mark Adams wrote:
>>>>
>>>> This did not work. I verified that MPI_Init_thread is being called
>>>> correctly and that MPI returns that it supports this highest level of
>>>> thread safety.
>>>>
>>>> I am going to ask ORNL.
>>>>
>>>> And if I use:
>>>>
>>>> -fieldsplit_i1_ksp_norm_type none
>>>> -fieldsplit_i1_ksp_max_it 300
>>>>
>>>> for all 9 "i" variables, I can run normal iterations on the 10th
>>>> variable, in a 10 species problem, and it works perfectly with 10 threads.
>>>>
>>>> So it is definitely that VecNorm is not thread safe.
>>>>
>>>> And, I want to call SuperLU_dist, which uses threads, but I don't want
>>>> SuperLU to start using threads. Is there a way to tell superLU that there
>>>> are no threads but have PETSc use them?
>>>>
>>>> My interpretation and Satish's for many years is that SuperLU_DIST
>>>> has to be built with and use OpenMP in order to work with CUDA.
>>>>
>>>>   def formCMakeConfigureArgs(self):
>>>>     args = config.package.CMakePackage.formCMakeConfigureArgs(self)
>>>>     if self.openmp.found:
>>>>       self.usesopenmp = 'yes'
>>>>     else:
>>>>       args.append('-DCMAKE_DISABLE_FIND_PACKAGE_OpenMP=TRUE')
>>>>     if self.cuda.found:
>>>>       if not self.openmp.found:
>>>>         raise RuntimeError('SuperLU_DIST GPU code currently requires OpenMP. Use --with-openmp=1')
>>>>
>>>> But this could be ok. You use OpenMP and then it uses OpenMP
>>>> internally, each doing their own business (what could go wrong :-)).
>>>>
>>>> Have you tried it?
>>>>
>>>> Barry
>>>>
>>>> Thanks,
>>>> Mark
>>>>
>>>> On Thu, Jan 21, 2021 at 5:19 PM Mark Adams wrote:
>>>>
>>>>> OK, the problem is probably:
>>>>>
>>>>> PetscMPIInt PETSC_MPI_THREAD_REQUIRED = MPI_THREAD_FUNNELED;
>>>>>
>>>>> There is an example that sets:
>>>>>
>>>>> PETSC_MPI_THREAD_REQUIRED = MPI_THREAD_MULTIPLE;
>>>>>
>>>>> This is what I need.
>>>>>
>>>>> On Thu, Jan 21, 2021 at 2:26 PM Mark Adams wrote:
>>>>>
>>>>>> On Thu, Jan 21, 2021 at 2:11 PM Matthew Knepley wrote:
>>>>>>
>>>>>>> On Thu, Jan 21, 2021 at 2:02 PM Mark Adams wrote:
>>>>>>>
>>>>>>>> On Thu, Jan 21, 2021 at 1:44 PM Matthew Knepley wrote:
>>>>>>>>
>>>>>>>>> On Thu, Jan 21, 2021 at 11:16 AM Mark Adams wrote:
>>>>>>>>>
>>>>>>>>>> Yes, the problem is that each KSP solver is running in an OMP
>>>>>>>>>> thread (so at this point it only works for SELF, and it's Landau so
>>>>>>>>>> it is all I need). It looks like MPI reductions called with a
>>>>>>>>>> comm_self are not thread safe (e.g., they could say, this is one
>>>>>>>>>> proc, thus, just copy send --> recv, but they don't)
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Instead of using SELF, how about Comm_dup() for each thread?
>>>>>>>>>
>>>>>>>>
>>>>>>>> OK, raw MPI_Comm_dup. I tried PetscCommDup. Let me try this.
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>
>>>>>>> You would have to dup them all outside the OMP section, since it is
>>>>>>> not threadsafe. Then each thread uses one I think.
>>>>>>>
>>>>>>
>>>>>> Yea sure. I do it in SetUp.
>>>>>>
>>>>>> Well, that worked to get *different Comms*, finally, but I still get the
>>>>>> same problem. The number of iterations differs wildly. This is two
>>>>>> species and two threads (13 SNES its, and that is not deterministic).
>>>>>> Way below is one thread (8 its) and fairly uniform iteration counts.
>>>>>>
>>>>>> Maybe this MPI is just not thread safe at all. Let me look into it.
>>>>>> Thanks anyway,
>>>>>>
>>>>>> 0 SNES Function norm 4.974994975313e-03
>>>>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x80017c60. Comms pc=0x67ad27c0 ksp=*0x7ffe1600* newcomm=0x8014b6e0
>>>>>> In PCFieldSplitSetFields_FieldSplit with -------------- link:
>>>>>> 0x7ffdabc0.
>>>>>> Comms pc=0x67ad27c0 ksp=*0x7fff70d0* newcomm=0x7ffe9980
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 282
>>>>>> 1 SNES Function norm 1.836376279964e-05
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 19
>>>>>> 2 SNES Function norm 3.059930074740e-07
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 15
>>>>>> 3 SNES Function norm 4.744275398121e-08
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 4
>>>>>> 4 SNES Function norm 4.014828563316e-08
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 456
>>>>>> 5 SNES Function norm 5.670836337808e-09
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 2
>>>>>> 6 SNES Function norm 2.410421401323e-09
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 18
>>>>>> 7 SNES Function norm 6.533948191791e-10
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 458
>>>>>> 8 SNES Function norm 1.008133815842e-10
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 9
>>>>>> 9 SNES Function norm 1.690450876038e-11
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 4
>>>>>> 10 SNES Function norm 1.336383986009e-11
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 463
>>>>>> 11 SNES Function norm 1.873022410774e-12
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 113
>>>>>> 12 SNES Function norm 1.801834606518e-13
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 1
>>>>>> 13 SNES Function norm 1.004397317339e-13
>>>>>> Nonlinear solve converged due to CONVERGED_SNORM_RELATIVE iterations 13
>>>>>>
>>>>>> 0 SNES Function norm 4.974994975313e-03
>>>>>> In
>>>>>> PCFieldSplitSetFields_FieldSplit with -------------- link: 0x6e265010. Comms pc=0x56450340 ksp=0x6e2168d0 newcomm=0x6e265090
>>>>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x6e25bc40. Comms pc=0x56450340 ksp=0x6e22c1d0 newcomm=0x6e21e8f0
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 282
>>>>>> 1 SNES Function norm 1.836376279963e-05
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 380
>>>>>> 2 SNES Function norm 3.018499983019e-07
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 387
>>>>>> 3 SNES Function norm 1.826353175637e-08
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 391
>>>>>> 4 SNES Function norm 1.378600599548e-09
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 392
>>>>>> 5 SNES Function norm 1.077289085611e-10
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 394
>>>>>> 6 SNES Function norm 8.571891727748e-12
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 395
>>>>>> 7 SNES Function norm 6.897647643450e-13
>>>>>> Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 395
>>>>>> 8 SNES Function norm 5.606434614114e-14
>>>>>> Nonlinear solve converged due to CONVERGED_SNORM_RELATIVE iterations 8
>>>>>>
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>>> On Thu, Jan 21, 2021 at 10:46 AM Matthew Knepley wrote:
>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 21, 2021 at 10:34 AM Mark Adams wrote:
>>>>>>>>>>>
>>>>>>>>>>>> It looks like PETSc is just too clever for me.
>>>>>>>>>>>> I am trying to get a different MPI_Comm into each block, but
>>>>>>>>>>>> PETSc is thwarting me:
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> It looks like you are using SELF. Is that what you want? Do you
>>>>>>>>>>> want a bunch of comms with the same group, but independent somehow?
>>>>>>>>>>> I am confused.
>>>>>>>>>>>
>>>>>>>>>>> Matt
>>>>>>>>>>>
>>>>>>>>>>>>   if (jac->use_openmp) {
>>>>>>>>>>>>     ierr = KSPCreate(MPI_COMM_SELF,&ilink->ksp);CHKERRQ(ierr);
>>>>>>>>>>>>     PetscPrintf(PETSC_COMM_SELF,"In PCFieldSplitSetFields_FieldSplit with -------------- link: %p. Comms %p %p\n",ilink,PetscObjectComm((PetscObject)pc),PetscObjectComm((PetscObject)ilink->ksp));
>>>>>>>>>>>>   } else {
>>>>>>>>>>>>     ierr = KSPCreate(PetscObjectComm((PetscObject)pc),&ilink->ksp);CHKERRQ(ierr);
>>>>>>>>>>>>   }
>>>>>>>>>>>>
>>>>>>>>>>>> produces:
>>>>>>>>>>>>
>>>>>>>>>>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7e9cb4f0. Comms 0x660c6ad0 0x660c6ad0
>>>>>>>>>>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7e88f7d0. Comms 0x660c6ad0 0x660c6ad0
>>>>>>>>>>>>
>>>>>>>>>>>> How can I work around this?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jan 21, 2021 at 7:41 AM Mark Adams wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jan 20, 2021 at 6:21 PM Barry Smith wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jan 20, 2021, at 3:09 PM, Mark Adams wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So I put in a temporary hack to get the first Fieldsplit
>>>>>>>>>>>>>> apply to NOT use OMP and it sort of works.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Preonly/lu is fine. GMRES calls vector creates/dups in every
>>>>>>>>>>>>>> solve so that is a big problem.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It should definitely not be creating vectors "in every"
>>>>>>>>>>>>>> solve. But it does do lazy allocation of needed restarted
>>>>>>>>>>>>>> vectors which may make it look like it is creating "every"
>>>>>>>>>>>>>> vectors in every solve. You can use -ksp_gmres_preallocate to
>>>>>>>>>>>>>> force it to create all the restart vectors up front at
>>>>>>>>>>>>>> KSPSetUp().
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Well, I run the first solve w/o OMP and I see Vec dups in
>>>>>>>>>>>>> cuSparse Vecs in the 2nd solve.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Why is creating vectors "at every solve" a problem? It is
>>>>>>>>>>>>>> not thread safe I guess?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> It dies when it looks at the options database, in a Free in
>>>>>>>>>>>>> the get-options method to be exact (see stacks).
>>>>>>>>>>>>>
>>>>>>>>>>>>> ======= Backtrace: =========
>>>>>>>>>>>>> /lib64/libc.so.6(cfree+0x4a0)[0x200021839be0]
>>>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(PetscFreeAlign+0x4c)[0x2000002a368c]
>>>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(PetscOptionsEnd_Private+0xf4)[0x2000002e53f0]
>>>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0x7c6c28)[0x2000008b6c28]
>>>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecCreate_SeqCUDA+0x11c)[0x20000052c510]
>>>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecSetType+0x670)[0x200000549664]
>>>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecCreateSeqCUDA+0x150)[0x20000052c0b0]
>>>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0x43c198)[0x20000052c198]
>>>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicate+0x44)[0x200000542168]
>>>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicateVecs_Default+0x148)[0x200000543820]
>>>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicateVecs+0x54)[0x2000005425f4]
>>>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(KSPCreateVecs+0x4b4)[0x2000016f0aec]
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Richardson works except the convergence test gets confused,
>>>>>>>>>>>>>> presumably because MPI reductions with PETSC_COMM_SELF is not
>>>>>>>>>>>>>> threadsafe.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One fix for the norms might be to create each
>>>>>>>>>>>>>> subdomain solver with a different communicator.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes you could do that. It might actually be the correct
>>>>>>>>>>>>>> thing to do also, if you have multiple threads call MPI
>>>>>>>>>>>>>> reductions on the same communicator that would be a problem.
>>>>>>>>>>>>>> Each KSP should get a new MPI_Comm.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK. I will only do this.
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>>>> experiments is infinitely more interesting than any results to
>>>>>>>>>>> which their experiments lead.
>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>
>>>>>>>>>>> https://www.cse.buffalo.edu/~knepley/
