Fande, let me know if you'd like to run the libCEED hyperelasticity solver on GPUs. It's matrix-free p-multigrid (assembling only the coarse problem). It uses DMPlex for mesh management.

Mark Adams <[email protected]> writes:

> Also Fande,
>
> If you are _not_ using NVIDIA with the MPS system, then you should run with
> the default -cells 1,1,1, and use just one MPI process and one GPU.
> This will be fine for evaluating the GPU.
>
> If you want to use more than one MPI process per GPU (because you want to
> use the CPUs in the rest of your app) then the MPS system is important (I
> see 3x speedup) and I would use NVIDIA+MPS unless you talk to
> someone/vendor knowledgeable about using more than one MPI/GPU.
>
> Now if you can use NVIDIA+MPS it would be interesting to compare GPU solver
> performance with single vs multiple MPI/GPU. It should be faster to use one
> MPI/GPU (running the same problem of course), but it would be interesting
> to quantify this.
> If you want to do this then I can explain how to do it.
>
> Thanks,
> Mark
>
> On Mon, Dec 6, 2021 at 10:03 PM Mark Adams <[email protected]> wrote:
>
>> * snes/ex56 runs a convergence study and confusingly sets the options
>> manually, thus erasing your -ex56_dm_refine.
>>
>> * To refine, use -max_conv_its N <3>, this sets the number of steps of
>> refinement. That is, the length of the convergence study
>>
>> * You can adjust where it starts from with -cells i,j,k <1,1,1>
>> You do want to set this if you have multiple MPI processes so that the
>> size of this mesh is the number of processes. That way it starts with one
>> cell per process and refines from there.
>>
>> * GPU speedup is all about subdomain size. AMG has lots of kernel launches
>> and you need to overcome this before you get net gain.
>> Very rough numbers: I see a speedup of about 5-10x with a few million
>> equations per GPU.
>> As Matt said the assembly is on the CPU and ex56 gets really slow on
>> larger problems. Be prepared to run the largest case for close to an hour.
>> This setup is not measured in KSP[SNES]Solve in the -log_view output so
>> look at that.
>> When you do this convergence study there will be a new stage created for
>> each refinement, so one run will give you a range of problem size data.
>> Each refinement step increases the problem size by 8x so when the solve
>> times increase by ~8x then that tells you you are past the
>> latency dominated regime. You want to get into that to see gain.
>>
>> * The end of the source file has example parameters that you should use
>> (the gamg one)
>>
>> * src/snes/tests/ex13.c is designed to be a benchmark test and it
>> partitions the problem better in parallel and has modern Plex usage. If you
>> are doing large scale parallelism then you should use this.
>> (It is a little hard to understand. Not well documented.)
>>
>> Hope that helps,
>> Mark
>>
>> On Mon, Dec 6, 2021 at 9:05 PM Fande Kong <[email protected]> wrote:
>>
>>>
>>>
>>> On Mon, Dec 6, 2021 at 5:59 PM Matthew Knepley <[email protected]> wrote:
>>>
>>>> On Mon, Dec 6, 2021 at 7:54 PM Fande Kong <[email protected]> wrote:
>>>>
>>>>> Thanks, Matt,
>>>>>
>>>>> Sorry, I still have more questions on this example. How to refine mesh
>>>>> to make the problem larger?
>>>>>
>>>>> I tried the following options, and none of them worked. I might do
>>>>> something wrong.
>>>>>
>>>>> -ex56_dm_refine 9
>>>>>
>>>>> and
>>>>>
>>>>> -dm_refine 4
>>>>>
>>>>
>>>> The mesh handling in this example does not conform to the others, but it
>>>> appears that
>>>>
>>>> -ex56_dm_refine <k>
>>>>
>>>> should take effect at
>>>>
>>>>
>>>> https://gitlab.com/petsc/petsc/-/blob/main/src/snes/tutorials/ex56.c#L381
>>>>
>>>>
>>> I was puzzled about this because DMSetFromOptions does not seem to
>>> trigger -ex56_dm_refine.
>>>
>>> I did a search, and could not find where we call " -ex56_dm_refine" in
>>> PETSc.
>>>
>>> I got the same result by running the following two combinations:
>>>
>>> 1) ./ex56 -log_view -snes_view -max_conv_its 3 -ex56_dm_refine 10
>>>
>>> 2) ./ex56 -log_view -snes_view -max_conv_its 3 -ex56_dm_refine 0
>>>
>>> Thanks,
>>>
>>> Fande
>>>
>>>
>>> unless you are setting max_conv_its to 0 somehow.
>>>>
>>>> Thanks,
>>>>
>>>> Matt
>>>>
>>>>
>>>>> Thanks,
>>>>>
>>>>> Fande
>>>>>
>>>>> On Mon, Dec 6, 2021 at 5:04 PM Matthew Knepley <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> On Mon, Dec 6, 2021 at 7:02 PM Fande Kong <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks, Matt
>>>>>>>
>>>>>>> On Mon, Dec 6, 2021 at 4:47 PM Matthew Knepley <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Mon, Dec 6, 2021 at 6:40 PM Fande Kong <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Dear PETSc team,
>>>>>>>>>
>>>>>>>>> I am interested in a careful evaluation of PETSc GPU performance in
>>>>>>>>> our INL cluster.
>>>>>>>>>
>>>>>>>>> Any example in PETSc that can show GPU speedup with solving a
>>>>>>>>> nonlinear equation?
>>>>>>>>>
>>>>>>>>> I talked to Junchao; he suggested that I try SNES/tutorial/ex56. I
>>>>>>>>> tried that, but I could not find any speedup using the GPU. I could
>>>>>>>>> attach
>>>>>>>>> some results of "log_view" later if we would like to see that.
>>>>>>>>>
>>>>>>>>
>>>>>>>> We should note that you will only see speedup in the solver, so that
>>>>>>>> problem has to be pretty large. I believe Mark has good results with
>>>>>>>> it.
>>>>>>>> The assembly is still all on the CPU. I am working on this over
>>>>>>>> break, and hope to have a CEED version of it by the new year.
>>>>>>>>
>>>>>>>
>>>>>>> Are both function and matrix assemblies on CPU? Or just the matrix
>>>>>>> assembly?
>>>>>>>
>>>>>>
>>>>>> There is no GPU assembly right now.
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>
>>>>>>> OK, I will try to check the solver part
>>>>>>>
>>>>>>> Thanks, again
>>>>>>>
>>>>>>> Fande
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>>
>>>>>>>>> Appreciate any instructions/comments about running a simple PETSc
>>>>>>>>> GPU example to get a speedup.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Fande
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>> experiments is infinitely more interesting than any results to which
>>>>>>>> their
>>>>>>>> experiments lead.
>>>>>>>> -- Norbert Wiener
>>>>>>>>
>>>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> What most experimenters take for granted before they begin their
>>>>>> experiments is infinitely more interesting than any results to which
>>>>>> their
>>>>>> experiments lead.
>>>>>> -- Norbert Wiener
>>>>>>
>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which their
>>>> experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>> https://www.cse.buffalo.edu/~knepley/
>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>
>>>
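
Two points in the thread are simple arithmetic and may be easier to follow as a sketch: choosing -cells i,j,k so that the starting mesh has one cell per MPI process, and spotting the refinement step at which solve times start growing by roughly 8x (the sign, per Mark, that you are past the latency-dominated regime). The Python below is only an illustration under stated assumptions. The function names, the 25% tolerance, and the timing numbers are made up for this example; they are not part of PETSc or ex56, and real timings would have to be read by hand from the per-stage -log_view output.

```python
from typing import List, Optional, Tuple


def cells_for_ranks(nranks: int) -> Tuple[int, int, int]:
    """Hypothetical helper: pick (i, j, k) with i*j*k == nranks, as balanced
    as possible, to pass as -cells i,j,k (one starting cell per MPI process)."""
    best = (1, 1, nranks)
    for i in range(1, nranks + 1):
        if nranks % i:
            continue
        rest = nranks // i
        for j in range(i, rest + 1):
            if rest % j:
                continue
            k = rest // j
            # Keep i <= j <= k and prefer the smallest spread between factors.
            if k >= j and (k - i) < (best[2] - best[0]):
                best = (i, j, k)
    return best


def problem_sizes(n0: int, steps: int) -> List[int]:
    """Each refinement step multiplies the problem size by 8 (per the thread)."""
    return [n0 * 8 ** s for s in range(steps)]


def first_scaling_stage(solve_times: List[float],
                        growth: float = 8.0,
                        tol: float = 0.25) -> Optional[int]:
    """Return the index of the first stage whose solve time grew by about
    `growth` relative to the previous stage, or None if none did.
    The tolerance is an arbitrary choice for this sketch."""
    for n in range(1, len(solve_times)):
        ratio = solve_times[n] / solve_times[n - 1]
        if ratio >= growth * (1.0 - tol):
            return n
    return None


# Made-up solve times (seconds) for four refinement stages, illustration only.
example_times = [0.10, 0.15, 0.40, 3.0]
example_stage = first_scaling_stage(example_times)
```

With 8 MPI processes, cells_for_ranks(8) gives (2, 2, 2), which would correspond to -cells 2,2,2. In the made-up timings, only the last step grows by close to 8x, so that is the first stage where a GPU comparison would be meaningful.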
