Mark,

  Fix the logging before you run more. It will help show the true disparity between MatMult and the vector ops.
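For context on the logging fix: PETSc populates the GPU columns of -log_view only where the implementation calls the GPU logging hooks, so a kernel that skips them contributes nothing to the GPU Mflop/s column. A minimal sketch of that pattern, assuming the PetscLogGpuTimeBegin/PetscLogGpuTimeEnd/PetscLogGpuFlops hooks; the kernel launch and the nonzero count nnz are hypothetical stand-ins:

    #include <petscsys.h>

    /* Sketch: logging one device kernel so its work lands in the GPU
       columns of -log_view. launch_spmv_kernel() and nnz are hypothetical
       stand-ins, not PETSc API. */
    static PetscErrorCode LogOneGpuKernel(PetscLogDouble nnz)
    {
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);    /* start the GPU timer */
      /* launch_spmv_kernel(...); */
      ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);      /* stop the GPU timer  */
      ierr = PetscLogGpuFlops(2.0*nnz);CHKERRQ(ierr); /* ~2 flops per nonzero, credited to the GPU */
      PetscFunctionReturn(0);
    }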
> On Jan 21, 2022, at 9:37 PM, Mark Adams <[email protected]> wrote:
> 
> Here is one with 2M / GPU. Getting better.
> 
> On Fri, Jan 21, 2022 at 9:17 PM Barry Smith <[email protected]> wrote:
> 
>    Matt is correct, vectors are way too small.
> 
>    BTW: Now would be a good time to run some of the Report I benchmarks
>    on Crusher to get a feel for the kernel launch times and performance
>    on VecOps.
> 
>    Also Report 2.
> 
>    Barry
> 
>> On Jan 21, 2022, at 7:58 PM, Matthew Knepley <[email protected]> wrote:
>> 
>> On Fri, Jan 21, 2022 at 6:41 PM Mark Adams <[email protected]> wrote:
>> I am looking at the performance of a CG/Jacobi solve on a 3D Q2
>> Laplacian (ex13) on one Crusher node (8 GPUs on 4 GPU sockets; MI250X,
>> or is it MI200?). This is with a 16M equation problem. GPU-aware MPI
>> and non-GPU-aware MPI are similar (mat-vec is a little faster without
>> it, the total is about the same; call it noise).
>> 
>> I found that MatMult was about 3x faster using 8 cores/GPU, that is,
>> all 64 cores on the node, than when using 1 core/GPU, with the same
>> size problem of course. I was thinking MatMult should be faster with
>> just one MPI process. Oh well, worry about that later.
>> 
>> The bigger problem, and I have observed this to some extent with the
>> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector
>> operations are expensive or crazy expensive. You can see from the
>> attached output, and the times here, that the solve is dominated by
>> the not-mat-vec operations:
>> 
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult              400" jac_out_00*5_8_gpuawaremp*
>> MatMult              400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0 27 91100100  0 668874       0      0 0.00e+00    0 0.00e+00 100
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve               2" jac_out_001*_5_8_gpuawaremp*
>> KSPSolve               2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100100100100100 208923 1094405      0 0.00e+00    0 0.00e+00 100
>> 
>> Notes about the flop counters here:
>> * MatMult flops are not logged as GPU flops, but something is logged
>>   nonetheless.
>> * The GPU flop rate is 5x the total flop rate in KSPSolve :\
>> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we
>>   are at < 1%.
>> 
>> This looks complicated, so just a single remark:
>> 
>> My understanding of the benchmarking of vector ops led by Hannah was
>> that you needed to be much bigger than 16M to hit peak. I need to get
>> the tech report, but on 8 GPUs I would think you would be at 10% of
>> peak or something right off the bat at these sizes. Barry, is that
>> right?
>> 
>> Thanks,
>> 
>>    Matt
>> 
>> Anyway, not sure how to proceed, but I thought I would share. Maybe
>> ask the Kokkos guys if they have looked at Crusher.
>> 
>> Mark
>> 
>> -- 
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which
>> their experiments lead.
>> -- Norbert Wiener
>> 
>> https://www.cse.buffalo.edu/~knepley/
> 
> <jac_out_001_kokkos_Crusher_6_8_gpuawarempi.txt>
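As a quick arithmetic check on the ratios quoted in the thread, the two rates in the KSPSolve line can be plugged in directly; a tiny stand-alone program (the 200 Tflop/s FP64 node peak is taken from Mark's message as an assumption, not verified here):

    #include <stdio.h>

    int main(void)
    {
      /* Rates from the KSPSolve line of the log above */
      const double total_mflops = 208923.0;   /* total Mflop/s across the node */
      const double gpu_mflops   = 1094405.0;  /* GPU Mflop/s column            */
      const double peak_tflops  = 200.0;      /* assumed FP64 peak of one node */

      printf("GPU/total rate:   %.1fx\n", gpu_mflops / total_mflops);  /* ~5.2x  */
      printf("fraction of peak: %.2f%%\n",
             100.0 * total_mflops / (peak_tflops * 1.0e6));            /* ~0.10% */
      return 0;
    }

This reproduces both observations: the GPU rate is about 5.2x the total rate (the GPU column divides by device time only, excluding launch and communication overhead), and the solve runs at roughly 0.1% of the assumed node peak.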

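For Barry's suggestion of getting a feel for kernel launch times and VecOps performance, here is a minimal timing sketch, assuming a Kokkos build of PETSc; the size and repeat count are arbitrary, and the host-side timer deliberately includes launch overhead (a careful benchmark would also synchronize the device around the timed region):

    #include <petscvec.h>
    #include <petsctime.h>

    /* Time repeated VecAXPY calls, e.g.:
         mpiexec -n 8 ./vecbench -vec_type kokkos -n 2000000 */
    int main(int argc, char **argv)
    {
      Vec            x, y;
      PetscLogDouble t0, t1;
      PetscInt       i, n = 1000000;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
      ierr = PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL);CHKERRQ(ierr);
      ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
      ierr = VecSetSizes(x, PETSC_DECIDE, n);CHKERRQ(ierr);
      ierr = VecSetFromOptions(x);CHKERRQ(ierr);  /* -vec_type kokkos selects the GPU back end */
      ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
      ierr = VecSet(x, 1.0);CHKERRQ(ierr);
      ierr = VecSet(y, 2.0);CHKERRQ(ierr);
      ierr = VecAXPY(y, 1.0, x);CHKERRQ(ierr);    /* warm-up launch */
      ierr = PetscTime(&t0);CHKERRQ(ierr);
      for (i = 0; i < 100; i++) {ierr = VecAXPY(y, 1.0, x);CHKERRQ(ierr);}
      ierr = PetscTime(&t1);CHKERRQ(ierr);
      ierr = PetscPrintf(PETSC_COMM_WORLD, "VecAXPY: %g s/call at n = %" PetscInt_FMT "\n",
                         (double)(t1 - t0)/100.0, n);CHKERRQ(ierr);
      ierr = VecDestroy(&x);CHKERRQ(ierr);
      ierr = VecDestroy(&y);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return ierr;
    }

Sweeping -n from small to large should show where per-call time stops being dominated by launch latency and starts scaling with the vector length, which is the disparity at issue in this thread.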