Mark,

  Fix the logging before you run more. It will help show the true disparity between MatMult and the vector ops.
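For context on the logging fix: PETSc populates the GPU columns of -log_view only where the implementation calls the GPU logging hooks, so a kernel that skips them contributes nothing to the GPU Mflop/s column. A minimal sketch of that pattern, assuming the PetscLogGpuTimeBegin/PetscLogGpuTimeEnd/PetscLogGpuFlops hooks; the kernel launch and the nonzero count nnz are hypothetical stand-ins:

    #include <petscsys.h>

    /* Sketch: logging one device kernel so its work lands in the GPU
       columns of -log_view. launch_spmv_kernel() and nnz are hypothetical
       stand-ins, not PETSc API. */
    static PetscErrorCode LogOneGpuKernel(PetscLogDouble nnz)
    {
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);    /* start the GPU timer */
      /* launch_spmv_kernel(...); */
      ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);      /* stop the GPU timer  */
      ierr = PetscLogGpuFlops(2.0*nnz);CHKERRQ(ierr); /* ~2 flops per nonzero, credited to the GPU */
      PetscFunctionReturn(0);
    }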
> On Jan 21, 2022, at 9:37 PM, Mark Adams <[email protected]> wrote:
> 
> Here is one with 2M / GPU. Getting better.
> 
> On Fri, Jan 21, 2022 at 9:17 PM Barry Smith <[email protected]> wrote:
> 
>    Matt is correct, vectors are way too small.
> 
>    BTW: Now would be a good time to run some of the Report I benchmarks
>    on Crusher to get a feel for the kernel launch times and performance
>    on VecOps.
> 
>    Also Report 2.
> 
>    Barry
> 
>> On Jan 21, 2022, at 7:58 PM, Matthew Knepley <[email protected]> wrote:
>> 
>> On Fri, Jan 21, 2022 at 6:41 PM Mark Adams <[email protected]> wrote:
>> I am looking at the performance of a CG/Jacobi solve on a 3D Q2
>> Laplacian (ex13) on one Crusher node (8 GPUs on 4 GPU sockets; MI250X,
>> or is it MI200?). This is with a 16M equation problem. GPU-aware MPI
>> and non-GPU-aware MPI are similar (mat-vec is a little faster without
>> it, the total is about the same; call it noise).
>> 
>> I found that MatMult was about 3x faster using 8 cores/GPU, that is,
>> all 64 cores on the node, than when using 1 core/GPU, with the same
>> size problem of course. I was thinking MatMult should be faster with
>> just one MPI process. Oh well, worry about that later.
>> 
>> The bigger problem, and I have observed this to some extent with the
>> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector
>> operations are expensive or crazy expensive. You can see from the
>> attached output, and the times here, that the solve is dominated by
>> the not-mat-vec operations:
>> 
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult              400" jac_out_00*5_8_gpuawaremp*
>> MatMult              400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0 27 91100100  0 668874       0      0 0.00e+00    0 0.00e+00 100
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve               2" jac_out_001*_5_8_gpuawaremp*
>> KSPSolve               2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100100100100100 208923 1094405      0 0.00e+00    0 0.00e+00 100
>> 
>> Notes about the flop counters here:
>> * MatMult flops are not logged as GPU flops, but something is logged
>>   nonetheless.
>> * The GPU flop rate is 5x the total flop rate in KSPSolve :\
>> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we
>>   are at < 1%.
>> 
>> This looks complicated, so just a single remark:
>> 
>> My understanding of the benchmarking of vector ops led by Hannah was
>> that you needed to be much bigger than 16M to hit peak. I need to get
>> the tech report, but on 8 GPUs I would think you would be at 10% of
>> peak or something right off the bat at these sizes. Barry, is that
>> right?
>> 
>> Thanks,
>> 
>>    Matt
>> 
>> Anyway, not sure how to proceed, but I thought I would share. Maybe
>> ask the Kokkos guys if they have looked at Crusher.
>> 
>> Mark
>> 
>> -- 
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which
>> their experiments lead.
>> -- Norbert Wiener
>> 
>> https://www.cse.buffalo.edu/~knepley/
> 
> <jac_out_001_kokkos_Crusher_6_8_gpuawarempi.txt>
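As a quick arithmetic check on the ratios quoted in the thread, the two rates in the KSPSolve line can be plugged in directly; a tiny stand-alone program (the 200 Tflop/s FP64 node peak is taken from Mark's message as an assumption, not verified here):

    #include <stdio.h>

    int main(void)
    {
      /* Rates from the KSPSolve line of the log above */
      const double total_mflops = 208923.0;   /* total Mflop/s across the node */
      const double gpu_mflops   = 1094405.0;  /* GPU Mflop/s column            */
      const double peak_tflops  = 200.0;      /* assumed FP64 peak of one node */

      printf("GPU/total rate:   %.1fx\n", gpu_mflops / total_mflops);  /* ~5.2x  */
      printf("fraction of peak: %.2f%%\n",
             100.0 * total_mflops / (peak_tflops * 1.0e6));            /* ~0.10% */
      return 0;
    }

This reproduces both observations: the GPU rate is about 5.2x the total rate (the GPU column divides by device time only, excluding launch and communication overhead), and the solve runs at roughly 0.1% of the assumed node peak.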

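For Barry's suggestion of getting a feel for kernel launch times and VecOps performance, here is a minimal timing sketch, assuming a Kokkos build of PETSc; the size and repeat count are arbitrary, and the host-side timer deliberately includes launch overhead (a careful benchmark would also synchronize the device around the timed region):

    #include <petscvec.h>
    #include <petsctime.h>

    /* Time repeated VecAXPY calls, e.g.:
         mpiexec -n 8 ./vecbench -vec_type kokkos -n 2000000 */
    int main(int argc, char **argv)
    {
      Vec            x, y;
      PetscLogDouble t0, t1;
      PetscInt       i, n = 1000000;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
      ierr = PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL);CHKERRQ(ierr);
      ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
      ierr = VecSetSizes(x, PETSC_DECIDE, n);CHKERRQ(ierr);
      ierr = VecSetFromOptions(x);CHKERRQ(ierr);  /* -vec_type kokkos selects the GPU back end */
      ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
      ierr = VecSet(x, 1.0);CHKERRQ(ierr);
      ierr = VecSet(y, 2.0);CHKERRQ(ierr);
      ierr = VecAXPY(y, 1.0, x);CHKERRQ(ierr);    /* warm-up launch */
      ierr = PetscTime(&t0);CHKERRQ(ierr);
      for (i = 0; i < 100; i++) {ierr = VecAXPY(y, 1.0, x);CHKERRQ(ierr);}
      ierr = PetscTime(&t1);CHKERRQ(ierr);
      ierr = PetscPrintf(PETSC_COMM_WORLD, "VecAXPY: %g s/call at n = %" PetscInt_FMT "\n",
                         (double)(t1 - t0)/100.0, n);CHKERRQ(ierr);
      ierr = VecDestroy(&x);CHKERRQ(ierr);
      ierr = VecDestroy(&y);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return ierr;
    }

Sweeping -n from small to large should show where per-call time stops being dominated by launch latency and starts scaling with the vector length, which is the disparity at issue in this thread.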