Re: [petsc-dev] Questions around benchmarking and data loading with PETSc

Barry Smith Sat, 11 Dec 2021 09:12:31 -0800


> On Dec 11, 2021, at 11:52 AM, Rohan Yadav <[email protected]> wrote:
> 
> Thanks Barry!
> 
> >      The flop rates for the sparse matrix-vector product are very low for 
> > an IBM Power 9. This is probably, at least partially, because the code is 
> > configured without any optimization flags. You should run ./configure with 
> > additional options something like COPTFLAGS="-O3"  CXXOPTFLAGS="-O3"  
> > FOPTFLAGS="-O3" but please consult the IBM documentation to determine 
> > exactly what optimization flags to use for mpixlc and mpixlf.
> 
> This is a great catch! I was using the pre-built petsc provided on Lassen, so 
> I'm very surprised that it wasn't built with optimizations. I'll try building 
> with optimizations enabled and see what the performance is.
> 
> >    When running in parallel I would expect the "sweet spot" of optimal 
> > performance to be roughly around 20 MPI ranks since the memory bandwidth of 
> > the CPU will be saturated long before you reach 40 ranks. I would recommend 
> > running with 1, 2, 3, 4, .... ranks to determine the optimal number of 
> > ranks. Also please consult the documentation on the placement of the ranks 
> > into the cores of the CPU; it is crucial to get this right and likely the 
> > default is far from correct. Essentially you want each core used to be as 
> > far away from the other cores being used as possible to maximize the 
> > achievable memory bandwidth. So the first core should be on the first 
> > socket, the second core on the second socket, the third core back on the 
> > first socket far from the first core (that is it should not share L1 or L2 
> > cache with the first core), etc.
> 
> I did a sweep of rank counts already and found that 40 is the best performing 
> on this system.


   It may be different with optimization turned on. I am surprised that it 
is 40; usually it is lower.
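
   For reference, the timings reported in the original message quoted further 
down can be turned into speedup and parallel efficiency at 40 ranks. The 
following is only a minimal Python sketch; the function names are illustrative 
and the only data used are the four timings from that message.

```python
# Timings (milliseconds) copied from the original message in this thread.
TIMINGS_MS = {
    "PETSc": {1: 5694.72, 40: 262.6},
    "TACO": {1: 1341.0, 40: 86.0},
}


def speedup(t_serial_ms, t_parallel_ms):
    """Ratio of the single-rank time to the multi-rank time."""
    return t_serial_ms / t_parallel_ms


def efficiency(t_serial_ms, t_parallel_ms, nranks):
    """Speedup divided by the number of ranks (1.0 would be ideal scaling)."""
    return speedup(t_serial_ms, t_parallel_ms) / nranks


if __name__ == "__main__":
    for name, t in TIMINGS_MS.items():
        s = speedup(t[1], t[40])
        e = efficiency(t[1], t[40], 40)
        print(f"{name}: speedup {s:.2f}x, efficiency {e:.2f}")
```

   Both codes end up well below ideal scaling at 40 ranks, which is the 
behavior one would expect from a kernel limited by memory bandwidth.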
> 
> > The arabic-2005  matrix is not at all representative of the types of 
> > matrices PETSc is designed to solve. It does not come from a PDE and does 
> > not have the stencil structure of a matrix that comes from a PDE. PETSc's 
> > performance on such a matrix will be much lower than its performance for 
> > PDE matrices since PETSc is not designed for this type of matrix. Depending 
> > on the goals of your work you may want to use different matrices that come 
> > from PDEs.
> 
> I'm probably not using PETSc for solvers right now, but more so for 
> distributed sparse linear algebra operations. Is the matrix structure going 
> to affect PETSc's performance that much for these kinds of operations?

   Yes, especially with multiple MPI ranks. The reason is that for graphs 
like arabic-2005, the PETSc parallel CSR format, split by rows across MPI 
ranks, is not a good layout of the data; it induces a lot of communication.
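
   To illustrate the point, here is a rough Python sketch (not PETSc code) 
that counts, for each rank, how many distinct vector entries owned by other 
ranks are needed for y = A*x under a contiguous row split. The simplified 
block split, the 1D three-point stencil, and the random graph are all 
assumptions chosen for illustration; arabic-2005 itself is not used.

```python
import random


def offrank_columns(rows, n, nranks):
    """Count distinct off-rank x entries each rank needs for y = A*x.

    rows[i] lists the column indices of the nonzeros in row i. Rows and the
    entries of x are split into contiguous blocks of ceil(n / nranks), a
    simplification of a row-wise parallel CSR layout.
    """
    block = (n + nranks - 1) // nranks
    needed = [set() for _ in range(nranks)]
    for i, cols in enumerate(rows):
        rank = i // block
        for j in cols:
            if j // block != rank:
                needed[rank].add(j)  # x[j] must come from another rank
    return [len(s) for s in needed]


def stencil_1d(n):
    """Tridiagonal pattern, as produced by a 1D three-point PDE stencil."""
    return [[j for j in (i - 1, i, i + 1) if 0 <= j < n] for i in range(n)]


def random_graph(n, nnz_per_row, seed=0):
    """Pattern with nonzeros scattered uniformly, loosely graph-like."""
    rng = random.Random(seed)
    return [rng.sample(range(n), nnz_per_row) for _ in range(n)]


if __name__ == "__main__":
    n, nranks = 4000, 4
    print("stencil:", offrank_columns(stencil_1d(n), n, nranks))
    print("random: ", offrank_columns(random_graph(n, 3), n, nranks))
```

   With the stencil, each rank needs only the one or two entries next to its 
block boundaries, while with scattered nonzeros each rank needs a large 
fraction of the entries owned by every other rank.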


> 
> Rohan
> 
> 
> On Sat, Dec 11, 2021 at 11:40 AM Barry Smith <[email protected]> wrote:
> 
>    Rohan,
> 
>      The flop rates for the sparse matrix-vector product are very low for an 
> IBM Power 9. This is probably, at least partially, because the code is 
> configured without any optimization flags. You should run ./configure with 
> additional options something like COPTFLAGS="-O3"  CXXOPTFLAGS="-O3"  
> FOPTFLAGS="-O3" but please consult the IBM documentation to determine exactly 
> what optimization flags to use for mpixlc and mpixlf.
> 
>     When running in parallel I would expect the "sweet spot" of optimal 
> performance to be roughly around 20 MPI ranks since the memory bandwidth of 
> the CPU will be saturated long before you reach 40 ranks. I would recommend 
> running with 1, 2, 3, 4, .... ranks to determine the optimal number of ranks. 
> Also please consult the documentation on the placement of the ranks into the 
> cores of the CPU; it is crucial to get this right and likely the default is 
> far from correct. Essentially you want each core used to be as far away from 
> the other cores being used as possible to maximize the achievable memory 
> bandwidth. So the first core should be on the first socket, the second core 
> on the second socket, the third core back on the first socket far from the 
> first core (that is it should not share L1 or L2 cache with the first core), 
> etc.
> 
>    The arabic-2005  matrix is not at all representative of the types of 
> matrices PETSc is designed to solve. It does not come from a PDE and does not 
> have the stencil structure of a matrix that comes from a PDE. PETSc's 
> performance on such a matrix will be much lower than its performance for PDE 
> matrices since PETSc is not designed for this type of matrix. Depending on 
> the goals of your work you may want to use different matrices that come from 
> PDEs.
> 
>   Regarding loading the matrix: yes, it is expected that one uses a custom 
> standalone utility to read in SuiteSparse-formatted matrices and convert 
> them to the PETSc binary format; we do have a couple of examples of how such 
> code can be written in src/mat/tutorials or tests.
> 
> 
>  Barry
> 
> 
>> On Dec 10, 2021, at 6:54 PM, Rohan Yadav <[email protected]> wrote:
>> 
>> Hi, I’m Rohan, a student working on compilation techniques for distributed 
>> tensor computations. I’m looking at using PETSc as a baseline for 
>> experiments I’m running, and want to understand if I’m using PETSc as it was 
>> intended to achieve high performance, and if the performance I’m seeing is 
>> expected. Currently, I’m just looking at SpMV operations.
>> 
>> My experiments are run on the Lassen Supercomputer 
>> (https://hpc.llnl.gov/hardware/platforms/lassen). The system has 40 CPUs, 4 
>> V100s and an Infiniband interconnect. A visualization of the architecture is 
>> here: 
>> https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png.
>> 
>> As of now, I’m trying to understand the single-node performance of PETSc, as 
>> the scaling performance onto multiple nodes appears to be as I expect. I’m 
>> using the arabic-2005 sparse matrix from the SuiteSparse matrix collection, 
>> detailed here: https://sparse.tamu.edu/LAW/arabic-2005. As a trusted 
>> baseline, I am 
>> comparing against SpMV code generated by the TACO compiler 
>> (http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races).
>> 
>> My experiments find that PETSc is roughly 4 times slower on a single thread 
>> and node than the kernel generated by TACO:
>> 
>> PETSc: 1 Thread: 5694.72 ms, 1 Node 40 threads: 262.6 ms.
>> TACO: 1 Thread: 1341 ms, 1 Node 40 threads: 86 ms.
>> 
>> My code using PETSc is here: 
>> https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38.
>> 
>> Runs from 1 thread and 1 node with -log_view are attached to the email. The 
>> command lines for each were as follows:
>> 
>> 1 node 1 thread:
>> `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>> 1 node 40 threads:
>> `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark -n 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>> 
>> 
>> In addition to these benchmarking concerns, I wanted to share my experiences 
>> trying to load data from Matrix Market files into PETSc, which ended up 
>> being much more difficult than I anticipated. Essentially, trying to 
>> iterate through the Matrix Market files and using `write` to insert entries 
>> into a `Mat` was extremely slow. In order to get reasonable performance, I 
>> had to use an external utility to basically construct a CSR matrix, and then 
>> pass the arrays from the CSR Matrix into `MatCreateSeqAIJWithArrays`. I 
>> couldn’t find any more guidance on PETSc forums or Google, so I wanted to 
>> know if this was the right way to go.
>> 
>> Thanks,
>> 
>> Rohan Yadav
>> <petsc-1-node-1-thread.txt><petsc-1-node-40-threads.txt>
>
