> On Dec 11, 2021, at 11:52 AM, Rohan Yadav <[email protected]> wrote:
>
> Thanks Barry!
>
> > The flop rates for the sparse matrix-vector product are very low for
> > an IBM Power 9. This is probably, at least partially, because the code is
> > configured without any optimization flags. You should run ./configure with
> > additional options something like COPTFLAGS="-O3" CXXOPTFLAGS="-O3"
> > FOPTFLAGS="-O3" but please consult the IBM documentation to determine
> > exactly what optimization flags to use for mpixlc and mpixlf.
>
> This is a great catch! I was using the pre-built petsc provided on Lassen, so
> I'm very surprised that it wasn't built with optimizations. I'll try building
> with optimizations enabled and see what the performance is.
>
> > When running in parallel I would expect the "sweet spot" of optimal
> > performance to be roughly around 20 MPI ranks since the memory bandwidth of
> > the CPU will be saturated long before you reach 40 ranks. I would recommend
> > running with 1, 2, 3, 4, .... ranks to determine the optimal number of
> > ranks. Also please consult the documentation on the placement of the ranks
> > into the cores of the CPU; it is crucial to get this right and likely the
> > default is far from correct. Essentially you want each core used to be as
> > far away from the other cores being used as possible to maximize the
> > achievable memory bandwidth. So the first core should be on the first
> > socket, the second core on the second socket, the third core back on the
> > first socket far from the first core (that is it should not share L1 or L2
> > cache with the first core), etc.
>
> I did a sweep of rank counts already and found that 40 is the best performing
> on this system.

It may be different with the optimization turned on. I am surprised that it is
40; usually it is lower.

> > The arabic-2005 matrix is not at all representative of the types of
> > matrices PETSc is designed to solve. It does not come from a PDE and does
> > not have the stencil structure of a matrix that comes from a PDE. PETSc's
> > performance on such a matrix will be much lower than its performance for
> > PDE matrices since PETSc is not designed for this type of matrix. Depending
> > on the goals of your work you may want to use different matrices that come
> > from PDEs.
>
> I'm probably not using PETSc for solvers right now, but more so for
> distributed sparse linear algebra operations. Is the matrix structure going
> to affect PETSc's performance that much for these kinds of operations?

Yes, especially with multiple MPI ranks. The reason is that for
arabic-2005-like graphs, the PETSc parallel CSR format, split by rows across
MPI ranks, is not a good layout of the data; it induces a lot of communication.

>
> Rohan
>
>
> On Sat, Dec 11, 2021 at 11:40 AM Barry Smith <[email protected]> wrote:
>
> Rohan,
>
> The flop rates for the sparse matrix-vector product are very low for an
> IBM Power 9. This is probably, at least partially, because the code is
> configured without any optimization flags. You should run ./configure with
> additional options something like COPTFLAGS="-O3" CXXOPTFLAGS="-O3"
> FOPTFLAGS="-O3" but please consult the IBM documentation to determine exactly
> what optimization flags to use for mpixlc and mpixlf.
>
> When running in parallel I would expect the "sweet spot" of optimal
> performance to be roughly around 20 MPI ranks since the memory bandwidth of
> the CPU will be saturated long before you reach 40 ranks. I would recommend
> running with 1, 2, 3, 4, .... ranks to determine the optimal number of ranks.
> Also please consult the documentation on the placement of the ranks into the
> cores of the CPU; it is crucial to get this right and likely the default is
> far from correct. Essentially you want each core used to be as far away from
> the other cores being used as possible to maximize the achievable memory
> bandwidth. So the first core should be on the first socket, the second core
> on the second socket, the third core back on the first socket far from the
> first core (that is it should not share L1 or L2 cache with the first core),
> etc.
>
> The arabic-2005 matrix is not at all representative of the types of
> matrices PETSc is designed to solve. It does not come from a PDE and does not
> have the stencil structure of a matrix that comes from a PDE. PETSc's
> performance on such a matrix will be much lower than its performance for PDE
> matrices since PETSc is not designed for this type of matrix. Depending on
> the goals of your work you may want to use different matrices that come from
> PDEs.
>
> Regarding loading the matrix: yes, it is expected that one uses a custom
> stand-alone utility to read in SuiteSparse formatted matrices and convert
> them to the PETSc binary format; we do have a couple of examples of how such
> code can be written in src/mat/tutorials or tests.
>
>
> Barry
>
>
>> On Dec 10, 2021, at 6:54 PM, Rohan Yadav <[email protected]> wrote:
>>
>> Hi, I’m Rohan, a student working on compilation techniques for distributed
>> tensor computations. I’m looking at using PETSc as a baseline for
>> experiments I’m running, and want to understand if I’m using PETSc as it was
>> intended to achieve high performance, and if the performance I’m seeing is
>> expected. Currently, I’m just looking at SpMV operations.
>>
>> My experiments are run on the Lassen Supercomputer
>> (<https://hpc.llnl.gov/hardware/platforms/lassen>).
>> The system has 40 CPUs, 4 V100s and an Infiniband interconnect. A
>> visualization of the architecture is here:
>> <https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png>.
>>
>> As of now, I’m trying to understand the single-node performance of PETSc, as
>> the scaling performance onto multiple nodes appears to be as I expect. I’m
>> using the arabic-2005 sparse matrix from the SuiteSparse matrix collection,
>> detailed here: <https://sparse.tamu.edu/LAW/arabic-2005>. As a trusted
>> baseline, I am comparing against SpMV code generated by the TACO compiler:
>> <http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races>.
>>
>> My experiments find that PETSc is roughly 4 times slower on a single thread
>> and node than the kernel generated by TACO:
>>
>> PETSc: 1 Thread: 5694.72 ms, 1 Node 40 threads: 262.6 ms.
>> TACO: 1 Thread: 1341 ms, 1 Node 40 threads: 86 ms.
>>
>> My code using PETSc is here:
>> <https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38>.
>>
>> Runs from 1 thread and 1 node with -log_view are attached to the email.
>> The command lines for each were as follows:
>>
>> 1 node 1 thread: `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>> 1 node 40 threads: `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark -n 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>
>>
>> In addition to these benchmarking concerns, I wanted to share my experiences
>> trying to load data from Matrix Market files into PETSc, which ended up
>> being much more difficult than I anticipated. Essentially, trying to
>> iterate through the Matrix Market files and using `write` to insert entries
>> into a `Mat` was extremely slow. In order to get reasonable performance, I
>> had to use an external utility to basically construct a CSR matrix, and then
>> pass the arrays from the CSR matrix into `MatCreateSeqAIJWithArrays`. I
>> couldn’t find any more guidance on PETSc forums or Google, so I wanted to
>> know if this was the right way to go.
>>
>> Thanks,
>>
>> Rohan Yadav
>> <petsc-1-node-1-thread.txt><petsc-1-node-40-threads.txt>
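
For the Matrix Market loading question in the original message, the "external utility" step of building CSR arrays from coordinate triples might look roughly like the sketch below. It is plain Python for illustration only, not PETSc code: the function name `coo_to_csr` and its arguments are hypothetical, the indices are assumed to have already been shifted from Matrix Market's 1-based convention to 0-based, and duplicate entries are assumed to be absent. The three returned arrays have the shape of the CSR arrays the message describes passing to `MatCreateSeqAIJWithArrays`.

```python
def coo_to_csr(n_rows, rows, cols, vals):
    """Convert 0-based COO triples into CSR arrays (row_ptr, col_idx, values).

    Hypothetical helper for illustration; assumes no duplicate (row, col) pairs.
    """
    nnz = len(vals)

    # Pass 1: count the entries in each row.
    row_ptr = [0] * (n_rows + 1)
    for r in rows:
        row_ptr[r + 1] += 1

    # Prefix-sum the counts so row_ptr[i] is the offset where row i starts.
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]

    # Pass 2: scatter every entry into the next free slot of its row.
    col_idx = [0] * nnz
    values = [0.0] * nnz
    next_slot = row_ptr[:-1].copy()
    for r, c, v in zip(rows, cols, vals):
        k = next_slot[r]
        col_idx[k] = c
        values[k] = v
        next_slot[r] += 1

    # Sort the column indices within each row, carrying the values along.
    for i in range(n_rows):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        order = sorted(range(lo, hi), key=col_idx.__getitem__)
        col_idx[lo:hi] = [col_idx[k] for k in order]
        values[lo:hi] = [values[k] for k in order]

    return row_ptr, col_idx, values


# Tiny 3x3 example with unordered input triples.
row_ptr, col_idx, values = coo_to_csr(
    3, rows=[2, 0, 1, 0], cols=[1, 2, 0, 0], vals=[5.0, 2.0, 3.0, 1.0]
)
print(row_ptr, col_idx, values)
```

The two-pass structure (count, then scatter) sizes every row up front, so no entry is ever inserted into an already assembled structure.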

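
Barry's remark that a row-wise split of an arabic-2005-like graph induces a lot of communication can be illustrated with a rough count. The sketch below is plain Python, not PETSc: it assumes a square matrix in CSR form, a contiguous near-equal split of rows across ranks, and an input vector split the same way, and both helper names are hypothetical. For each rank it counts how many distinct vector entries owned by other ranks its rows reference, which is only a proxy for what that rank would need to receive during SpMV.

```python
def remote_entries_per_rank(row_ptr, col_idx, n_ranks):
    """Count distinct off-rank vector entries each rank's rows reference.

    Assumes a square matrix and a contiguous row split (an illustrative
    choice, not necessarily the exact split a library would pick).
    """
    n = len(row_ptr) - 1
    # Rank p owns rows (and vector entries) in [starts[p], starts[p + 1]).
    starts = [p * n // n_ranks for p in range(n_ranks + 1)]
    counts = []
    for p in range(n_ranks):
        lo, hi = starts[p], starts[p + 1]
        remote = {
            col_idx[k]
            for k in range(row_ptr[lo], row_ptr[hi])
            if not lo <= col_idx[k] < hi
        }
        counts.append(len(remote))
    return counts


def csr_pattern(n, columns_of_row):
    """Build the CSR sparsity pattern of an n x n matrix from a per-row rule."""
    row_ptr, col_idx = [0], []
    for i in range(n):
        col_idx.extend(c for c in columns_of_row(i) if 0 <= c < n)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


n = 8
# PDE-like stencil: each row touches only its immediate neighbours.
stencil = csr_pattern(n, lambda i: (i - 1, i, i + 1))
# Graph-like pattern: each row also touches a far-away column.
scattered = csr_pattern(n, lambda i: (i, (i + 4) % n))

print(remote_entries_per_rank(*stencil, n_ranks=2))    # [1, 1]
print(remote_entries_per_rank(*scattered, n_ranks=2))  # [4, 4]
```

In the stencil case each rank needs a single boundary entry from its neighbour, while in the scattered case each rank needs every entry the other rank owns, even though the scattered pattern has fewer nonzeros.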