Sorry, what’s surprising about this? 40 MPI ranks on a single node should give similar performance to 40 threads. Both PETSc and TACO use a row-based parallelism strategy, so the results should line up.
Rohan Yadav

> On Dec 11, 2021, at 6:44 PM, Junchao Zhang <[email protected]> wrote:
>
>> On Sat, Dec 11, 2021 at 5:09 PM Rohan Yadav <[email protected]> wrote:
>> > Did you mean with 1 rank or 40 mpi ranks, petsc's performance is close to
>> > 1 thread or 40 threads of TACO?
>>
>> The 1 rank time is the same as taco 1 thread, and the 40 rank time is the
>> same as taco 40 threads.
> Interesting. TACO is supposed to give an optimized SpMV.
>
>>
>> Rohan
>>
>>> On Sat, Dec 11, 2021 at 6:07 PM Junchao Zhang <[email protected]>
>>> wrote:
>>>
>>>> On Sat, Dec 11, 2021, 4:22 PM Rohan Yadav <[email protected]> wrote:
>>>> Thanks all for the help, the main problem was the lack of optimization
>>>> flags in the default build provided by my system. A manual installation
>>>> with optimization flags delivers performance equal to the single node
>>>> benchmark I discussed before.
>>>
>>> Did you mean with 1 rank or 40 mpi ranks, petsc's performance is close to 1
>>> thread or 40 threads of TACO?
>>>>
>>>> Rohan
>>>>
>>>>> On Sat, Dec 11, 2021 at 4:04 PM Rohan Yadav <[email protected]> wrote:
>>>>> > The matrix market file in text format is not good for load. One should
>>>>> > convert it to petsc binary format (only once), and use the new binary
>>>>> > file afterwards.
>>>>>
>>>>> Yes, I understand this. The point I'm trying to make is that using PETSc
>>>>> to even perform the initial conversion from matrix market to the binary
>>>>> format was prohibitively slow using `MatSetValues`.
>>>>>
>>>>> > I meant 10 lines of code without any function call, which can be
>>>>> > thought of as a textbook implementation of SpMV. As a baseline, one can
>>>>> > apply optimizations to it. PETSc does not do sophisticated sparse
>>>>> > matrix optimization itself, instead it relies on third-party libraries.
>>>>> > I remember we had OSKI from Berkeley for CPU, and on GPU we use
>>>>> > cuSparse, hipSparse, MKLSparse or Kokkos-Kernels.
>>>>> > If TACO is good, then petsc can add an interface to it too.
>>>>>
>>>>> Yes, this is what I expected. Given that PETSc uses high-performance
>>>>> kernels for the sparse matrix operation itself, I was surprised to
>>>>> see that the single-thread performance of PETSc was closer to a
>>>>> baseline like TACO. This performance will likely improve when I compile
>>>>> PETSc with optimization flags.
>>>>>
>>>>> Rohan
>>>>>
>>>>>> On Sat, Dec 11, 2021 at 1:04 PM Junchao Zhang <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> On Sat, Dec 11, 2021 at 10:28 AM Rohan Yadav <[email protected]>
>>>>>>> wrote:
>>>>>>> Hi Junchao,
>>>>>>>
>>>>>>> Thanks for the response!
>>>>>>>
>>>>>>> > You can use https://petsc.org/main/src/mat/tests/ex72.c.html to
>>>>>>> > convert a Matrix Market file into a petsc binary file. And then in
>>>>>>> > your test, load the binary matrix, following this example
>>>>>>> > https://petsc.org/main/src/mat/tutorials/ex1.c.html
>>>>>>>
>>>>>>> I tried an example like this, but the performance was too slow (it
>>>>>>> would process ~2000-3000 calls to `SetValue` a second), which is not
>>>>>>> reasonable for loading matrices with millions of non-zeros.
>>>>>> The matrix market file in text format is not good for load. One should
>>>>>> convert it to petsc binary format (only once), and use the new binary
>>>>>> file afterwards.
>>>>>>
>>>>>>>
>>>>>>> > I don't know what "No Races" means, but it seems you'd better also
>>>>>>> > verify the result of SpMV.
>>>>>>>
>>>>>>> This is a correct implementation of SpMV. The no-races is fine as it
>>>>>>> parallelizes over the rows of the matrix, and thus does not need
>>>>>>> synchronization between writes to the output.
>>>>>>>
>>>>>>> > You can think petsc's default CSR spmv is the baseline, which is
>>>>>>> > done in ~10 lines of code.
>>>>>>>
>>>>>>> I'm sorry, but I don't think that is a reasonable statement w.r.t.
>>>>>>> the lines of code making it a good baseline.
>>>>>>> The TACO compiler also can
>>>>>>> be used in 10 lines of code to compute an SpMV, or any other
>>>>>>> state-of-the-art library could wrap an SpMV implementation behind a
>>>>>>> single function call. I'm wondering if this performance I'm seeing
>>>>>>> using PETSc is expected, or if I've misconfigured or am misusing the
>>>>>>> system in some way.
>>>>>> I meant 10 lines of code without any function call, which can be thought
>>>>>> of as a textbook implementation of SpMV. As a baseline, one can apply
>>>>>> optimizations to it. PETSc does not do sophisticated sparse matrix
>>>>>> optimization itself, instead it relies on third-party libraries. I
>>>>>> remember we had OSKI from Berkeley for CPU, and on GPU we use cuSparse,
>>>>>> hipSparse, MKLSparse or Kokkos-Kernels. If TACO is good, then petsc can
>>>>>> add an interface to it too.
>>>>>>
>>>>>>> Rohan
>>>>>>>
>>>>>>>> On Fri, Dec 10, 2021 at 11:39 PM Junchao Zhang
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> On Fri, Dec 10, 2021 at 8:05 PM Rohan Yadav <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi, I’m Rohan, a student working on compilation techniques for
>>>>>>>>> distributed tensor computations. I’m looking at using PETSc as a
>>>>>>>>> baseline for experiments I’m running, and want to understand if I’m
>>>>>>>>> using PETSc as it was intended to achieve high performance, and if
>>>>>>>>> the performance I’m seeing is expected. Currently, I’m just looking
>>>>>>>>> at SpMV operations.
>>>>>>>>>
>>>>>>>>> My experiments are run on the Lassen Supercomputer
>>>>>>>>> (https://hpc.llnl.gov/hardware/platforms/lassen). The system has 40
>>>>>>>>> CPUs, 4 V100s and an Infiniband interconnect. A visualization of the
>>>>>>>>> architecture is here:
>>>>>>>>> https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png.
>>>>>>>>>
>>>>>>>>> As of now, I’m trying to understand the single-node performance of
>>>>>>>>> PETSc, as the scaling performance onto multiple nodes appears to be
>>>>>>>>> as I expect. I’m using the arabic-2005 sparse matrix from the
>>>>>>>>> SuiteSparse matrix collection, detailed here:
>>>>>>>>> https://sparse.tamu.edu/LAW/arabic-2005. As a trusted baseline, I am
>>>>>>>>> comparing against SpMV code generated by the TACO compiler
>>>>>>>>> (http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races).
>>>>>>>> I don't know what "No Races" means, but it seems you'd better also
>>>>>>>> verify the result of SpMV.
>>>>>>>>>
>>>>>>>>> My experiments find that PETSc is roughly 4 times slower on a single
>>>>>>>>> thread and node than the kernel generated by TACO:
>>>>>>>>>
>>>>>>>>> PETSc: 1 Thread: 5694.72 ms, 1 Node 40 threads: 262.6 ms.
>>>>>>>>> TACO: 1 Thread: 1341 ms, 1 Node 40 threads: 86 ms.
>>>>>>>> You can think petsc's default CSR spmv is the baseline, which is done
>>>>>>>> in ~10 lines of code.
>>>>>>>>>
>>>>>>>>> My code using PETSc is here:
>>>>>>>>> https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38.
>>>>>>>>>
>>>>>>>>> Runs from 1 thread and 1 node with -log_view are attached to the
>>>>>>>>> email.
>>>>>>>>> The command lines for each were as follows:
>>>>>>>>>
>>>>>>>>> 1 node 1 thread: `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20
>>>>>>>>> -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>>>>>>> 1 node 40 threads: `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark -n
>>>>>>>>> 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>>>>>>>
>>>>>>>>> In addition to these benchmarking concerns, I wanted to share my
>>>>>>>>> experiences trying to load data from Matrix Market files into PETSc,
>>>>>>>>> which ended up being much more difficult than I anticipated.
>>>>>>>>> Essentially, trying to iterate through the Matrix Market files and
>>>>>>>>> using `write` to insert entries into a `Mat` was extremely slow. In
>>>>>>>>> order to get reasonable performance, I had to use an external utility
>>>>>>>>> to basically construct a CSR matrix, and then pass the arrays from
>>>>>>>>> the CSR Matrix into `MatCreateSeqAIJWithArrays`. I couldn’t find any
>>>>>>>>> more guidance on PETSc forums or Google, so I wanted to know if this
>>>>>>>>> was the right way to go.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Rohan Yadav
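
For readers following the thread, here is a minimal sketch of the "textbook" CSR SpMV that both sides refer to, together with the coordinate-to-CSR step that an external utility would perform before handing arrays to something like `MatCreateSeqAIJWithArrays`. It is plain Python written for illustration only. The function names (`coo_to_csr`, `spmv_rows`, `spmv`) are hypothetical and are not taken from PETSc, TACO, or the linked benchmark. The loop over row chunks is serial and only stands in for threads or MPI ranks: each chunk writes a disjoint slice of `y`, which is why a row-based split needs no synchronization on the output.

```python
def coo_to_csr(n_rows, entries):
    """Build CSR arrays from (row, col, value) triples with 0-based indices.

    Assumes no duplicate (row, col) pairs; duplicates are kept, not summed.
    """
    # Count entries per row, then prefix-sum into row offsets.
    row_ptr = [0] * (n_rows + 1)
    for r, _, _ in entries:
        row_ptr[r + 1] += 1
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]

    col_idx = [0] * len(entries)
    vals = [0.0] * len(entries)
    next_slot = row_ptr[:-1]  # slicing copies: next free position in each row
    for r, c, v in entries:
        k = next_slot[r]
        col_idx[k] = c
        vals[k] = v
        next_slot[r] += 1
    return row_ptr, col_idx, vals


def spmv_rows(row_ptr, col_idx, vals, x, y, row_begin, row_end):
    """Compute y[i] = sum_j A[i, j] * x[j] for rows in [row_begin, row_end)."""
    for i in range(row_begin, row_end):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc  # only this row's output entry is written


def spmv(row_ptr, col_idx, vals, x, n_chunks=1):
    """Row-partitioned SpMV; the chunks stand in for threads or ranks."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    chunk = max(1, -(-n_rows // n_chunks))  # ceil(n_rows / n_chunks)
    for start in range(0, n_rows, chunk):
        spmv_rows(row_ptr, col_idx, vals, x, y, start, min(start + chunk, n_rows))
    return y


# 3x3 example: A = [[2, 0, 1], [0, 3, 0], [4, 0, 5]], entries in arbitrary order.
entries = [(2, 0, 4.0), (0, 0, 2.0), (1, 1, 3.0), (0, 2, 1.0), (2, 2, 5.0)]
row_ptr, col_idx, vals = coo_to_csr(3, entries)
print(spmv(row_ptr, col_idx, vals, [1.0, 2.0, 3.0], n_chunks=2))  # [5.0, 6.0, 19.0]
```

Changing `n_chunks` does not change the result, only how the rows are grouped, which matches the point made in the thread that 40 ranks and 40 threads with the same row-based split should perform similarly.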
