Re: [petsc-dev] Questions around benchmarking and data loading with PETSc

Junchao Zhang Sat, 11 Dec 2021 15:44:13 -0800

On Sat, Dec 11, 2021 at 5:09 PM Rohan Yadav <[email protected]> wrote:


> > Did you mean with 1 rank or 40 mpi ranks, petsc's performance is close
> to 1 thread or 40 threads of TACO?
>
> The 1 rank time is the same as taco 1 thread, and the 40 rank time is the
> same as taco 40 threads.
>
Interesting. TACO is supposed to give an optimized SpMV.


>
> Rohan
>
> On Sat, Dec 11, 2021 at 6:07 PM Junchao Zhang <[email protected]>
> wrote:
>
>>
>>
>> On Sat, Dec 11, 2021, 4:22 PM Rohan Yadav <[email protected]> wrote:
>>
>>> Thanks all for the help, the main problem was the lack of optimization
>>> flags in the default build provided by my system. A manual installation
>>> with optimization flags delivers performance equal to the single node
>>> benchmark I discussed before.
>>>
>> Did you mean with 1 rank or 40 mpi ranks, petsc's performance is close to
>> 1 thread or 40 threads of TACO?
>>
>>>
>>> Rohan
>>>
>>> On Sat, Dec 11, 2021 at 4:04 PM Rohan Yadav <[email protected]>
>>> wrote:
>>>
>>>> > The matrix market file in text format is not good for load.  One
>>>> should convert it to petsc binary format (only once), and use the new
>>>> binary file  afterwards.
>>>>
>>>> Yes, I understand this. The point I'm trying to make is that using
>>>> PETSc to even perform the initial conversion from matrix market to the
>>>> binary format was prohibitively slow using `MatSetValues`.
>>>>
>>>> > I meant 10 lines of code without any function call, which can be
>>>> thought of as a textbook implementation of SpMV. As a baseline, one can
>>>> apply optimizations to it.  PETSc does not do sophisticated sparse matrix
>>>> optimization itself, instead it relies on third-party libraries.  I
>>>> remember we had OSKI from Berkeley for CPU, and on GPU we use cuSparse,
>>>> hipSparse, MKLSparse or Kokkos-Kernels. If TACO is good, then petsc can add
>>>> an interface to it too.
>>>>
>>>> Yes, this is what I expected. Given that PETSc uses high-performance
>>>> kernels for the sparse matrix operation itself, I was surprised to see
>>>> the single-thread performance of PETSc to be closer to a baseline like
>>>> TACO. This performance will likely improve when I compile PETSc with
>>>> optimization flags.
>>>>
>>>> Rohan
>>>>
>>>> On Sat, Dec 11, 2021 at 1:04 PM Junchao Zhang <[email protected]>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Dec 11, 2021 at 10:28 AM Rohan Yadav <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Junchao,
>>>>>>
>>>>>> Thanks for the response!
>>>>>>
>>>>>> > You can use https://petsc.org/main/src/mat/tests/ex72.c.html to
>>>>>> convert a Matrix Market file into a petsc binary file. And then in
>>>>>> your test, load the binary matrix, following this example
>>>>>> https://petsc.org/main/src/mat/tutorials/ex1.c.html
>>>>>>
>>>>>> I tried an example like this, but the performance was too slow (it
>>>>>> would process ~2000-3000 calls to `SetValue` a second), which is not
>>>>>> reasonable for loading matrices with millions of non-zeros.
>>>>>>
>>>>> The matrix market file in text format is not good for load.  One
>>>>> should convert it to petsc binary format (only once), and use the new
>>>>> binary file  afterwards.
>>>>>
>>>>>
>>>>>>
>>>>>> > I don't know what "No Races" means, but it seems you'd better also
>>>>>> verify the result of SpMV.
>>>>>>
>>>>>> This is a correct implementation of SpMV. The no-races is fine as it
>>>>>> parallelizes over the rows of the matrix, and thus does not need
>>>>>> synchronization between writes to the output.
>>>>>>
>>>>>> > You can think petsc's default CSR spmv is the baseline,  which is
>>>>>> done in ~10 lines of code.
>>>>>>
>>>>>> I'm sorry, but I don't think that is a reasonable statement w.r.t. the
>>>>>> lines of code making it a good baseline. The TACO compiler also can be
>>>>>> used in 10 lines of code to compute an SpMV, or any other state-of-the-art
>>>>>> library could wrap an SpMV implementation behind a single function call.
>>>>>> I'm wondering if this performance I'm seeing using PETSc is expected, or if
>>>>>> I've misconfigured or am misusing the system in some way.
>>>>>>
>>>>> I meant 10 lines of code without any function call, which can be
>>>>> thought of as a textbook implementation of SpMV. As a baseline, one can
>>>>> apply optimizations to it.  PETSc does not do sophisticated sparse matrix
>>>>> optimization itself, instead it relies on third-party libraries.  I
>>>>> remember we had OSKI from Berkeley for CPU, and on GPU we use cuSparse,
>>>>> hipSparse, MKLSparse or Kokkos-Kernels. If TACO is good, then petsc can add
>>>>> an interface to it too.
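
For readers following the thread, the "textbook" CSR SpMV being described can be sketched roughly as below. This is only an illustration under assumed names (`csr_spmv`, `row_ptr`, `col_idx`, `val` are hypothetical and not taken from the PETSc source), using plain `size_t` and `double` rather than PETSc's own index and scalar types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not PETSc code: y = A * x for an m-row matrix A
 * stored in CSR form. row_ptr has m + 1 entries; row i owns the nonzeros
 * at positions [row_ptr[i], row_ptr[i + 1]) of col_idx and val. */
void csr_spmv(size_t m, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    assert(row_ptr && col_idx && val && x && y);
    for (size_t i = 0; i < m; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum; /* the only write made by iteration i */
    }
}
```

Because iteration `i` writes only `y[i]`, the outer loop over rows can in principle be split across threads without synchronizing the output, which is the property the "No Races" discussion earlier in the thread relies on.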
>>>>>
>>>>>
>>>>>> Rohan
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 10, 2021 at 11:39 PM Junchao Zhang <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> On Fri, Dec 10, 2021 at 8:05 PM Rohan Yadav <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi, I’m Rohan, a student working on compilation techniques for
>>>>>>>> distributed tensor computations. I’m looking at using PETSc as a baseline
>>>>>>>> for experiments I’m running, and want to understand if I’m using PETSc as
>>>>>>>> it was intended to achieve high performance, and if the performance I’m
>>>>>>>> seeing is expected. Currently, I’m just looking at SpMV operations.
>>>>>>>>
>>>>>>>>
>>>>>>>> My experiments are run on the Lassen Supercomputer (
>>>>>>>> https://hpc.llnl.gov/hardware/platforms/lassen). The system has 40
>>>>>>>> CPUs, 4 V100s and an Infiniband interconnect. A visualization of the
>>>>>>>> architecture is here:
>>>>>>>> https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png
>>>>>>>> .
>>>>>>>>
>>>>>>>>
>>>>>>>> As of now, I’m trying to understand the single-node performance of
>>>>>>>> PETSc, as the scaling performance onto multiple nodes appears to be as I
>>>>>>>> expect. I’m using the arabic-2005 sparse matrix from the SuiteSparse matrix
>>>>>>>> collection, detailed here: https://sparse.tamu.edu/LAW/arabic-2005.
>>>>>>>> As a trusted baseline, I am comparing against SpMV code generated by the
>>>>>>>> TACO compiler (
>>>>>>>> http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races)
>>>>>>>> .
>>>>>>>>
>>>>>>> I don't know what "No Races" means, but it seems you'd better also
>>>>>>> verify the result of SpMV.
>>>>>>>
>>>>>>>>
>>>>>>>> My experiments find that PETSc is roughly 4 times slower on a
>>>>>>>> single thread and node than the kernel generated by TACO:
>>>>>>>>
>>>>>>>>
>>>>>>>> PETSc: 1 Thread: 5694.72 ms, 1 Node 40 threads: 262.6 ms.
>>>>>>>>
>>>>>>>> TACO: 1 Thread: 1341 ms, 1 Node 40 threads: 86 ms.
>>>>>>>>
>>>>>>> You can think petsc's default CSR spmv is the baseline,  which is
>>>>>>> done in ~10 lines of code.
>>>>>>>
>>>>>>>>
>>>>>>>> My code using PETSc is here:
>>>>>>>> https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38
>>>>>>>> .
>>>>>>>>
>>>>>>>>
>>>>>>>> Runs from 1 thread and 1 node with -log_view are attached to the
>>>>>>>> email. The command lines for each were as follows:
>>>>>>>>
>>>>>>>>
>>>>>>>> 1 node 1 thread: `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20
>>>>>>>> -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>>>>>>
>>>>>>>> 1 node 40 threads: `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark -n
>>>>>>>> 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> In addition to these benchmarking concerns, I wanted to share my
>>>>>>>> experiences trying to load data from Matrix Market files into PETSc, which
>>>>>>>> ended up being much more difficult than I anticipated. Essentially, trying
>>>>>>>> to iterate through the Matrix Market files and using `write` to insert
>>>>>>>> entries into a `Mat` was extremely slow. In order to get reasonable
>>>>>>>> performance, I had to use an external utility to basically construct a CSR
>>>>>>>> matrix, and then pass the arrays from the CSR Matrix into
>>>>>>>> `MatCreateSeqAIJWithArrays`. I couldn’t find any more guidance on PETSc
>>>>>>>> forums or Google, so I wanted to know if this was the right way to go.
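
The external utility is not shown in the thread, so the following is only a sketch of what the "construct a CSR matrix" step might look like. It assumes the Matrix Market entries have already been parsed into 0-based coordinate (COO) triplets. The function name `coo_to_csr` and all parameter names are hypothetical. A real PETSc caller would also need `PetscInt` and `PetscScalar` arrays rather than `size_t` and `double` before handing them to `MatCreateSeqAIJWithArrays`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: convert nnz COO triplets (0-based) into CSR arrays.
 * row_ptr must have nrows + 1 entries; col_idx and val must have nnz entries.
 * Returns 0 on success, -1 if the scratch allocation fails. */
int coo_to_csr(size_t nrows, size_t nnz,
               const size_t *coo_row, const size_t *coo_col,
               const double *coo_val,
               size_t *row_ptr, size_t *col_idx, double *val)
{
    /* Pass 1: count the nonzeros in each row. */
    for (size_t i = 0; i <= nrows; i++) row_ptr[i] = 0;
    for (size_t k = 0; k < nnz; k++) {
        assert(coo_row[k] < nrows);
        row_ptr[coo_row[k] + 1]++;
    }
    /* A prefix sum turns the counts into row start offsets. */
    for (size_t i = 0; i < nrows; i++) row_ptr[i + 1] += row_ptr[i];

    /* Pass 2: scatter each entry into the next free slot of its row. */
    size_t *next = malloc((nrows ? nrows : 1) * sizeof *next);
    if (!next) return -1;
    for (size_t i = 0; i < nrows; i++) next[i] = row_ptr[i];
    for (size_t k = 0; k < nnz; k++) {
        size_t dst = next[coo_row[k]]++;
        col_idx[dst] = coo_col[k];
        val[dst] = coo_val[k];
    }
    free(next);
    return 0;
}
```

This two-pass approach touches each entry a constant number of times, so it avoids per-entry insertion cost. Note that it keeps entries within a row in input order and does not merge duplicates; a consumer that expects sorted column indices would need an extra per-row sort.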
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>>
>>>>>>>> Rohan Yadav
>>>>>>>>
>>>>>>>
