Re: [petsc-dev] Questions around benchmarking and data loading with PETSc

Junchao Zhang Sat, 11 Dec 2021 17:25:27 -0800

I expected TACO to be faster, since its website says "It uses novel compiler
techniques to get performance competitive with hand-optimized kernels"


--Junchao Zhang


On Sat, Dec 11, 2021 at 5:56 PM Rohan Yadav <[email protected]> wrote:

> Sorry, what’s surprising about this? 40 MPI ranks on a single node should
> perform similarly to 40 threads. Both petsc and taco use a row-based
> parallelism strategy, so the timings should line up.
>
> Rohan Yadav
>
> On Dec 11, 2021, at 6:44 PM, Junchao Zhang <[email protected]>
> wrote:
>
> 
>
> On Sat, Dec 11, 2021 at 5:09 PM Rohan Yadav <[email protected]> wrote:
>
>> > Did you mean with 1 rank or 40 mpi ranks, petsc's performance is close
>> to 1 thread or 40 threads of TACO?
>>
>> The 1 rank time is the same as taco 1 thread, and the 40 rank time is the
>> same as taco 40 threads.
>>
> Interesting. TACO is supposed to give an optimized SpMV.
>
>
>>
>> Rohan
>>
>> On Sat, Dec 11, 2021 at 6:07 PM Junchao Zhang <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On Sat, Dec 11, 2021, 4:22 PM Rohan Yadav <[email protected]> wrote:
>>>
>>>> Thanks all for the help, the main problem was the lack of optimization
>>>> flags in the default build provided by my system. A manual installation
>>>> with optimization flags delivers performance equal to the single node
>>>> benchmark I discussed before.
>>>>
>>> Did you mean with 1 rank or 40 mpi ranks, petsc's performance is close
>>> to 1 thread or 40 threads of TACO?
>>>
>>>>
>>>> Rohan
>>>>
>>>> On Sat, Dec 11, 2021 at 4:04 PM Rohan Yadav <[email protected]>
>>>> wrote:
>>>>
>>>>> > The matrix market file in text format is not good for load.  One
>>>>> should convert it to petsc binary format (only once), and use the new
>>>>> binary file  afterwards.
>>>>>
>>>>> Yes, I understand this. The point I'm trying to make is that using
>>>>> PETSc to even perform the initial conversion from matrix market to the
>>>>> binary format was prohibitively slow using `MatSetValues`.
>>>>>
>>>>> > I meant 10 lines of code without any function call, which can be
>>>>> thought of as a textbook implementation of SpMV. As a baseline, one can
>>>>> apply optimizations to it.  PETSc does not do sophisticated sparse matrix
>>>>> optimization itself, instead it relies on third-party libraries.  I
>>>>> remember we had OSKI from Berkeley for CPU, and on GPU we use cuSparse,
>>>>> hipSparse, MKLSparse or Kokkos-Kernels. If TACO is good, then petsc can
>>>>> add an interface to it too.
>>>>>
>>>>> Yes, this is what I expected. Given that PETSc uses high-performance
>>>>> kernels for the sparse matrix operation itself, I was surprised that
>>>>> the single-thread performance of PETSc was not closer to a baseline
>>>>> like TACO. This performance will likely improve when I compile PETSc
>>>>> with optimization flags.
>>>>>
>>>>> Rohan
>>>>>
>>>>> On Sat, Dec 11, 2021 at 1:04 PM Junchao Zhang <[email protected]>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Dec 11, 2021 at 10:28 AM Rohan Yadav <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Junchao,
>>>>>>>
>>>>>>> Thanks for the response!
>>>>>>>
>>>>>>> > You can use https://petsc.org/main/src/mat/tests/ex72.c.html to
>>>>>>> convert a Matrix Market file into a petsc binary file. And then in
>>>>>>> your test, load the binary matrix, following this example
>>>>>>> https://petsc.org/main/src/mat/tutorials/ex1.c.html
>>>>>>>
>>>>>>> I tried an example like this, but the performance was too slow (it
>>>>>>> would process ~2000-3000 calls to `SetValue` a second), which is not
>>>>>>> reasonable for loading matrices with millions of non-zeros.
>>>>>>>
>>>>>> The matrix market file in text format is not good for load.  One
>>>>>> should convert it to petsc binary format (only once), and use the new
>>>>>> binary file  afterwards.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> > I don't know what "No Races" means, but it seems you'd better also
>>>>>>> verify the result of SpMV.
>>>>>>>
>>>>>>> This is a correct implementation of SpMV. The no-races is fine as it
>>>>>>> parallelizes over the rows of the matrix, and thus does not need
>>>>>>> synchronization between writes to the output.
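To make the "No Races" argument concrete, here is a minimal sketch in C with OpenMP, assuming the matrix is stored in CSR form. The function and parameter names are hypothetical, and this is not the code TACO emits; the chunk size of 32 only loosely mirrors the `split:i:i0:i1:32` step in the TACO schedule. Each iteration of the outer loop owns exactly one entry of `y`, so no two threads ever write to the same location.

```c
#include <assert.h>

/* Row-parallel CSR SpMV (y = A * x). Rows are handed to threads in
 * chunks of 32. Without OpenMP enabled, the pragma is ignored and the
 * loop simply runs serially. */
static void csr_spmv_rows(int m, const int *row_ptr, const int *col_idx,
                          const double *val, const double *x, double *y)
{
    assert(m >= 0);
    #pragma omp parallel for schedule(static, 32)
    for (int i = 0; i < m; i++) {
        double sum = 0.0;                      /* local to this iteration */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];     /* x is only read */
        y[i] = sum;                            /* only row i writes y[i] */
    }
}
```

A column-parallel split would be different: several threads could then accumulate into the same `y[i]` and would need atomics or a reduction.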
>>>>>>>
>>>>>>> > You can think petsc's default CSR spmv is the baseline,  which is
>>>>>>> done in ~10 lines of code.
>>>>>>>
>>>>>>> I'm sorry, but I don't think that is a reasonable statement w.r.t.
>>>>>>> the lines of code making it a good baseline. The TACO compiler can
>>>>>>> also be used in 10 lines of code to compute an SpMV, or any other
>>>>>>> state-of-the-art library could wrap an SpMV implementation behind a
>>>>>>> single function call. I'm wondering if this performance I'm seeing
>>>>>>> using PETSc is expected, or if I've misconfigured or am misusing the
>>>>>>> system in some way.
>>>>>>>
>>>>>> I meant 10 lines of code without any function call, which can be
>>>>>> thought of as a textbook implementation of SpMV. As a baseline, one can
>>>>>> apply optimizations to it.  PETSc does not do sophisticated sparse matrix
>>>>>> optimization itself, instead it relies on third-party libraries.  I
>>>>>> remember we had OSKI from Berkeley for CPU, and on GPU we use cuSparse,
>>>>>> hipSparse, MKLSparse or Kokkos-Kernels. If TACO is good, then petsc can
>>>>>> add an interface to it too.
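For reference, a "textbook" CSR SpMV of the kind described above could look roughly like the following. This is a sketch under stated assumptions, not PETSc's actual kernel: the name `csr_spmv` and the use of `size_t` indices are illustrative choices.

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for an m-row matrix A in CSR form.
 * row_ptr has m + 1 entries; col_idx and val list the nonzeros row by row. */
static void csr_spmv(size_t m, const size_t *row_ptr, const size_t *col_idx,
                     const double *val, const double *x, double *y)
{
    assert(row_ptr[0] == 0);   /* CSR convention: first row starts at 0 */
    for (size_t i = 0; i < m; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```

Blocking, vectorization, or format changes are then optimizations applied on top of this loop nest, whether by hand, by a third-party library, or by a compiler such as TACO.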
>>>>>>
>>>>>>
>>>>>>> Rohan
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Dec 10, 2021 at 11:39 PM Junchao Zhang <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> On Fri, Dec 10, 2021 at 8:05 PM Rohan Yadav <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi, I’m Rohan, a student working on compilation techniques for
>>>>>>>>> distributed tensor computations. I’m looking at using PETSc as a
>>>>>>>>> baseline for experiments I’m running, and want to understand if I’m
>>>>>>>>> using PETSc as it was intended to achieve high performance, and if
>>>>>>>>> the performance I’m seeing is expected. Currently, I’m just looking
>>>>>>>>> at SpMV operations.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> My experiments are run on the Lassen Supercomputer (
>>>>>>>>> https://hpc.llnl.gov/hardware/platforms/lassen). The system has
>>>>>>>>> 40 CPUs, 4 V100s and an Infiniband interconnect. A visualization of
>>>>>>>>> the architecture is here:
>>>>>>>>> https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> As of now, I’m trying to understand the single-node performance of
>>>>>>>>> PETSc, as the scaling performance onto multiple nodes appears to be
>>>>>>>>> as I expect. I’m using the arabic-2005 sparse matrix from the
>>>>>>>>> SuiteSparse matrix collection, detailed here:
>>>>>>>>> https://sparse.tamu.edu/LAW/arabic-2005.
>>>>>>>>> As a trusted baseline, I am comparing against SpMV code generated by
>>>>>>>>> the TACO compiler (
>>>>>>>>> http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races)
>>>>>>>>> .
>>>>>>>>>
>>>>>>>> I don't know what "No Races" means, but it seems you'd better also
>>>>>>>> verify the result of SpMV.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> My experiments find that PETSc is roughly 4 times slower on a
>>>>>>>>> single thread and node than the kernel generated by TACO:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> PETSc: 1 Thread: 5694.72 ms, 1 Node 40 threads: 262.6 ms.
>>>>>>>>>
>>>>>>>>> TACO: 1 Thread: 1341 ms, 1 Node 40 threads: 86 ms.
>>>>>>>>>
>>>>>>>> You can think petsc's default CSR spmv is the baseline,  which is
>>>>>>>> done in ~10 lines of code.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> My code using PETSc is here:
>>>>>>>>> https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Runs from 1 thread and 1 node with -log_view are attached to the
>>>>>>>>> email. The command lines for each were as follows:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 1 node 1 thread: `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20
>>>>>>>>> -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>>>>>>>
>>>>>>>>> 1 node 40 threads: `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark
>>>>>>>>> -n 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In addition to these benchmarking concerns, I wanted to share my
>>>>>>>>> experiences trying to load data from Matrix Market files into PETSc,
>>>>>>>>> which ended up being much more difficult than I anticipated.
>>>>>>>>> Essentially, trying to iterate through the Matrix Market files and
>>>>>>>>> using `write` to insert entries into a `Mat` was extremely slow. In
>>>>>>>>> order to get reasonable performance, I had to use an external
>>>>>>>>> utility to basically construct a CSR matrix, and then pass the
>>>>>>>>> arrays from the CSR matrix into `MatCreateSeqAIJWithArrays`. I
>>>>>>>>> couldn’t find any more guidance on PETSc forums or Google, so I
>>>>>>>>> wanted to know if this was the right way to go.
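The CSR-construction step described above can be sketched in plain C as a counting pass, a prefix sum, and a scatter. This is not the external utility actually used, which the thread does not name. The name `coo_to_csr` is hypothetical, indices are assumed to be 0-based already (Matrix Market files are 1-based, so a reader would subtract 1 first), and plain `int` is assumed to match the index type of the PETSc build.

```c
#include <assert.h>
#include <stdlib.h>

/* Convert nnz (row, col, value) triples into CSR arrays.
 * row_ptr needs room for m + 1 entries; col_idx and val for nnz entries.
 * Returns 0 on success, -1 if the scratch allocation fails. */
static int coo_to_csr(int m, int nnz, const int *rows, const int *cols,
                      const double *vals, int *row_ptr, int *col_idx,
                      double *val)
{
    /* Pass 1: count the entries in each row, stored one slot to the right. */
    for (int i = 0; i <= m; i++) row_ptr[i] = 0;
    for (int k = 0; k < nnz; k++) {
        assert(rows[k] >= 0 && rows[k] < m);
        row_ptr[rows[k] + 1]++;
    }
    /* Pass 2: prefix sum turns the counts into row start offsets. */
    for (int i = 0; i < m; i++) row_ptr[i + 1] += row_ptr[i];

    /* Pass 3: scatter each entry into the next free slot of its row. */
    int *next = malloc((size_t)(m > 0 ? m : 1) * sizeof *next);
    if (!next) return -1;
    for (int i = 0; i < m; i++) next[i] = row_ptr[i];
    for (int k = 0; k < nnz; k++) {
        int dst = next[rows[k]]++;
        col_idx[dst] = cols[k];
        val[dst] = vals[k];
    }
    free(next);
    return 0;
}
```

The resulting `row_ptr`, `col_idx`, and `val` arrays are the kind of input that `MatCreateSeqAIJWithArrays` takes. The scatter keeps entries within a row in input order, so if sorted column indices are required, either feed column-sorted input or sort each row afterwards.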
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Rohan Yadav
>>>>>>>>>
>>>>>>>>
