Sorry, what’s surprising about this? 40 MPI ranks on a single node should 
perform similarly to 40 threads. Both PETSc and TACO use a row-based 
parallelism strategy, so the results should line up.
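
To make the row-based strategy concrete, here is a minimal sketch of a textbook CSR SpMV in C. It is an illustration only, not code taken from PETSc or TACO; the function name `csr_spmv` and the array names are made up for this example, and the OpenMP pragma is an assumption about how one might thread it (it has no effect unless the compiler is given an OpenMP flag such as `-fopenmp`).

```c
#include <assert.h>

/* y = A * x for an m-row matrix A stored in CSR form.
 * row_ptr has m + 1 entries; col_idx and val hold the non-zeros row by row.
 * Hypothetical helper for illustration, not a PETSc or TACO routine. */
static void csr_spmv(int m, const int *row_ptr, const int *col_idx,
                     const double *val, const double *x, double *y)
{
    assert(row_ptr[0] == 0); /* CSR offsets are expected to start at zero */

    /* Each iteration writes only y[i], so rows can be processed in parallel
     * without any synchronization between writes to the output. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```

Splitting the rows across MPI ranks instead of threads gives each rank the same inner loop over its own block of rows, which is why the two approaches would be expected to perform similarly on one node.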

Rohan Yadav 

> On Dec 11, 2021, at 6:44 PM, Junchao Zhang <[email protected]> wrote:
> 
> 
> 
>> On Sat, Dec 11, 2021 at 5:09 PM Rohan Yadav <[email protected]> wrote:
>> > Did you mean with 1 rank or 40 mpi ranks, petsc's performance is close to 
>> > 1 thread or 40 threads of TACO?
>> 
>> The 1 rank time is the same as taco 1 thread, and the 40 rank time is the 
>> same as taco 40 threads.
> Interesting. TACO is supposed to give an optimized SpMV. 
>  
>> 
>> Rohan
>> 
>>> On Sat, Dec 11, 2021 at 6:07 PM Junchao Zhang <[email protected]> 
>>> wrote:
>>> 
>>> 
>>>> On Sat, Dec 11, 2021, 4:22 PM Rohan Yadav <[email protected]> wrote:
>>>> Thanks all for the help, the main problem was the lack of optimization 
>>>> flags in the default build provided by my system. A manual installation 
>>>> with optimization flags delivers performance equal to the single node 
>>>> benchmark I discussed before.
>>> 
>>> Did you mean with 1 rank or 40 mpi ranks, petsc's performance is close to 1 
>>> thread or 40 threads of TACO?
>>>> 
>>>> Rohan
>>>> 
>>>>> On Sat, Dec 11, 2021 at 4:04 PM Rohan Yadav <[email protected]> wrote:
>>>>> > The matrix market file in text format is not good for load.  One should 
>>>>> > convert it to petsc binary format (only once), and use the new binary 
>>>>> > file  afterwards. 
>>>>> 
>>>>> Yes, I understand this. The point I'm trying to make is that using PETSc 
>>>>> to even perform the initial conversion from matrix market to the binary 
>>>>> format was prohibitively slow using `MatSetValues`.
>>>>> 
>>>>> > I meant 10 lines of code without any function call, which can be 
>>>>> > thought of as a textbook implementation of SpMV. As a baseline, one can 
>>>>> > apply optimizations to it.  PETSc does not do sophisticated sparse 
>>>>> > matrix optimization itself, instead it relies on third-party libraries. 
>>>>> >  I remember we had OSKI from Berkeley for CPU, and on GPU we use 
>>>>> > cuSparse, hipSparse, MKLSparse or Kokkos-Kernels. If TACO is good, then 
>>>>> > petsc can add an interface to it too.
>>>>> 
>>>>> Yes, this is what I expected. Given that PETSc uses high-performance 
>>>>> kernels for the sparse matrix operation itself, I was surprised to 
>>>>> see that the single-thread performance of PETSc was closer to a 
>>>>> baseline like TACO. This performance will likely improve when I compile 
>>>>> PETSc with optimization flags.
>>>>> 
>>>>> Rohan
>>>>> 
>>>>>> On Sat, Dec 11, 2021 at 1:04 PM Junchao Zhang <[email protected]> 
>>>>>> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Sat, Dec 11, 2021 at 10:28 AM Rohan Yadav <[email protected]> 
>>>>>>> wrote:
>>>>>>> Hi Junchao,
>>>>>>> 
>>>>>>> Thanks for the response!
>>>>>>> 
>>>>>>> > You can use https://petsc.org/main/src/mat/tests/ex72.c.html to 
>>>>>>> > convert a Matrix Market file into a petsc binary file. And then in 
>>>>>>> > your test, load the binary matrix, following this example 
>>>>>>> > https://petsc.org/main/src/mat/tutorials/ex1.c.html
>>>>>>> 
>>>>>>> I tried an example like this, but the performance was too slow (it 
>>>>>>> would process ~2000-3000 calls to `SetValue` a second), which is not 
>>>>>>> reasonable for loading matrices with millions of non-zeros.
>>>>>> The matrix market file in text format is not good for load.  One should 
>>>>>> convert it to petsc binary format (only once), and use the new binary 
>>>>>> file  afterwards. 
>>>>>>  
>>>>>>> 
>>>>>>> > I don't know what "No Races" means, but it seems you'd better also 
>>>>>>> > verify the result of SpMV. 
>>>>>>> 
>>>>>>> This is a correct implementation of SpMV. The "No Races" setting is 
>>>>>>> fine because the kernel parallelizes over the rows of the matrix, so 
>>>>>>> writes to the output need no synchronization.
>>>>>>> 
>>>>>>> > You can think petsc's default CSR spmv is the baseline,  which is 
>>>>>>> > done in ~10 lines of code. 
>>>>>>> 
>>>>>>> I'm sorry, but I don't think that is a reasonable statement w.r.t. 
>>>>>>> the lines of code making it a good baseline. The TACO compiler also can 
>>>>>>> be used in 10 lines of code to compute an SpMV, or any other 
>>>>>>> state-of-the-art library could wrap an SpMV implementation behind a 
>>>>>>> single function call. I'm wondering if this performance I'm seeing 
>>>>>>> using PETSc is expected, or if I've misconfigured or am misusing the 
>>>>>>> system in some way.
>>>>>> I meant 10 lines of code without any function call, which can be thought 
>>>>>> of as a textbook implementation of SpMV. As a baseline, one can apply 
>>>>>> optimizations to it.  PETSc does not do sophisticated sparse matrix 
>>>>>> optimization itself, instead it relies on third-party libraries.  I 
>>>>>> remember we had OSKI from Berkeley for CPU, and on GPU we use cuSparse, 
>>>>>> hipSparse, MKLSparse or Kokkos-Kernels. If TACO is good, then petsc can 
>>>>>> add an interface to it too.
>>>>>>  
>>>>>>> Rohan
>>>>>>> 
>>>>>>> 
>>>>>>>> On Fri, Dec 10, 2021 at 11:39 PM Junchao Zhang 
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> On Fri, Dec 10, 2021 at 8:05 PM Rohan Yadav <[email protected]> 
>>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi, I’m Rohan, a student working on compilation techniques for 
>>>>>>>>> distributed tensor computations. I’m looking at using PETSc as a 
>>>>>>>>> baseline for experiments I’m running, and want to understand if I’m 
>>>>>>>>> using PETSc as it was intended to achieve high performance, and if 
>>>>>>>>> the performance I’m seeing is expected. Currently, I’m just looking 
>>>>>>>>> at SpMV operations.
>>>>>>>>> 
>>>>>>>>> My experiments are run on the Lassen Supercomputer 
>>>>>>>>> (https://hpc.llnl.gov/hardware/platforms/lassen). The system has 40 
>>>>>>>>> CPUs, 4 V100s and an Infiniband interconnect. A visualization of the 
>>>>>>>>> architecture is here: 
>>>>>>>>> https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png.
>>>>>>>>> 
>>>>>>>>> As of now, I’m trying to understand the single-node performance of 
>>>>>>>>> PETSc, as the scaling performance onto multiple nodes appears to be 
>>>>>>>>> as I expect. I’m using the arabic-2005 sparse matrix from the 
>>>>>>>>> SuiteSparse matrix collection, detailed here: 
>>>>>>>>> https://sparse.tamu.edu/LAW/arabic-2005. As a trusted baseline, I am 
>>>>>>>>> comparing against SpMV code generated by the TACO compiler 
>>>>>>>>> (http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races).
>>>>>>>> I don't know what "No Races" means, but it seems you'd better also 
>>>>>>>> verify the result of SpMV. 
>>>>>>>>> 
>>>>>>>>> My experiments find that PETSc is roughly 4 times slower on a single 
>>>>>>>>> thread and node than the kernel generated by TACO:
>>>>>>>>> 
>>>>>>>>> PETSc: 1 Thread: 5694.72 ms, 1 Node 40 threads: 262.6 ms.
>>>>>>>>> TACO: 1 Thread: 1341 ms, 1 Node 40 threads: 86 ms.
>>>>>>>> You can think petsc's default CSR spmv is the baseline,  which is done 
>>>>>>>> in ~10 lines of code. 
>>>>>>>>> 
>>>>>>>>> My code using PETSc is here: 
>>>>>>>>> https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38.
>>>>>>>>> 
>>>>>>>>> Runs from 1 thread and 1 node with -log_view are attached to the 
>>>>>>>>> email. The command lines for each were as follows:
>>>>>>>>> 
>>>>>>>>> 1 node 1 thread: `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20 
>>>>>>>>> -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>>>>>>> 1 node 40 threads: `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark -n 
>>>>>>>>> 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> In addition to these benchmarking concerns, I wanted to share my 
>>>>>>>>> experiences trying to load data from Matrix Market files into PETSc, 
>>>>>>>>> which ended up being much more difficult than I anticipated. 
>>>>>>>>> Essentially, trying to iterate through the Matrix Market files and 
>>>>>>>>> using `write` to insert entries into a `Mat` was extremely slow. In 
>>>>>>>>> order to get reasonable performance, I had to use an external utility 
>>>>>>>>> to basically construct a CSR matrix, and then pass the arrays from 
>>>>>>>>> the CSR Matrix into `MatCreateSeqAIJWithArrays`. I couldn’t find any 
>>>>>>>>> more guidance on PETSc forums or Google, so I wanted to know if this 
>>>>>>>>> was the right way to go.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Rohan Yadav
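
For readers following the Matrix Market discussion above, here is a minimal sketch of the "construct a CSR matrix first" step in C. It is not the external utility used in the thread; `coo_to_csr` and its parameter names are hypothetical. It assumes the triplets have already been read into memory and shifted to 0-based indices (Matrix Market files are 1-based), and it keeps duplicates and the input order of columns within each row.

```c
#include <assert.h>
#include <stdlib.h>

/* Convert nnz (row, col, value) triplets with 0-based indices into CSR arrays.
 * The caller provides row_ptr (m + 1 entries) and col_idx, val (nnz entries).
 * Hypothetical helper for illustration, not part of PETSc. */
static void coo_to_csr(int m, int nnz, const int *rows, const int *cols,
                       const double *vals, int *row_ptr, int *col_idx,
                       double *val)
{
    /* Pass 1: count the non-zeros in each row (stored one slot to the right). */
    for (int i = 0; i <= m; i++) row_ptr[i] = 0;
    for (int k = 0; k < nnz; k++) {
        assert(rows[k] >= 0 && rows[k] < m);
        row_ptr[rows[k] + 1]++;
    }
    /* A prefix sum turns the counts into row start offsets. */
    for (int i = 0; i < m; i++) row_ptr[i + 1] += row_ptr[i];

    /* Pass 2: scatter each entry into its row using a per-row write cursor. */
    int *next = malloc((size_t)(m > 0 ? m : 1) * sizeof *next);
    assert(next != NULL);
    for (int i = 0; i < m; i++) next[i] = row_ptr[i];
    for (int k = 0; k < nnz; k++) {
        int dst = next[rows[k]]++;
        col_idx[dst] = cols[k];
        val[dst] = vals[k];
    }
    free(next);
}
```

The resulting `row_ptr`, `col_idx` and `val` arrays are the kind of input that `MatCreateSeqAIJWithArrays` takes, provided the integer and scalar types match PETSc's `PetscInt` and `PetscScalar`. Whether the column indices within each row must also be sorted is worth checking against the PETSc documentation before relying on this sketch.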
