On Dec 22, 2020, at 3:38 PM, Mark Adams wrote:
I am doing an MPI-serial LU solve of a smallish matrix (2D, Q3, 8K equations)
on a Summit node (42 P9 cores, 6 V100 GPUs) using cuSparse and Kokkos kernels.
The cuSparse performance is terrible.
I solve the same TS problem in MPI serial on each global process. I run with
NP=1 or with (all) 7 cores/MPI processes per GPU:
MatLUFactorNum time (sec), using all 6 GPUs:

NP/GPU   cuSparse   Kokkos kernels
1        0.12       0.075
7        0.55       0.072   // some noise here

So cuSparse is about 2x slower on one process and 8x slower when using all the
cores, from memory contention I assume.
I found that the problem is in MatSeqAIJCUSPARSEBuildILULower[Upper]TriMatrix.
Most of this excess time is in:
    cerr = cudaMallocHost((void**) &AALo, nzLower*sizeof(PetscScalar));CHKERRCUDA(cerr);
and
    cerr = cudaFreeHost(AALo);CHKERRCUDA(cerr);
nzLower is about 140K. Here is my timer data, in a stage after a "warm up
stage":
Inner-MatSeqAIJCUSPARSEBuildILULowerTriMatrix 12 1.0 2.3514e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 23 0 0 0 0 0 0 12 1.34e+01 0 0.00e+00 0
MatSeqAIJCUSPARSEBuildILULowerTriMatrix: cudaMallocHost 12 1.0 1.5448e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 15 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
MatSeqAIJCUSPARSEBuildILULowerTriMatrix: cudaFreeHost 12 1.0 8.3908e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 8 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
Allocation/free of pinned memory is slow, usually on the order of several
milliseconds. So these numbers look normal. Is there any opportunity to reuse
the pinned memory in these functions?
Hong (Mr.)
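One way the reuse suggested above could look is a grow-only cache that keeps the host buffer alive between factorizations. This is only a sketch under stated assumptions, not PETSc's actual code: PinnedCache, pinned_cache_get, pinned_cache_destroy and demo_repeated_factorizations are hypothetical names, pinned_alloc/pinned_free are malloc/free stand-ins for cudaMallocHost/cudaFreeHost so that it builds without CUDA, and double stands in for PetscScalar.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins so the sketch builds without CUDA. In real code
   these would wrap cudaMallocHost() and cudaFreeHost(). */
static int n_alloc_calls = 0;
static int n_free_calls  = 0;

static int pinned_alloc(void **p, size_t bytes)
{
  n_alloc_calls++;
  *p = malloc(bytes);
  return *p ? 0 : 1;
}

static int pinned_free(void *p)
{
  n_free_calls++;
  free(p);
  return 0;
}

/* Grow-only cache: holds one buffer and reallocates only when a request
   exceeds the current capacity. */
typedef struct {
  void  *ptr;
  size_t capacity; /* in bytes */
} PinnedCache;

static int pinned_cache_get(PinnedCache *c, size_t bytes, void **out)
{
  if (bytes > c->capacity) {
    if (c->ptr) {
      if (pinned_free(c->ptr)) return 1;
      c->ptr      = NULL;
      c->capacity = 0;
    }
    if (pinned_alloc(&c->ptr, bytes)) return 1;
    c->capacity = bytes;
  }
  *out = c->ptr;
  return 0;
}

static int pinned_cache_destroy(PinnedCache *c)
{
  int err = 0;
  if (c->ptr) err = pinned_free(c->ptr);
  c->ptr      = NULL;
  c->capacity = 0;
  return err;
}

/* Mimics 12 numeric factorizations with nzLower of about 140K and returns
   the number of underlying allocation calls that were made. */
static int demo_repeated_factorizations(void)
{
  PinnedCache  cache   = {NULL, 0};
  const size_t nzLower = 140000;
  int          i, err;

  n_alloc_calls = 0;
  n_free_calls  = 0;
  for (i = 0; i < 12; i++) {
    void *buf = NULL;
    err = pinned_cache_get(&cache, nzLower * sizeof(double), &buf);
    assert(!err && buf);
    ((double *)buf)[0] = (double)i; /* stand-in for filling AALo */
  }
  err = pinned_cache_destroy(&cache);
  assert(!err);
  (void)err;
  return n_alloc_calls;
}
```

In this sketch the 12 requests trigger one allocation and one free instead of 12 of each. In the real routines the cache would presumably have to live in the matrix's factorization data and be released when the matrix is destroyed; where exactly it belongs is an open question here.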
The same 0.23 sec is spent in the Upper routine as well, for a total of ~0.46
sec, which pretty much matches the difference with Kokkos (0.55 - 0.072 = 0.48
sec).
Any ideas?
Thanks,
Mark