On Dec 22, 2020, at 3:38 PM, Mark Adams <[email protected]> wrote:

I am doing a serial (per-MPI-process) LU solve of a smallish matrix (2D, Q3, 
8K equations) on a Summit node (42 P9 cores, 6 V100 GPUs) using cuSparse and 
Kokkos kernels. The cuSparse performance is terrible.

Each MPI process solves the same TS problem serially. I run with either 1 or 
(all) 7 MPI processes (cores) per GPU.
MatLUFactorNum time in seconds, using all 6 GPUs:

NP/GPU   cuSparse   Kokkos kernels
1        0.12       0.075
7        0.55       0.072   // some noise here

So cuSparse is about 2x slower on one process and 8x slower when using all the 
cores, from memory contention I assume.

I found that the problem is in MatSeqAIJCUSPARSEBuildILULower[Upper]TriMatrix. 
Most of this excess time is in:

      cerr = cudaMallocHost((void**) &AALo, nzLower*sizeof(PetscScalar));CHKERRCUDA(cerr);

and

      cerr = cudaFreeHost(AALo);CHKERRCUDA(cerr);

nzLower is about 140K. Here is my timer data, in a stage after a "warm up 
stage":

   Inner-MatSeqAIJCUSPARSEBuildILULowerTriMatrix      12 1.0 2.3514e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  23  0  0  0  0     0       0     12 1.34e+01    0 0.00e+00  0
   MatSeqAIJCUSPARSEBuildILULowerTriMatrix: cudaMallocHost      12 1.0 1.5448e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  15  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
   MatSeqAIJCUSPARSEBuildILULowerTriMatrix: cudaFreeHost      12 1.0 8.3908e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   8  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0

Allocation/free of pinned memory is slow, usually on the order of several 
milliseconds. So these numbers look normal. Is there any opportunity to reuse 
the pinned memory in these functions?

Hong (Mr.)
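
A minimal sketch of what reusing the pinned buffer could look like, assuming 
the required size (nzLower) stays the same or shrinks between numeric 
factorizations. Every name here (HostBufferCache and its functions, the 
allocator hooks) is hypothetical and not part of PETSc. malloc/free stand in 
for cudaMallocHost/cudaFreeHost so the sketch compiles without the CUDA 
toolkit; in a real build the hooks would wrap those two calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator hooks. In a CUDA build these would wrap
   cudaMallocHost/cudaFreeHost; malloc/free are stand-ins here. */
typedef void *(*host_alloc_fn)(size_t nbytes);
typedef void  (*host_free_fn)(void *ptr);

/* Hypothetical cache for one host staging buffer (e.g. AALo). */
typedef struct {
  void         *ptr;      /* cached buffer, NULL until first use   */
  size_t        capacity; /* bytes currently allocated             */
  int           nalloc;   /* count of real allocations (for tests) */
  host_alloc_fn alloc;
  host_free_fn  release;
} HostBufferCache;

static void HostBufferCacheInit(HostBufferCache *c, host_alloc_fn a, host_free_fn f)
{
  c->ptr = NULL; c->capacity = 0; c->nalloc = 0;
  c->alloc = a;  c->release = f;
}

/* Return a buffer of at least nbytes. A real allocation happens only
   when the cached buffer is too small; otherwise the old one is reused. */
static void *HostBufferCacheGet(HostBufferCache *c, size_t nbytes)
{
  if (nbytes > c->capacity) {
    if (c->ptr) c->release(c->ptr);
    c->ptr      = c->alloc(nbytes);
    c->capacity = c->ptr ? nbytes : 0;
    if (c->ptr) c->nalloc++;
  }
  return c->ptr;
}

/* Release the cached buffer once, e.g. when the matrix is destroyed. */
static void HostBufferCacheDestroy(HostBufferCache *c)
{
  if (c->ptr) c->release(c->ptr);
  c->ptr = NULL; c->capacity = 0;
}
```

With something like this, the 12 calls in the log above would pay for one 
allocation and one free in total instead of 12 of each, provided the buffer 
can live for the lifetime of the factored matrix.
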

This 0.23 sec happens in Upper also, for a total of ~0.46, which pretty much 
matches the difference with Kokkos.
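
As a rough check of the arithmetic, using only the numbers quoted above (12 
calls per event in the log, and the MatLUFactorNum times from the table), 
here is a small sketch with two hypothetical helper functions:

```c
#include <assert.h>

/* Average cost of one call in milliseconds, given a total in seconds. */
static double per_call_ms(double total_sec, int ncalls)
{
  return 1e3 * total_sec / ncalls;
}

/* How many times slower a is than b. */
static double slowdown(double a_sec, double b_sec)
{
  return a_sec / b_sec;
}

/* Values taken from the message:
     per_call_ms(1.5448e-01, 12)  cudaMallocHost, roughly 12.9 ms per call
     per_call_ms(8.3908e-02, 12)  cudaFreeHost,   roughly  7.0 ms per call
     slowdown(0.12, 0.075)        NP/GPU = 1, roughly 1.6x
     slowdown(0.55, 0.072)        NP/GPU = 7, roughly 7.6x */
```

The gap at 7 processes per GPU, 0.55 - 0.072 = 0.478 sec, is close to the 
~0.46 sec spent in the Lower and Upper build routines combined.
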

Any ideas?

Thanks,
Mark
