I am testing my Landau code, which is MPI serial, in an MPI-parallel harness code (Landau ex2) that runs many independent MPI processes driving each GPU.
With one process per GPU, vector operations with Kokkos Kernels (KK) and cuSparse cost about the same (KK is the faster of the two), and both are a bit expensive: about the same cost as my Jacobian construction, which is expensive but optimized on the GPU. (I am using the arkimex adaptive TS. I am guessing that it does a lot of vector ops, because I see a lot of them.)

With 14 or 15 processes per GPU, all doing the same MPI-serial problem, cuSparse is about 2.5x more expensive than KK. KK does degrade by about 15% from the one-process case. So KK is doing fine, but something bad is happening with cuSparse.

Does anyone have any thoughts on this?

Thanks,
Mark
