For the near future I am going to be driving this with multiple MPI processes per GPU, and the LU factorizations are the big problem. I can use the existing serial ASM solver, which would call cuSPARSE from each MPI process, so that will run as is. That is what I need for a paper: LU factorization on the GPU. Splitting these 10 solves off for SuperLU is the first step.
A user, however, may not want to run with 7 MPI processes per GPU (on Summit), so in that case some sort of asynchronous approach will be needed. But that is later. A code that I work with uses Kokkos, and they use OpenMP to drive asynchronous GPU processes.

On Wed, Dec 30, 2020 at 10:49 PM Jed Brown <[email protected]> wrote:

> Mark Adams <[email protected]> writes:
>
> > On Wed, Dec 30, 2020 at 8:57 PM Barry Smith <[email protected]> wrote:
> >
> >> > On Dec 30, 2020, at 7:30 PM, Jed Brown <[email protected]> wrote:
> >> >
> >> > Barry Smith <[email protected]> writes:
> >> >
> >> >> If you are using direct solvers on each block on each GPU (several
> >> >> matrices on each GPU) you could pull apart, for example,
> >> >> MatSolve_SeqAIJCUSPARSE()
> >> >> and launch each of the matrix solves on a separate stream. You could
> >> >> use a MatSolveBegin/MatSolveEnd style or as Jed may prefer a Wait() model.
> >> >> Maybe a couple hours coding to produce a prototype
> >> >> MatSolveBegin/MatSolveEnd from MatSolve_SeqAIJCUSPARSE.
> >> >
> >> > I doubt cusparse_solve is a single kernel launch (and there's two of
> >> > them already). You'd almost certainly need a thread to keep driving it, or
> >> > an async/await model. Begin/End pairs for compute (even "offloaded")
> >> > compute are no small change.
> >>
> >> Why, it can simply launch the 4 non-blocking kernels needed in the same
> >> stream for a given matrix and then go to the next matrix and do the same in
> >> the next stream. If the GPU is smart enough to manage utilizing the
> >> multiple streams I don't see why any baby-sitting by the CPU is needed at
> >> all. Note there is no CPU work needed between each of the 4 kernels that I
> >> can see.
> >
> > I agree. The GPU scheduler can partition the GPU in space and time to keep
> > it busy. For instance a simple model for my 10 solves is loop over all
> > blocks, do a non-blocking Solve, and wait. My solves might fill 1/10 of the
> > GPU, say, and I get 10x speed up. I think this is theoretically possible
> > and there will be inefficiency but I have noticed that my current code
> > overlaps CPU and GPU work in separate MPI processes, which is just one way
> > to do things asynchronously. There are mechanisms to do this with one
> > process.
>
> I missed that cusparseDcsrsv2_solve() supports asynchronous execution,
> however it appears that it needs to do some work (launching a kernel to
> inspect device memory and waiting for it to complete) to know what error to
> return (at least on the factor that does not have unit diagonal).
>
> | Function csrsv2_solve() reports the first numerical zero, including a
> | structural zero. If status is 0, no numerical zero was found. Furthermore,
> | no numerical zero is reported if CUSPARSE_DIAG_TYPE_UNIT is specified, even
> | if A(j,j) is zero for some j. The user needs to call
> | cusparseXcsrsv2_zeroPivot() to know where the numerical zero is.
>
> https://docs.nvidia.com/cuda/cusparse/index.html#csrsv2_solve
>
> As such, I remain skeptical that you can just fire off a bunch of these
> without incurring significant serialization penalty.
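
To make the "loop over all blocks, do a non-blocking Solve, and wait" model from the thread concrete, here is a minimal host-side sketch in C++. It is only an analogy under stated assumptions: std::async threads stand in for CUDA streams, and a small dense forward substitution stands in for the cuSPARSE triangular solve. The names Block, SolveResult, forward_solve, solve_begin, solve_end and make_demo_blocks are hypothetical and are not PETSc or cuSPARSE API. The zero_pivot field mimics the status discussed in the quoted cuSPARSE documentation: it can only be read once the solve has finished, which is the point where a wait enters.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-in for one block: a dense n x n lower-triangular
// factor L (row-major) and a right-hand side b.
struct Block {
  int n;
  std::vector<double> L;
  std::vector<double> b;
};

// Solution plus a status: -1 if no zero pivot was hit, otherwise the row
// index of the first zero on the diagonal.
struct SolveResult {
  std::vector<double> x;
  int zero_pivot;
};

// Forward substitution for L x = b. Stops at the first zero diagonal entry
// and reports its row instead of dividing by it.
SolveResult forward_solve(const Block &blk) {
  SolveResult r{std::vector<double>(blk.n, 0.0), -1};
  for (int i = 0; i < blk.n; ++i) {
    const double d = blk.L[i * blk.n + i];
    if (d == 0.0) {
      r.zero_pivot = i;
      return r;
    }
    double s = blk.b[i];
    for (int j = 0; j < i; ++j) s -= blk.L[i * blk.n + j] * r.x[j];
    r.x[i] = s / d;
  }
  return r;
}

// "Begin": launch one asynchronous solve per block and return immediately.
// The blocks must outlive the returned futures.
std::vector<std::future<SolveResult>> solve_begin(const std::vector<Block> &blocks) {
  std::vector<std::future<SolveResult>> pending;
  pending.reserve(blocks.size());
  for (const Block &blk : blocks)
    pending.push_back(std::async(std::launch::async,
                                 [&blk] { return forward_solve(blk); }));
  return pending;
}

// "End": wait for every pending solve. Reading zero_pivot is only possible
// after this wait, so a status check forces synchronization.
std::vector<SolveResult> solve_end(std::vector<std::future<SolveResult>> &pending) {
  std::vector<SolveResult> out;
  out.reserve(pending.size());
  for (auto &f : pending) out.push_back(f.get());
  return out;
}

// Build nblocks 2x2 demo blocks whose exact solution is x = (1, 1).
// Block number `bad` (if in range) gets a zero in its last diagonal entry.
std::vector<Block> make_demo_blocks(int nblocks, int bad) {
  std::vector<Block> blocks;
  for (int k = 0; k < nblocks; ++k) {
    Block blk{2, {2.0, 0.0, 1.0, k + 1.0}, {2.0, k + 2.0}};
    if (k == bad) blk.L[3] = 0.0;
    blocks.push_back(blk);
  }
  return blocks;
}
```

Usage would be: build the blocks with make_demo_blocks(10, 3), call solve_begin, do other CPU work, then call solve_end and inspect zero_pivot per block. Whether the GPU version overlaps as well as this host analogy depends on how much of the cuSPARSE solve is truly asynchronous, which is exactly what is in dispute above.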
