> On Dec 30, 2020, at 7:30 PM, Jed Brown <[email protected]> wrote:
>
> Barry Smith <[email protected]> writes:
>
>> If you are using direct solvers on each block on each GPU (several matrices
>> on each GPU) you could pull apart, for example, MatSolve_SeqAIJCUSPARSE()
>> and launch each of the matrix solves on a separate stream. You could use a
>> MatSolveBegin/MatSolveEnd style or, as Jed may prefer, a Wait() model. Maybe a
>> couple hours coding to produce a prototype MatSolveBegin/MatSolveEnd from
>> MatSolve_SeqAIJCUSPARSE.
>
> I doubt cusparse_solve is a single kernel launch (and there's two of them
> already). You'd almost certainly need a thread to keep driving it, or an
> async/await model. Begin/End pairs for (even "offloaded") compute are
> no small change.
Why, it can simply launch the 4 non-blocking kernels needed in the same
stream for a given matrix, then go to the next matrix and do the same in the
next stream. If the GPU is smart enough to manage utilizing the multiple
streams, I don't see why any baby-sitting by the CPU is needed at all. Note
that, as far as I can see, no CPU work is needed between the 4 kernels.
>
>> Note pulling apart a non-coupled single MatAIJ that contains all the
>> matrices would be hugely expensive. Better to build each matrix already
>> separate or use MatNest with only diagonal matrices.
>
> Nonsense, the ND will notice that they're decoupled and you get more meat per
> kernel launch.
Yes, if the underlying GPU factorization and solver can take advantage of
this, you are of course completely correct. It would be a good test of
SuperLU_DIST GPU to just give it the uncoupled big matrix and see how it does
with profiling on the GPU. It is playing the "I have information that I throw
away and then expect the software to recover" game.
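
To make "recover" concrete: noticing that the blocks are decoupled amounts to
finding connected components of the matrix's sparsity graph. Below is a minimal
sketch in plain Python, assuming a square matrix in CSR form. The helper name
decoupled_blocks and the 5x5 pattern are hypothetical, made up for
illustration; this is not PETSc or SuperLU_DIST code, and an actual ordering
package may detect the decoupling quite differently.

```python
def decoupled_blocks(indptr, indices):
    """Group the rows of a square CSR matrix into independent blocks.

    Rows i and j end up in the same block when a stored entry (i, j)
    connects them, directly or through a chain of such entries.
    """
    n = len(indptr) - 1
    parent = list(range(n))  # union-find forest, one node per row

    def find(i):
        # Path halving keeps the trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Every stored entry (row, col) merges the components of row and col.
    for row in range(n):
        for k in range(indptr[row], indptr[row + 1]):
            a, b = find(row), find(indices[k])
            if a != b:
                parent[b] = a

    blocks = {}
    for row in range(n):
        blocks.setdefault(find(row), []).append(row)
    return sorted(blocks.values())


# Hypothetical 5x5 pattern: rows {0, 1} couple, rows {2, 3, 4} couple.
indptr = [0, 2, 4, 6, 9, 11]
indices = [0, 1, 0, 1, 2, 3, 2, 3, 4, 3, 4]
print(decoupled_blocks(indptr, indices))  # [[0, 1], [2, 3, 4]]
```

This only identifies which rows belong together. Whether the factorization and
triangular solves on the GPU then exploit the blocks is exactly what the
profiling run would have to show.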