> On Dec 30, 2020, at 7:30 PM, Jed Brown <[email protected]> wrote:
> 
> Barry Smith <[email protected]> writes:
> 
>>  If you are using direct solvers on each block on each GPU (several matrices 
>> on each GPU) you could pull apart, for example, MatSolve_SeqAIJCUSPARSE()
>> and launch each of the matrix solves on a separate stream.   You could use a 
>> MatSolveBegin/MatSolveEnd style or as Jed may prefer a Wait() model. Maybe a 
>> couple hours coding to produce a prototype MatSolveBegin/MatSolveEnd from 
>> MatSolve_SeqAIJCUSPARSE.
> 
> I doubt cusparse_solve is a single kernel launch (and there's two of them 
> already). You'd almost certainly need a thread to keep driving it, or an 
> async/await model. Begin/End pairs for compute (even "offloaded" compute) are 
> no small change. 

  Why? It can simply launch the 4 non-blocking kernels needed for a given 
matrix in the same stream, then go to the next matrix and do the same in the 
next stream. If the GPU is smart enough to manage the multiple streams, I 
don't see why any baby-sitting by the CPU is needed at all. Note that, as far 
as I can see, no CPU work is needed between the 4 kernels.
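
  For concreteness, here is a rough sketch of what the Begin/End calling 
pattern looks like from the caller's side. It is a host-only stand-in, not 
PETSc or cuSPARSE code: std::async plays the role a per-matrix stream would 
play on the GPU, a dense forward substitution stands in for the real solve, 
and Block, ForwardSolve, SolveBegin, SolveEnd and SolveAll are all 
hypothetical names made up for illustration.

```cpp
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for one factored matrix: a dense lower-triangular
// block stored row-major. Only the lower triangle is read.
struct Block {
  std::size_t n;
  std::vector<double> L;  // n*n entries
};

// Forward substitution: solve L x = b and return x.
std::vector<double> ForwardSolve(const Block &A, std::vector<double> b) {
  for (std::size_t i = 0; i < A.n; ++i) {
    for (std::size_t j = 0; j < i; ++j) b[i] -= A.L[i * A.n + j] * b[j];
    b[i] /= A.L[i * A.n + i];
  }
  return b;
}

using Handle = std::future<std::vector<double>>;

// Hypothetical SolveBegin: start the solve and return at once with a handle.
// The Block must outlive the handle, since it is captured by reference.
Handle SolveBegin(const Block &A, std::vector<double> b) {
  return std::async(std::launch::async, [&A, b = std::move(b)]() mutable {
    return ForwardSolve(A, std::move(b));
  });
}

// Hypothetical SolveEnd: wait for the solve to finish and collect the result.
std::vector<double> SolveEnd(Handle &h) { return h.get(); }

// Begin every block before ending any, so the solves can overlap.
std::vector<std::vector<double>> SolveAll(
    const std::vector<Block> &blocks,
    const std::vector<std::vector<double>> &rhs) {
  std::vector<Handle> handles;
  for (std::size_t k = 0; k < blocks.size(); ++k)
    handles.push_back(SolveBegin(blocks[k], rhs[k]));
  std::vector<std::vector<double>> x;
  for (auto &h : handles) x.push_back(SolveEnd(h));
  return x;
}
```

  The point is only the ordering in SolveAll: all the Begins are issued 
first and the Ends come afterwards, so the caller does nothing in between.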


> 
>>  Note pulling apart a non-coupled single MatAIJ that contains all the 
>> matrices would be hugely expensive. Better to build each matrix already 
>> separate or use MatNest with only diagonal matrices.
> 
> Nonsense, the ND will notice that they're decoupled and you get more meat per 
> kernel launch.

  Yes, if the underlying GPU factorization and solver can take advantage of 
this, you are, of course, completely correct. It would be a good test of 
SuperLU_DIST GPU to just give it the uncoupled big matrix and see how it does, 
with profiling on the GPU. It is playing the "I have information that I throw 
away and then expect the software to recover" game.
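
  To make "recovering the thrown-away information" concrete: finding the 
decoupled blocks amounts to finding the connected components of the 
matrix's sparsity graph. Below is a minimal sketch using union-find on a 
CSR pattern. It is not what any ordering package or SuperLU_DIST actually 
does internally, the names Find and DecoupledBlocks are hypothetical, and 
it treats any nonzero (i,j) as an undirected edge between rows i and j.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Union-find root lookup with path halving.
static std::size_t Find(std::vector<std::size_t> &parent, std::size_t i) {
  while (parent[i] != i) {
    parent[i] = parent[parent[i]];
    i = parent[i];
  }
  return i;
}

// Label each row of an n-by-n CSR sparsity pattern with the index of the
// decoupled block it belongs to. Blocks are numbered by their first row.
std::vector<std::size_t> DecoupledBlocks(
    std::size_t n, const std::vector<std::size_t> &rowptr,
    const std::vector<std::size_t> &colidx) {
  std::vector<std::size_t> parent(n);
  std::iota(parent.begin(), parent.end(), std::size_t{0});

  // Merge row i with every column index appearing in row i.
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t k = rowptr[i]; k < rowptr[i + 1]; ++k) {
      std::size_t a = Find(parent, i);
      std::size_t b = Find(parent, colidx[k]);
      if (a != b) parent[b] = a;
    }
  }

  // Turn roots into consecutive block labels 0, 1, 2, ...
  const std::size_t unset = n;  // n is never a valid label
  std::vector<std::size_t> label(n), rootLabel(n, unset);
  std::size_t next = 0;
  for (std::size_t i = 0; i < n; ++i) {
    std::size_t r = Find(parent, i);
    if (rootLabel[r] == unset) rootLabel[r] = next++;
    label[i] = rootLabel[r];
  }
  return label;
}
```

  For a 4x4 pattern in which rows 0 and 2 couple only to each other, and 
likewise rows 1 and 3, this labels the rows 0, 1, 0, 1. That is cheap 
relative to a factorization, but it is still work spent rediscovering 
structure the caller already had.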

