Ah, so first first maybe the Solver does not need a stream because it uses VecScatter ?
The model is 1) block to get your data, 2) doit non-blocking. If a vector is used in a non-GPU method that does not know about (1) you are hosed, but that is a detail. I like the middle, single loop algo or whatever is easier, keep it simple, and once it's running it is easy to an experiment with more loops. I could try this with fieldsplit/additive and Jacob's Vector object on Cuda, with Jacobi now of its ready. Jacob? On Sun, Jan 10, 2021 at 9:38 PM Barry Smith <[email protected]> wrote: > > > On Jan 10, 2021, at 1:30 PM, Mark Adams <[email protected]> wrote: > > > > On Sat, Jan 9, 2021 at 11:14 PM Barry Smith <[email protected]> wrote: > >> >> Ok, fieldsplit, pretty much the same as what I said for block Jacobi. >> >> static PetscErrorCode PCApply_FieldSplit(PC pc,Vec x,Vec y) >> { >> PC_FieldSplit *jac = (PC_FieldSplit*)pc->data; >> PetscErrorCode ierr; >> PC_FieldSplitLink ilink = jac->head; >> PetscInt cnt,bs; >> >> PetscFunctionBegin; >> if (jac->type == PC_COMPOSITE_ADDITIVE) { >> if (jac->defaultsplit) { >> ierr = VecGetBlockSize(x,&bs);CHKERRQ(ierr); >> if (jac->bs > 0 && bs != jac->bs) >> SETERRQ2(PetscObjectComm((PetscObject)pc),PETSC_ERR_ARG_WRONGSTATE,"Blocksize >> of x vector %D does not match fieldsplit blocksize %D",bs,jac->bs); >> ierr = VecGetBlockSize(y,&bs);CHKERRQ(ierr); >> if (jac->bs > 0 && bs != jac->bs) >> SETERRQ2(PetscObjectComm((PetscObject)pc),PETSC_ERR_ARG_WRONGSTATE,"Blocksize >> of y vector %D does not match fieldsplit blocksize %D",bs,jac->bs); >> ierr = VecStrideGatherAll(x,jac->x,INSERT_VALUES);CHKERRQ(ierr); >> while (ilink) { >> ierr = >> PetscLogEventBegin(ilink->event,ilink->ksp,ilink->x,ilink->y,NULL);CHKERRQ(ierr); >> ierr = KSPSolve(ilink->ksp,ilink->x,ilink->y);CHKERRQ(ierr); >> >> you need these KSPSolve() to not block which if you are using >> SuperLU_DIST means each call to >> >> PetscStackCall("SuperLU_DIST:PStatInit",PStatInit(&stat)); /* >> Initialize the statistics variables. */ >> #if defined(PETSC_USE_COMPLEX) >> >> PetscStackCall("SuperLU_DIST:pzgssvx",pzgssvx(&lu->options,&lu->A_sup,&lu->ScalePermstruct,(doublecomplex*)bptr,m,1,&lu->grid,&lu->LUstruct,&lu->SOLVEstruct,berr,&stat,&info)); >> #else >> >> PetscStackCall("SuperLU_DIST:pdgssvx",pdgssvx(&lu->options,&lu->A_sup,&lu->ScalePermstruct,bptr,m,1,&lu->grid,&lu->LUstruct,&lu->SOLVEstruct,berr,&stat,&info)); >> #endif >> >> must be on a different stream and cannot block. Which means essentially >> the PStatInit and pdgssvx have to run completely on the GPU or have some >> complex CPU babysitting that allows them to be interlaced on the GPU while >> doing their CPU stuff. >> > > It is not clear why the library code would be that complex but I don't > understand streams well. SuperLU currently uses Thrust driven by CPU code, > but is developing a pure GPU version. I can wait for the pure GPU version. > > >> I doubt either thing is true and you are best off using the cuSparse >> Solvers which do run completely on the GPU in any stream you give them. >> > > CUSparse does not have an LU factorization. > > >> >> The factorization is even more of a nightmare, I think Sherry uses the >> GPU just to do blocks for her sequential factorization so I doubt very much >> it can handle multiple simultaneous factorizations. >> > > Oh you were talking about the solves. (I have only been thinking about > factorizations and that has caused some confusion I now see). > > I am still not clear about the parallel dispatch. If an MPI serial library > method needs a lot of complex logic to work perhaps look at other models? > The XGC code uses OMP a lot and they dispatch asynchronous "solves" in an > OMP loop. I'm not sure if being in an OMP thread buys you anything for a > GPU solver but this is what they have done for CPU solvers. At one point > they used PETSc for a LU solve and we (Barry) made some fixes to make it > thread safe enough for this operation, but don't know what GPU "solvers" > they have now running in this model on the GPU. > > Regardless, my best guess as the basic model is something like: > > > The thing is streams manage their own dependencies so you don't want or > need any CPU waits on them. Assume both VecScatter and Solve take a stream > and are not blocking. The code for CUDA GPUs could be > > for all blocks i > VecScatter(i, i.stream) // launches scatter kernel(s) > for all blocks i > Solve(i, i.stream) // launches solve kernel(s) > for all blocks i > VecScatter(i, i.stream) // launches scatter kernel(s) > > launch here means the CPU gives the stream and the code and data pointers > to the GPU to put in its queue and schedule when it can. > > Since everything is non-blocking on the CPU just zips through the loops, > taking about 10 micro-seconds for each "launch" and the GPU manages the > scheduling. You could also do > > > for all blocks i > VecScatter(i, i.stream) // launches scatter kernel(s) > Solve(i, i.stream) // launches solve kernel(s) > VecScatter(i, i.stream) // launches scatter kerkel(s) > > Again the CPU just zips through all the loops giving the GPU a big list > of kernels and the GPU manages the scheduling. The second approach would, I > am guessing, be slightly slower since the GPU gets the scatter kernels with > some time spacing between them (because the CPU is putting the later > operations in the queue) so cannot schedule them as rapidly. > > There is also > > for all blocks i > VecScatter(i, i.stream) // launches scatter kernel(s) > for all blocks i > Solve(i, i.stream) // launches solve kernel(s) > VecScatter(i, i.stream) // launches scatter kerkel(s) > > which may be as good as the first assuming the scatters take long enough > to cover the time for all the other kernels to be launched by the CPU and > queued up by the GPU. > > > > > For all blocks i, > create a stream i.s, and launch a non-blocking vecScater with i.s > For all blocks i, > vecScaterEnd(i), > Solve(i.s, ....) // So library solvers and our built-in solvers would > need to take a stream or I guess that would get passed in through Jacob's > vector class and the library interface code would need to take the Cuda > stream out and give it to SuperLU, say. > For all blocks i, > Wait (i.s) > vecscatterBegin (i.s, ...) > For all blocks i, > vecscatterEnd (i.s, ...) // non-overlapping blocks so problem with > adding values > > >
