Dear petsc-dev,

We're starting to explore (with Andreas, cc'd) residual assembly on GPUs. The question naturally arises: how to do GlobalToLocal and LocalToGlobal.
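
For reference, with host data this already boils down to a broadcast/reduce over an SF; roughly the sketch below (function and variable names illustrative, assuming an SF whose roots are the global dofs and whose leaves are the local dofs):

#include <petscsf.h>
#include <petscvec.h>

/* Global-to-local: broadcast root (global) values to leaves (local),
   using plain host arrays. */
static PetscErrorCode GlobalToLocalHost(PetscSF sf, Vec gvec, Vec lvec)
{
  const PetscScalar *garray;
  PetscScalar       *larray;
  PetscErrorCode     ierr;

  PetscFunctionBegin;
  ierr = VecGetArrayRead(gvec, &garray);CHKERRQ(ierr);
  ierr = VecGetArray(lvec, &larray);CHKERRQ(ierr);
  ierr = PetscSFBcastBegin(sf, MPIU_SCALAR, garray, larray);CHKERRQ(ierr);
  ierr = PetscSFBcastEnd(sf, MPIU_SCALAR, garray, larray);CHKERRQ(ierr);
  ierr = VecRestoreArrayRead(gvec, &garray);CHKERRQ(ierr);
  ierr = VecRestoreArray(lvec, &larray);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

LocalToGlobal is the corresponding PetscSFReduceBegin/End with an appropriate MPI_Op (e.g. MPIU_SUM).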
I have:

- A PetscSF describing the communication pattern.
- A Vec holding the data to communicate. This will have an up-to-date device pointer.

I would like:

- PetscSFBcastBegin/End (and ReduceBegin/End, etc.) to (optionally) work with raw device pointers (a sketch of the calling sequence I have in mind is in the P.S. below).

I am led to believe that modern MPIs can plug directly into device memory, so I would like to avoid copying the data to the host, doing the communication there, and then going back up to the device. Given that the window implementation (which just delegates all the packing to MPI) is, I gather, not considered ready for prime time (mostly due to MPI implementation bugs), I think this means implementing a version of PetscSF_Basic that does the pack/unpack directly on the device and then just hands the buffers off to MPI.

The next question is how to put a higher-level interface on top of this. What, if any, suggestions are there for making the top-level API agnostic to whether the data are on the host or the device? We had thought of something like:

- Make PetscSF handle device pointers (possibly with a new implementation?).
- Make VecScatter use SF, so that calling VecScatterBegin/End on a Vec with up-to-date device pointers just uses the SF directly.

Have there been any thoughts about how you want to do multi-GPU interaction?

Cheers,

Lawrence
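
P.S. For concreteness, the calling sequence I have in mind is something like the sketch below: hand the SF the up-to-date device pointers, let it pack/unpack on the device, and pass the buffers to a CUDA-aware MPI. VecCUDAGetArrayRead/Write are the existing CUDA Vec accessors; the wrapper name is illustrative, and how the SF learns it has been given device pointers (a flag? pointer introspection? a new implementation?) is exactly the open question.

#include <petscsf.h>
#include <petscvec.h>

/* Wished-for global-to-local: same SF broadcast as the host path,
   but fed raw device pointers, with packing done on the device. */
static PetscErrorCode GlobalToLocalDevice(PetscSF sf, Vec gvec, Vec lvec)
{
  const PetscScalar *gdev;
  PetscScalar       *ldev;
  PetscErrorCode     ierr;

  PetscFunctionBegin;
  ierr = VecCUDAGetArrayRead(gvec, &gdev);CHKERRQ(ierr);
  ierr = VecCUDAGetArrayWrite(lvec, &ldev);CHKERRQ(ierr);
  /* SF would need to know these are device pointers */
  ierr = PetscSFBcastBegin(sf, MPIU_SCALAR, gdev, ldev);CHKERRQ(ierr);
  ierr = PetscSFBcastEnd(sf, MPIU_SCALAR, gdev, ldev);CHKERRQ(ierr);
  ierr = VecCUDARestoreArrayRead(gvec, &gdev);CHKERRQ(ierr);
  ierr = VecCUDARestoreArrayWrite(lvec, &ldev);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}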