Since we cannot post issues (reported here https://forum.gitlab.com/t/creating-new-issue-gives-cannot-create-issue-getting-whoops-something-went-wrong-on-our-end/41966?u=bsmith) here is my issue so I don't forget it.
 I think
err  = WaitForCUDA();CHKERRCUDA(err);
ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
should be changed so that WaitForCUDA() (actually WaitForDevice()) is called inside PetscLogGpuTimeEnd(). Currently the WaitForCUDA() call is missing in a few places, resulting in bad timings. Also, some _SeqCUDA() routines don't call PetscLogGpuTimeEnd() and need to be fixed.
The current model is a maintenance nightmare.
Does anyone see a problem with making this change?
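
For illustration only, here is a minimal sketch of why moving the wait inside the timer end helps. It is not PETSc code: MockDevice, MockGpuTimer and every Mock* function are hypothetical stand-ins, and the "device" is simulated by a counter of queued work, under the assumption that a kernel launch returns immediately and only a wait advances the host clock.

```c
/* Illustrative stand-ins only: none of these types or functions are PETSc or CUDA APIs. */
typedef struct {
  double clock;   /* simulated host wall clock, in seconds */
  double pending; /* seconds of device work queued but not yet waited for */
} MockDevice;

typedef struct {
  double start; /* clock value when timing began */
  double total; /* accumulated time attributed to the GPU */
} MockGpuTimer;

/* Asynchronous launch: queues work and returns without advancing the host clock. */
static void MockLaunchKernel(MockDevice *dev, double cost) { dev->pending += cost; }

/* Blocks until the device is idle: the host clock advances by the queued work. */
static void MockWaitForDevice(MockDevice *dev)
{
  dev->clock += dev->pending;
  dev->pending = 0.0;
}

static void MockGpuTimeBegin(MockGpuTimer *t, const MockDevice *dev) { t->start = dev->clock; }

/* Current model: the caller is responsible for waiting before calling this. */
static void MockGpuTimeEndNoWait(MockGpuTimer *t, const MockDevice *dev)
{
  t->total += dev->clock - t->start;
}

/* Proposed model: the wait lives inside the timer end, so it cannot be forgotten. */
static void MockGpuTimeEndWithWait(MockGpuTimer *t, MockDevice *dev)
{
  MockWaitForDevice(dev);
  t->total += dev->clock - t->start;
}

/* A call site that forgot the explicit wait; returns the time the log would report. */
double TimedKernelMissingWait(double cost)
{
  MockDevice   dev = {0.0, 0.0};
  MockGpuTimer t   = {0.0, 0.0};
  MockGpuTimeBegin(&t, &dev);
  MockLaunchKernel(&dev, cost);
  MockGpuTimeEndNoWait(&t, &dev); /* kernel is still queued, so nothing is measured */
  return t.total;
}

/* The same call site when the timer end performs the wait itself. */
double TimedKernelWaitInside(double cost)
{
  MockDevice   dev = {0.0, 0.0};
  MockGpuTimer t   = {0.0, 0.0};
  MockGpuTimeBegin(&t, &dev);
  MockLaunchKernel(&dev, cost);
  MockGpuTimeEndWithWait(&t, &dev);
  return t.total;
}
```

In this toy model, a call site that forgets the wait reports zero time for the kernel, while the version with the wait inside the timer end reports the kernel's full cost with no extra code at the call site.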

I'm fine with this change, as the maintenance benefits outweigh the performance cost for typical use cases.

I propose also adding a WaitForDevice() call to PetscLogGpuTimeBegin(). This will ensure that no previous GPU kernel executions spill over into the timed section.

   Karl,

   When synchronization is turned on, the previous GPU kernels should always have their own WaitForDevice(), so are you concerned about buggy code that does not include WaitForDevice()?

I'm primarily thinking of user callback routines here. For example, a FormFunction provided by the user that is running some GPU kernels. We have no guarantee that these user kernels have completed before entering the timed sections inside PETSc, so the logs will be skewed to report an unusually slow kernel in PETSc (the one right after the user form function). Arguably we could add a WaitForDevice() after user callback invocations.

I didn't think of the WaitForDevice() after each kernel call in PETSc; with that we do get reasonable timings within PETSc (except for the user callbacks mentioned above), so the two-barrier model is not needed.

Best regards,
Karli

Might this incur extra overhead from checking the device? Or is it always true that, if there are no outstanding kernels, the check returns immediately without going to the GPU?

If we want to have a two-barrier model, I propose we log the time spent waiting at the first barrier separately.

Barry
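
As a rough illustration of the two-barrier idea with the first wait logged separately, here is a small simulation. It is a sketch, not PETSc code: SimDevice, SimGpuLog and all Sim* functions are hypothetical names, and the device is modelled as a counter of queued work. The model treats a wait on an idle device as free, which is exactly the open overhead question above and is not something this sketch establishes.

```c
/* Illustrative stand-ins only: none of these types or functions are PETSc or CUDA APIs. */
typedef struct {
  double clock;   /* simulated host wall clock, in seconds */
  double pending; /* seconds of device work queued but not yet waited for */
} SimDevice;

typedef struct {
  double start;        /* clock value when the timed section began */
  double section;      /* time attributed to the timed section */
  double barrier_wait; /* time spent draining earlier work at the first barrier */
} SimGpuLog;

/* Asynchronous launch: queues work and returns without advancing the host clock. */
static void SimLaunch(SimDevice *d, double cost) { d->pending += cost; }

/* Blocks until the device is idle. */
static void SimWait(SimDevice *d)
{
  d->clock += d->pending;
  d->pending = 0.0;
}

/* One-barrier begin: start timing at once, so earlier queued work leaks into the section. */
static void SimBeginOneBarrier(SimGpuLog *log, const SimDevice *d) { log->start = d->clock; }

/* Two-barrier begin: drain earlier work first and record that wait on its own. */
static void SimBeginTwoBarrier(SimGpuLog *log, SimDevice *d)
{
  double before = d->clock;
  SimWait(d);
  log->barrier_wait += d->clock - before;
  log->start = d->clock;
}

/* End of the timed section: wait for the device, then accumulate the elapsed time. */
static void SimEnd(SimGpuLog *log, SimDevice *d)
{
  SimWait(d);
  log->section += d->clock - log->start;
}

/* A user callback queues user_cost of work without waiting; then a timed library kernel runs. */
SimGpuLog SimRun(int two_barrier, double user_cost, double lib_cost)
{
  SimDevice d   = {0.0, 0.0};
  SimGpuLog log = {0.0, 0.0, 0.0};
  SimLaunch(&d, user_cost); /* user kernels, still running when the callback returns */
  if (two_barrier) SimBeginTwoBarrier(&log, &d);
  else SimBeginOneBarrier(&log, &d);
  SimLaunch(&d, lib_cost);  /* the library kernel being timed */
  SimEnd(&log, &d);
  return log;
}
```

With a user cost of 3 and a library cost of 1, the one-barrier run attributes all 4 units to the library kernel, while the two-barrier run attributes 1 unit to the kernel and records the other 3 as time spent waiting at the first barrier.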
