Ah yes, the 10 and the 100 merged. And I am not calling the *GPU timers*, so the Mflop/s is messed up.
And I assume WaitForCUDA blocks on this MPI process's CUDA calls here (one stream per process). It does not block on other MPI processes' CUDA calls.

    err = WaitForCUDA();CHKERRCUDA(err);
    *ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);*
    ierr = PetscLogEventEnd(MAT_CUSPARSEGenerateTranspose,A,0,0,0);CHKERRQ(ierr);

Thanks,

On Tue, Dec 29, 2020 at 1:12 PM Barry Smith <[email protected]> wrote:

>
>   Mark,
>
>   Aside from formatting, are you sure there is an issue? Could it be
> 100 % of the time and 100% of the flops and since there is not room for all
> the digits they end up sitting on top of each other?
>
>   Similarly could the flop rates be overlapped on top of each other? You
> could try adding more digits in the print statement to make room for these
> values.
>
>   *Barry*
>
>
> On Dec 29, 2020, at 8:32 AM, Mark Adams <[email protected]> wrote:
>
> I am seeing this from a GPU kernel. The % flops is messed up and the flop
> rate does not look right:
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> Jac-kernel         13068 1.0 1.6106e+01 1.1 6.13e+13 1.0 0.0e+00 0.0e+00 0.0e+00 *10100* 0  0  0 *10100* 0  0  0 *136983459*   0      0 0.00e+00    0 0.00e+00 100
>
> I use this in landau.cu:
>
>   ierr = PetscLogGpuFlops(flops*nip);CHKERRQ(ierr);
>
> Any idea what is going on here?
>
> Thanks,
> Mark
>
