cuda-memcheck is a valgrind clone, but like valgrind it does not report usage as it goes, only in a report at the end.
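If the goal is only to see usage as a function of time, polling NVML while the application runs is one possible workaround. Below is a minimal sketch, assuming the same nvidia_smi Python bindings that the quoted test script uses. The names nvml_reader and sample_memory are made up for illustration and are not part of any library.

```python
import time


def nvml_reader(device_index=0):
    """Return a callable reporting used device memory in bytes.

    Assumption: the nvidia_smi bindings from the quoted test script are
    installed. The import is lazy so sample_memory works without them.
    """
    import nvidia_smi
    nvidia_smi.nvmlInit()
    handle = nvidia_smi.nvmlDeviceGetHandleByIndex(device_index)
    return lambda: nvidia_smi.nvmlDeviceGetMemoryInfo(handle).used


def sample_memory(read_used, n_samples, interval=0.1,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll read_used() n_samples times.

    Returns a list of (elapsed_seconds, used_bytes) tuples.
    """
    start = clock()
    samples = []
    for i in range(n_samples):
        samples.append((clock() - start, read_used()))
        if i + 1 < n_samples:
            sleep(interval)  # no sleep after the final sample
    return samples
```

Running sample_memory(nvml_reader(), 50) in a background thread while PETSc initializes would give a rough timeline. NVML reports device-wide usage, so other processes on the same GPU will show up in the numbers.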
On Fri, Jan 7, 2022 at 10:23 PM Barry Smith <[email protected]> wrote:

>
> Doesn't Nvidia supply a "valgrind" like tool that will allow tracking
> memory usage? I'm pretty sure I've seen one; it should be able to show
> memory usage as a function of time so you can see where the memory is being
> allocated
>
> Barry
>
>
> On Jan 7, 2022, at 1:56 PM, Jacob Faibussowitsch <[email protected]>
> wrote:
>
> it seems that PETSc consumes 0.73GB CUDA memory and this overhead persists
> across the entire running time of an application. cupm_initialize
> contributes 0.36GB out of 0.73GB.
>
>
> If I had to guess this may be the latent overhead of CUDA streams and
> events, but even then 360 MB seems ludicrous. CUDA maintains a persistent
> pool of streams that is not freed until cudaDeviceReset() is called. Maybe
> they initialize this pool immediately on start-up of the context? AFAIK
> there is no way to disable or modify this behavior.
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
>
> On Jan 7, 2022, at 13:23, Zhang, Hong <[email protected]> wrote:
>
> Apart from the 1.2GB caused by importing torch, it seems that PETSc
> consumes 0.73GB CUDA memory and this overhead persists across the entire
> running time of an application. cupm_initialize contributes 0.36GB out of
> 0.73GB. It is still unclear what takes the remaining 0.37GB.
>
> The torch issue is really a mystery. If I import torch only and do some
> tensor operations on GPU, it consumes only 0.004GB CUDA memory.
>
>
> On Jan 7, 2022, at 11:54 AM, Zhang, Hong via petsc-dev <
> [email protected]> wrote:
>
>
> 1. Commenting out ierr =
> __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
> in device/impls/cupm/cupmcontext.hpp:L199
>
> CUDA memory: 1.575GB
> CUDA memory without importing torch: 0.370GB
>
> This has the same effect as commenting out L437-L440 in
> interface/device.cxx
>
> 2. Comment out these two:
> . src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr =
> _devices[_defaultDevice]->configure();CHKERRQ(ierr);]
> . src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr =
> _devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
>
> CUDA memory: 1.936GB
> CUDA memory without importing torch: 0.730GB
>
> On Jan 7, 2022, at 11:21 AM, Jacob Faibussowitsch <[email protected]>
> wrote:
>
> They had no influence to the memory usage.
>
> Comment out the ierr = _devices[id]->initialize();CHKERRQ(ierr); on line
> 360 in cupmdevice.cxx as well.
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
>
> On Jan 7, 2022, at 12:18, Zhang, Hong <[email protected]> wrote:
>
> I have tried all of these. They had no influence to the memory usage.
>
> On Jan 7, 2022, at 11:15 AM, Jacob Faibussowitsch <[email protected]>
> wrote:
>
> Initializing cutlass and cusolver does not affect the memory usage. I did
> the following to turn them off:
>
>
> Ok next things to try out in order:
>
> 1. src/sys/objects/device/impls/cupm/cupmcontext.hpp:178
> [PetscFunctionBegin;]
> Put a PetscFunctionReturn(0); right after this
>
> 2. src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr =
> _devices[_defaultDevice]->configure();CHKERRQ(ierr);]
> Comment this out
>
> 3. src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr =
> _devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
> Comment this out
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
>
> On Jan 7, 2022, at 12:02, Zhang, Hong <[email protected]> wrote:
>
> Initializing cutlass and cusolver does not affect the memory usage. I did
> the following to turn them off:
>
> diff --git a/src/sys/objects/device/impls/cupm/cupmcontext.hpp b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
> index 51fed809e4d..9a5f068323a 100644
> --- a/src/sys/objects/device/impls/cupm/cupmcontext.hpp
> +++ b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
> @@ -199,7 +199,7 @@ inline PetscErrorCode CUPMContext<T>::setUp(PetscDeviceContext dctx) noexcept
>  #if PetscDefined(USE_DEBUG)
>    dci->timerInUse = PETSC_FALSE;
>  #endif
> -  ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
> +  //ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
>    PetscFunctionReturn(0);
>  }
>
> On Jan 7, 2022, at 10:53 AM, Barry Smith <[email protected]> wrote:
>
>
> I don't think this is right. We want the device initialized by PETSc ,
> we just don't want the cublas and cusolve stuff initialized. In order to
> see how much memory initializing the blas and solvers takes.
>
> So I think you need to comment things in cupminterface.hpp
> like cublasCreate and cusolverDnCreate.
>
> Urgh, I hate C++ where huge chunks of real code are in header files.
>
>
>
> On Jan 7, 2022, at 11:34 AM, Jacob Faibussowitsch <[email protected]>
> wrote:
>
> Hit send too early…
>
> If you don’t want to comment out, you can also run with "-device_enable
> lazy" option. Normally this is the default behavior but if -log_view or
> -log_summary is provided this defaults to “-device_enable eager”.
> See src/sys/objects/device/interface/device.cxx:398
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
>
> On Jan 7, 2022, at 11:29, Jacob Faibussowitsch <[email protected]>
> wrote:
>
> You need to go into the PetscInitialize() routine find where it loads the
> cublas and cusolve and comment out those lines then run with -log_view
>
>
> Comment out
>
> #if (PetscDefined(HAVE_CUDA) || PetscDefined(HAVE_HIP) || PetscDefined(HAVE_SYCL))
>   ierr = PetscDeviceInitializeFromOptions_Internal(PETSC_COMM_WORLD);CHKERRQ(ierr);
> #endif
>
> At src/sys/objects/pinit.c:956
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
>
> On Jan 7, 2022, at 11:24, Barry Smith <[email protected]> wrote:
>
>
> Without log_view it does not load any cuBLAS/cuSolve immediately with
> -log_view it loads all that stuff at startup. You need to go into the
> PetscInitialize() routine find where it loads the cublas and cusolve and
> comment out those lines then run with -log_view
>
>
> On Jan 7, 2022, at 11:14 AM, Zhang, Hong via petsc-dev <
> [email protected]> wrote:
>
> When PETSc is initialized, it takes about 2GB CUDA memory. This is way too
> much for doing nothing. A test script is attached to reproduce the issue.
> If I remove the first line "import torch", PETSc consumes about 0.73GB,
> which is still significant. Does anyone have any idea about this behavior?
>
> Thanks,
> Hong
>
> hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples (caidao22/update-examples)$ python3 test.py
> CUDA memory before PETSc 0.000GB
> CUDA memory after PETSc 0.004GB
> hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples (caidao22/update-examples)$ python3 test.py -log_view :0.txt
> CUDA memory before PETSc 0.000GB
> CUDA memory after PETSc 1.936GB
>
>
> import torch
> import sys
> import os
>
> import nvidia_smi
> nvidia_smi.nvmlInit()
> handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
> info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
> print('CUDA memory before PETSc %.3fGB' % (info.used/1e9))
>
> petsc4py_path = os.path.join(os.environ['PETSC_DIR'],os.environ['PETSC_ARCH'],'lib')
> sys.path.append(petsc4py_path)
> import petsc4py
> petsc4py.init(sys.argv)
> handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
> info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
> print('CUDA memory after PETSc %.3fGB' % (info.used/1e9))
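
The before/after measurement in the script can be wrapped around any single step, which may help when bisecting which initialization call costs what. A rough sketch follows; memory_delta and format_gb are hypothetical helper names, and read_used stands for any callable returning used bytes (for example one built on nvmlDeviceGetMemoryInfo as in the script).

```python
def memory_delta(read_used, action):
    """Run action() and report used memory around it.

    read_used: callable returning used device memory in bytes (assumed).
    Returns (before, after, after - before), all in bytes.
    """
    before = read_used()
    action()
    after = read_used()
    return before, after, after - before


def format_gb(n_bytes):
    """Format a byte count the same way the test script prints it."""
    return '%.3fGB' % (n_bytes / 1e9)
```

For instance, passing a lambda that performs one initialization step as action gives the cost of that step alone, subject to whatever else is using the device at the same time.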
