Re: [petsc-dev] PETSc init eats too much CUDA memory

Zhang, Hong via petsc-dev Sat, 08 Jan 2022 10:36:25 -0800

Here is an interesting thread discussing the memory issue for PyTorch (which I 
think is also relevant to PETSc):


https://github.com/pytorch/pytorch/issues/12873

The memory overhead (for both CPU and GPU) of PyTorch is getting worse and 
worse as it evolves. A conjecture is that the CUDA kernels in the library are 
responsible for this. But the overhead for TensorFlow 2 is just around 300MB 
(compared to 1.5GB for PyTorch).

According to the discussion, there has not been a good way to decrease the 
memory overhead for PyTorch yet. Someone noticed that "removing half of the 
CUDA kernels can reduce the memory usage by half."

Hong

On Jan 7, 2022, at 9:23 PM, Barry Smith wrote:


  Doesn't Nvidia supply a valgrind-like tool that allows tracking memory 
usage? I'm pretty sure I've seen one; it should be able to show memory usage as 
a function of time, so you can see where the memory is being allocated.
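
Short of a dedicated profiler, a similar picture can be approximated by taking labelled readings of used device memory at points of interest and looking at the differences. The sketch below is only an illustration of that idea and is not an NVIDIA tool; the class and method names are hypothetical, and the reader function is an assumption (for example, one that returns `info.used` from the NVML calls in the test script at the end of this thread).

```python
from typing import Callable, List, Tuple


class MemoryCheckpoints:
    """Record labelled readings of used device memory and report deltas.

    `read_used_bytes` is any zero-argument callable returning the bytes
    currently in use (hypothetical hook, e.g. a wrapper around NVML).
    """

    def __init__(self, read_used_bytes: Callable[[], int]) -> None:
        self._read = read_used_bytes
        self._samples: List[Tuple[str, int]] = []

    def mark(self, label: str) -> None:
        # Take one reading and remember it under the given label.
        self._samples.append((label, self._read()))

    def deltas_gb(self) -> List[Tuple[str, float]]:
        # Difference between each reading and the previous one, in GB (1e9 bytes).
        pairs = zip(self._samples, self._samples[1:])
        return [(label, (cur - prev) / 1e9) for (_, prev), (label, cur) in pairs]


# Possible usage, assuming `handle` comes from nvidia_smi as in the test script:
#   tracker = MemoryCheckpoints(
#       lambda: nvidia_smi.nvmlDeviceGetMemoryInfo(handle).used)
#   tracker.mark('start'); petsc4py.init(sys.argv); tracker.mark('after PETSc')
#   print(tracker.deltas_gb())
```

This only sees totals at the marked points, so it cannot attribute memory to individual allocations the way a real tracking tool could.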

  Barry


On Jan 7, 2022, at 1:56 PM, Jacob Faibussowitsch wrote:

it seems that PETSc consumes 0.73GB CUDA memory and this overhead persists 
across the entire running time of an application. cupm_initialize contributes 
0.36GB out of 0.73GB.

If I had to guess, this may be the latent overhead of CUDA streams and events, 
but even then 360 MB seems ludicrous. CUDA maintains a persistent pool of 
streams that is not freed until cudaDeviceReset() is called. Maybe they 
initialize this pool immediately on start-up of the context? AFAIK there is no 
way to disable or modify this behavior.

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 13:23, Zhang, Hong wrote:

Apart from the 1.2GB caused by importing torch, it seems that PETSc consumes 
0.73GB CUDA memory and this overhead persists across the entire running time of 
an application. cupm_initialize contributes 0.36GB out of 0.73GB. It is still 
unclear what takes the remaining 0.37GB.

The torch issue is really a mystery. If I import torch only and do some tensor 
operations on GPU, it consumes only 0.004GB CUDA memory.


On Jan 7, 2022, at 11:54 AM, Zhang, Hong via petsc-dev wrote:


1. Commenting out  ierr = 
__initialize(dctx->device->deviceId,dci);CHKERRQ(ierr); in 
device/impls/cupm/cupmcontext.hpp:L199

CUDA memory: 1.575GB
CUDA memory without importing torch:  0.370GB

This has the same effect as commenting out L437-L440 in interface/device.cxx

2. Comment out these two:
. src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr = 
_devices[_defaultDevice]->configure();CHKERRQ(ierr);]
. src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr = 
_devices[_defaultDevice]->initialize();CHKERRQ(ierr);]

CUDA memory: 1.936GB
CUDA memory without importing torch:   0.730GB
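
The measurements above can be turned into a rough per-component breakdown. The sketch below only does the subtraction on the figures reported in this thread; the function and key names are made up for illustration, and it assumes the contributions simply add up, which nothing in the thread confirms.

```python
def breakdown(total_with_torch, petsc_only, petsc_without_cupm_init):
    """Split the reported CUDA memory (GB) into rough per-component shares.

    Assumption: the contributions are additive.
    """
    return {
        # Difference between the runs with and without "import torch".
        'torch': total_with_torch - petsc_only,
        # Difference made by commenting out __initialize (no torch).
        'cupm_initialize': petsc_only - petsc_without_cupm_init,
        # What is still in use with __initialize commented out.
        'unexplained': petsc_without_cupm_init,
    }


# Figures from the runs above (GB).
shares = breakdown(1.936, 0.730, 0.370)
for name, gb in shares.items():
    print('%-16s %.3fGB' % (name, gb))
```

The same subtraction on the runs that keep torch imported (1.936GB versus 1.575GB) gives nearly the same cupm_initialize share, so the two sets of measurements are consistent with each other.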

On Jan 7, 2022, at 11:21 AM, Jacob Faibussowitsch wrote:

They had no influence on the memory usage.

Comment out the ierr = _devices[id]->initialize();CHKERRQ(ierr); on line 360 in 
cupmdevice.cxx as well.

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 12:18, Zhang, Hong wrote:

I have tried all of these. They had no influence on the memory usage.

On Jan 7, 2022, at 11:15 AM, Jacob Faibussowitsch wrote:

Initializing cuBLAS and cuSOLVER does not affect the memory usage. I did the 
following to turn them off:

Ok next things to try out in order:

1. src/sys/objects/device/impls/cupm/cupmcontext.hpp:178 [PetscFunctionBegin;]
Put a PetscFunctionReturn(0); right after this

2. src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr = 
_devices[_defaultDevice]->configure();CHKERRQ(ierr);]
Comment this out

3. src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr = 
_devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
Comment this out

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 12:02, Zhang, Hong wrote:

Initializing cuBLAS and cuSOLVER does not affect the memory usage. I did the 
following to turn them off:

diff --git a/src/sys/objects/device/impls/cupm/cupmcontext.hpp b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
index 51fed809e4d..9a5f068323a 100644
--- a/src/sys/objects/device/impls/cupm/cupmcontext.hpp
+++ b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
@@ -199,7 +199,7 @@ inline PetscErrorCode CUPMContext<T>::setUp(PetscDeviceContext dctx) noexcept
 #if PetscDefined(USE_DEBUG)
   dci->timerInUse = PETSC_FALSE;
 #endif
-  ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
+  //ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
   PetscFunctionReturn(0);
 }

On Jan 7, 2022, at 10:53 AM, Barry Smith wrote:


  I don't think this is right. We want the device initialized by PETSc; we 
just don't want the cuBLAS and cuSOLVER stuff initialized, in order to see how 
much memory initializing the BLAS and solvers takes.

  So I think you need to comment out things in cupminterface.hpp like 
cublasCreate and cusolverDnCreate.

  Urgh, I hate C++ where huge chunks of real code are in header files.



On Jan 7, 2022, at 11:34 AM, Jacob Faibussowitsch wrote:

Hit send too early…

If you don't want to comment anything out, you can also run with the 
"-device_enable lazy" option. Normally this is the default behavior, but if 
-log_view or -log_summary is provided, the default becomes 
"-device_enable eager". See src/sys/objects/device/interface/device.cxx:398
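
For the petsc4py test script in this thread, the option can be passed through the argument list given to petsc4py.init(). A minimal sketch, assuming the option is spelled exactly as above; the helper name force_lazy_device is hypothetical and not part of PETSc or petsc4py:

```python
def force_lazy_device(argv):
    """Return a copy of argv with '-device_enable lazy' appended.

    An explicit -device_enable already present in argv is left untouched.
    """
    args = list(argv)
    if '-device_enable' not in args:
        args += ['-device_enable', 'lazy']
    return args


# Possible usage in the test script (not executed here):
#   import sys, petsc4py
#   petsc4py.init(force_lazy_device(sys.argv))
print(force_lazy_device(['test.py', '-log_view', ':0.txt']))
```

Whether lazy initialization actually lowers the reported memory with -log_view is something to measure, not something this sketch shows.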

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 11:29, Jacob Faibussowitsch wrote:

You need to go into the PetscInitialize() routine, find where it loads 
cuBLAS and cuSOLVER, comment out those lines, and then run with -log_view

Comment out

#if (PetscDefined(HAVE_CUDA) || PetscDefined(HAVE_HIP) || PetscDefined(HAVE_SYCL))
  ierr = PetscDeviceInitializeFromOptions_Internal(PETSC_COMM_WORLD);CHKERRQ(ierr);
#endif

At src/sys/objects/pinit.c:956

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 11:24, Barry Smith wrote:


Without -log_view it does not load any cuBLAS/cuSOLVER immediately; with 
-log_view it loads all that stuff at startup. You need to go into the 
PetscInitialize() routine, find where it loads cuBLAS and cuSOLVER, comment 
out those lines, and then run with -log_view.


On Jan 7, 2022, at 11:14 AM, Zhang, Hong via petsc-dev wrote:

When PETSc is initialized, it takes about 2GB CUDA memory. This is way too much 
for doing nothing. A test script is attached to reproduce the issue. If I 
remove the first line "import torch", PETSc consumes about 0.73GB, which is 
still significant. Does anyone have any idea about this behavior?

Thanks,
Hong


hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples 
(caidao22/update-examples)$ python3 test.py
CUDA memory before PETSc 0.000GB
CUDA memory after PETSc 0.004GB
hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples 
(caidao22/update-examples)$ python3 test.py -log_view :0.txt
CUDA memory before PETSc 0.000GB
CUDA memory after PETSc 1.936GB


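import torch  # remove this line to measure PETSc's footprint alone
import sys
import os

import nvidia_smi

# Query used memory on GPU 0 through NVML before PETSc is initialized.
nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print('CUDA memory before PETSc %.3fGB' % (info.used/1e9))

# Make the petsc4py built with PETSc importable, then initialize PETSc,
# forwarding command-line options such as -log_view.
petsc4py_path = os.path.join(os.environ['PETSC_DIR'],os.environ['PETSC_ARCH'],'lib')
sys.path.append(petsc4py_path)
import petsc4py
petsc4py.init(sys.argv)

# Query used memory again after PETSc initialization.
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print('CUDA memory after PETSc %.3fGB' % (info.used/1e9))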
