PETSc could check whether the environment variable CUDA_VISIBLE_DEVICES is set to -1, if that makes sense as a way to resolve the situation.
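
Something along these lines, perhaps. This is only a rough sketch of the check itself: the two helper names are hypothetical (they are not existing PETSc functions), and where exactly it would be called from in the device initialization code, and what PETSc should do when it returns true (warn, or silently skip the device), is left open.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not part of PETSc): returns true when the given value
// of CUDA_VISIBLE_DEVICES is the "-1" setting discussed in this thread.
bool ValueHidesAllCudaDevices(const char *value)
{
  assert(value != nullptr); // the unset case is handled by the caller below
  return std::strcmp(value, "-1") == 0;
}

// Hypothetical helper (not part of PETSc): reads the environment and treats
// an unset variable as "devices are visible".
bool UserHidAllCudaDevices()
{
  const char *value = std::getenv("CUDA_VISIBLE_DEVICES");
  return value != nullptr && ValueHidesAllCudaDevices(value);
}
```

The idea would be to consult UserHidAllCudaDevices() before treating cudaErrorNoDevice as fatal, so that a deliberate CUDA_VISIBLE_DEVICES=-1 is distinguished from a machine that unexpectedly has no usable device.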

> On Nov 1, 2021, at 11:43 AM, Jacob Faibussowitsch <[email protected]> wrote:
>
> Looks like you are tripping up the following:
>
> cerr = cupmGetDeviceCount(&ndev);
> if (PetscUnlikely(cerr == cupmErrorStubLibrary)) {
>   … // handle missing driver or stub library
> } else {CHKERRCUPM(cerr);} // your error here
>
> Is it an error if a user configures with cuda (i.e. signals intent to use
> cuda) but disables all the devices? On the one hand, yes this can be
> considered an error if the user inadvertently disables the devices via this
> environment variable without knowing, but on the other hand they should be
> able to freely set this variable without petsc crashing… Should we warn
> users? Handle this silently?
>
> Note that petsc does provide '-device_enable none' option to disable all
> devices, or if you only want to disable cuda devices
> '-device_enable_cuda none' which should achieve the same effect as
> CUDA_VISIBLE_DEVICES=-1. But maybe it is too obscure to ask users to know
> about and use these flags instead of setting the cuda env variables.
> (Btw, can you test that using '-device_enable_cuda none' does not crash
> when setting CUDA_VISIBLE_DEVICES=-1?)
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
>
>> On Nov 1, 2021, at 10:09, Stefano Zampini <[email protected]> wrote:
>>
>> Just found out that if we configure with cuda and then want to run on CPU
>> only using CUDA_VISIBLE_DEVICES=-1 PETSc errors out. Is this intended
>> behavior? I supposed it should work
>> This is with main
>>
>> (ecrcml-cuda) zampins@qaysar:~/miniforge/Devel/petsc$ make check
>> Running check examples to verify correct installation
>> Using PETSC_DIR=/home/zampins/miniforge/Devel/petsc and PETSC_ARCH=arch-ecrcml-cuda-double
>> C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
>> C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
>> C/C++ example src/snes/tutorials/ex19 run successfully with cuda
>> Completed test examples
>>
>> (ecrcml-cuda) zampins@qaysar:~/miniforge/Devel/petsc$ make check CUDA_VISIBLE_DEVICES=1
>> Running check examples to verify correct installation
>> Using PETSC_DIR=/home/zampins/miniforge/Devel/petsc and PETSC_ARCH=arch-ecrcml-cuda-double
>> C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
>> C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
>> C/C++ example src/snes/tutorials/ex19 run successfully with cuda
>> Completed test examples
>>
>> (ecrcml-cuda) zampins@qaysar:~/miniforge/Devel/petsc$ make check CUDA_VISIBLE_DEVICES=-1
>> Running check examples to verify correct installation
>> Using PETSC_DIR=/home/zampins/miniforge/Devel/petsc and PETSC_ARCH=arch-ecrcml-cuda-double
>> Possible error running C/C++ src/snes/tutorials/ex19 with 1 MPI process
>> See http://www.mcs.anl.gov/petsc/documentation/faq.html
>> [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>> [0]PETSC ERROR: GPU error
>> [0]PETSC ERROR: cuda error 100 (cudaErrorNoDevice) : no CUDA-capable device is detected
>> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
>> [0]PETSC ERROR: Petsc Development GIT revision: v3.16.0-368-g72b201b202 GIT Date: 2021-10-29 14:48:19 +0300
>> [0]PETSC ERROR: ./ex19 on a arch-ecrcml-cuda-double named qaysar.kaust.edu.sa by zampins Mon Nov 1 18:06:12 2021
>> [0]PETSC ERROR: Configure options --with-blaslapack-include=/home/zampins/miniforge/envs/ecrcml-cuda/include --with-blaslapack-lib=/home/zampins/miniforge/envs/ecrcml-cuda/lib/libmkl_rt.so --download-h2opus --with-cuda --with-kblas-dir=/home/zampins/miniforge/envs/ecrcml-cuda --with-magma-dir=/home/zampins/miniforge/envs/ecrcml-cuda --LDFLAGS=/usr/lib/x86_64-linux-gnu/libcuda.so --with-debugging=1 --with-openmp --with-precision=double --with-fc=0 PETSC_ARCH=arch-ecrcml-cuda-double PETSC_DIR=/home/zampins/miniforge/Devel/petsc
>> [0]PETSC ERROR: #1 initialize() at /home/zampins/miniforge/Devel/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:302
>> [0]PETSC ERROR: #2 PetscDeviceInitializeTypeFromOptions_Private() at /home/zampins/miniforge/Devel/petsc/src/sys/objects/device/interface/device.cxx:292
>> [0]PETSC ERROR: #3 PetscDeviceInitializeFromOptions_Internal() at /home/zampins/miniforge/Devel/petsc/src/sys/objects/device/interface/device.cxx:417
>> [0]PETSC ERROR: #4 PetscInitialize_Common() at /home/zampins/miniforge/Devel/petsc/src/sys/objects/pinit.c:956
>> [0]PETSC ERROR: #5 PetscInitialize() at /home/zampins/miniforge/Devel/petsc/src/sys/objects/pinit.c:1231
>> --------------------------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>>
>> [
>
