PETSc could check whether the environment variable is set to
CUDA_VISIBLE_DEVICES=-1, if that is a sensible way to resolve the situation.
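
As a rough sketch of what such a check might look like (plain C, not actual PETSc code; both helper names are hypothetical, and treating a negative first entry as "all devices hidden" is an assumption based on the CUDA rule that only devices listed before an invalid index stay visible):

```c
#include <stdlib.h>

/* Hypothetical helper, not part of PETSc. Returns 1 if the given
   CUDA_VISIBLE_DEVICES value appears to hide every device, 0 otherwise.
   Assumption: a negative first entry (e.g. "-1") hides all devices. */
int cuda_devices_hidden(const char *value)
{
  char *end;
  long  first;

  if (!value) return 0;            /* variable unset: nothing is hidden */
  first = strtol(value, &end, 10); /* parse the first list entry */
  if (end == value) return 0;      /* not an integer (e.g. a GPU UUID) */
  return first < 0;
}

/* Hypothetical wrapper that reads the real environment. */
int cuda_devices_hidden_by_env(void)
{
  return cuda_devices_hidden(getenv("CUDA_VISIBLE_DEVICES"));
}
```

Device initialization could call something like cuda_devices_hidden_by_env() before querying the device count and skip CUDA setup when it returns 1, rather than raising the error.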



> On Nov 1, 2021, at 11:43 AM, Jacob Faibussowitsch <[email protected]> wrote:
> 
> Looks like you are tripping up the following:
> 
> cerr = cupmGetDeviceCount(&ndev);
> if (PetscUnlikely(cerr == cupmErrorStubLibrary)) {
>   … // handle missing driver or stub library
> } else {CHKERRCUPM(cerr);} // your error here
> 
> Is it an error if a user configures with cuda (i.e. signals intent to use 
> cuda) but disables all the devices? On the one hand, yes this can be 
> considered an error if the user inadvertently disables the devices via this 
> environment variable without knowing, but on the other hand they should be 
> able to freely set this variable without petsc crashing… Should we warn 
> users? Handle this silently?
> 
> Note that petsc does provide the '-device_enable none' option to disable all 
> devices, or if you only want to disable cuda devices, '-device_enable_cuda 
> none', which should achieve the same effect as CUDA_VISIBLE_DEVICES=-1. But 
> maybe it is too obscure to ask users to know about and use these flags 
> instead of setting the cuda env variables. (Btw, can you test that using 
> '-device_enable_cuda none' does not crash when setting 
> CUDA_VISIBLE_DEVICES=-1?)
> 
> Best regards,
> 
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
> 
>> On Nov 1, 2021, at 10:09, Stefano Zampini <[email protected]> wrote:
>> 
>> Just found out that if we configure with cuda and then want to run on CPU 
>> only using CUDA_VISIBLE_DEVICES=-1, PETSc errors out. Is this intended 
>> behavior? I would have expected it to work.
>> This is with main.
>> 
>> (ecrcml-cuda) zampins@qaysar:~/miniforge/Devel/petsc$ make check
>> Running check examples to verify correct installation
>> Using PETSC_DIR=/home/zampins/miniforge/Devel/petsc and 
>> PETSC_ARCH=arch-ecrcml-cuda-double
>> C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
>> C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
>> C/C++ example src/snes/tutorials/ex19 run successfully with cuda
>> Completed test examples
>> 
>> (ecrcml-cuda) zampins@qaysar:~/miniforge/Devel/petsc$ make check 
>> CUDA_VISIBLE_DEVICES=1
>> Running check examples to verify correct installation
>> Using PETSC_DIR=/home/zampins/miniforge/Devel/petsc and 
>> PETSC_ARCH=arch-ecrcml-cuda-double
>> C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
>> C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
>> C/C++ example src/snes/tutorials/ex19 run successfully with cuda
>> Completed test examples
>> 
>> (ecrcml-cuda) zampins@qaysar:~/miniforge/Devel/petsc$ make check 
>> CUDA_VISIBLE_DEVICES=-1
>> Running check examples to verify correct installation
>> Using PETSC_DIR=/home/zampins/miniforge/Devel/petsc and 
>> PETSC_ARCH=arch-ecrcml-cuda-double
>> Possible error running C/C++ src/snes/tutorials/ex19 with 1 MPI process
>> See http://www.mcs.anl.gov/petsc/documentation/faq.html
>> [0]PETSC ERROR: --------------------- Error Message 
>> --------------------------------------------------------------
>> [0]PETSC ERROR: GPU error 
>> [0]PETSC ERROR: cuda error 100 (cudaErrorNoDevice) : no CUDA-capable device 
>> is detected
>> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
>> [0]PETSC ERROR: Petsc Development GIT revision: v3.16.0-368-g72b201b202  GIT 
>> Date: 2021-10-29 14:48:19 +0300
>> [0]PETSC ERROR: ./ex19 on a arch-ecrcml-cuda-double named 
>> qaysar.kaust.edu.sa by zampins Mon Nov  1 18:06:12 2021
>> [0]PETSC ERROR: Configure options 
>> --with-blaslapack-include=/home/zampins/miniforge/envs/ecrcml-cuda/include 
>> --with-blaslapack-lib=/home/zampins/miniforge/envs/ecrcml-cuda/lib/libmkl_rt.so
>>  --download-h2opus --with-cuda 
>> --with-kblas-dir=/home/zampins/miniforge/envs/ecrcml-cuda 
>> --with-magma-dir=/home/zampins/miniforge/envs/ecrcml-cuda 
>> --LDFLAGS=/usr/lib/x86_64-linux-gnu/libcuda.so --with-debugging=1 
>> --with-openmp --with-precision=double --with-fc=0 
>> PETSC_ARCH=arch-ecrcml-cuda-double 
>> PETSC_DIR=/home/zampins/miniforge/Devel/petsc
>> [0]PETSC ERROR: #1 initialize() at 
>> /home/zampins/miniforge/Devel/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:302
>> [0]PETSC ERROR: #2 PetscDeviceInitializeTypeFromOptions_Private() at 
>> /home/zampins/miniforge/Devel/petsc/src/sys/objects/device/interface/device.cxx:292
>> [0]PETSC ERROR: #3 PetscDeviceInitializeFromOptions_Internal() at 
>> /home/zampins/miniforge/Devel/petsc/src/sys/objects/device/interface/device.cxx:417
>> [0]PETSC ERROR: #4 PetscInitialize_Common() at 
>> /home/zampins/miniforge/Devel/petsc/src/sys/objects/pinit.c:956
>> [0]PETSC ERROR: #5 PetscInitialize() at 
>> /home/zampins/miniforge/Devel/petsc/src/sys/objects/pinit.c:1231
>> --------------------------------------------------------------------------
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> 
>> [
> 
