> if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always
> starts from zero. So this is NOT the index of the GPU.
Thanks. Just FYI, when I tested the environment variables with Slurm 19.05.2 + a proctrack/cgroup configuration, it looks like CUDA_VISIBLE_DEVICES matches the indices of the host devices (i.e. it does not start from zero). I'm not sure whether the behavior has changed in newer Slurm versions, though. I also found that SLURM_JOB_GPUS and GPU_DEVICE_ORDINAL were set in the environment, which can be useful. In my current tests, those variables held the same values as CUDA_VISIBLE_DEVICES. Any advice on what else I should look at is always welcome.

Best,
Kota

> -----Original Message-----
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Marcus Wagner
> Sent: Tuesday, June 16, 2020 9:17 PM
> To: slurm-users@lists.schedmd.com
> Subject: Re: [slurm-users] How to view GPU indices of the completed jobs?
>
> Hi David,
>
> if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always
> starts from zero. So this is NOT the index of the GPU.
>
> Just verified it:
> $> nvidia-smi
> Tue Jun 16 13:28:47 2020
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
> ...
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID   Type   Process name                             Usage      |
> |=============================================================================|
> |    0     17269      C   gmx_mpi                                      679MiB |
> |    1     19246      C   gmx_mpi                                      513MiB |
> +-----------------------------------------------------------------------------+
>
> $> squeue -w nrg04
>     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>  14560009  c18g_low     egf5 bk449967  R 1-00:17:48      1 nrg04
>  14560005  c18g_low     egf1 bk449967  R 1-00:20:23      1 nrg04
>
> $> scontrol show job -d 14560005
> ...
>    Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
>      Nodes=nrg04 CPU_IDs=0-23 Mem=93600 GRES_IDX=gpu(IDX:0)
>
> $> scontrol show job -d 14560009
> JobId=14560009 JobName=egf5
> ...
>    Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
>      Nodes=nrg04 CPU_IDs=24-47 Mem=93600 GRES_IDX=gpu(IDX:1)
>
> From the PIDs from the nvidia-smi output:
>
> $> xargs --null --max-args=1 echo < /proc/17269/environ | grep CUDA_VISIBLE
> CUDA_VISIBLE_DEVICES=0
>
> $> xargs --null --max-args=1 echo < /proc/19246/environ | grep CUDA_VISIBLE
> CUDA_VISIBLE_DEVICES=0
>
> So this is only a way to see how MANY devices were used, not which.
>
> Best
> Marcus
>
> Am 10.06.2020 um 20:49 schrieb David Braun:
> > Hi Kota,
> >
> > This is from the job template that I give to my users:
> >
> > # Collect some information about the execution environment that may
> > # be useful should we need to do some debugging.
> >
> > echo "CREATING DEBUG DIRECTORY"
> > echo
> >
> > mkdir .debug_info
> > module list > .debug_info/environ_modules 2>&1
> > ulimit -a > .debug_info/limits 2>&1
> > hostname > .debug_info/environ_hostname 2>&1
> > env |grep SLURM > .debug_info/environ_slurm 2>&1
> > env |grep OMP |grep -v OMPI > .debug_info/environ_omp 2>&1
> > env |grep OMPI > .debug_info/environ_openmpi 2>&1
> > env > .debug_info/environ 2>&1
> >
> > if [ ! -z ${CUDA_VISIBLE_DEVICES+x} ]; then
> >     echo "SAVING CUDA ENVIRONMENT"
> >     echo
> >     env |grep CUDA > .debug_info/environ_cuda 2>&1
> > fi
> >
> > You could add something like this to one of the SLURM prologs to save
> > the GPU list of jobs.
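A minimal sketch of what such a prolog snippet could look like, assuming the Prolog runs with permission to write to a site-specific log directory (the path below is illustrative, not a Slurm default):

```bash
#!/bin/bash
# Hypothetical prolog sketch (not from the thread): persist the GPU indices
# of every job so they can still be looked up after the job has completed.
# /var/log/slurm/gpu_alloc is an assumed, site-specific directory.

LOGDIR=/var/log/slurm/gpu_alloc
mkdir -p "$LOGDIR"

# While the job is still known to slurmctld, the detailed job record
# contains the host-side GPU indices (the GRES_IDX / IDX field shown above).
scontrol show job -d "$SLURM_JOB_ID" \
    | grep -iE 'Nodes=|GRES' \
    > "$LOGDIR/job_${SLURM_JOB_ID}.gres" 2>&1
```

Since each file is keyed by job ID, the recorded GRES_IDX line can be grepped long after the job has left the queue.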
> > Best,
> >
> > David
> >
> > On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki
> > <kota.tsuyuzaki...@hco.ntt.co.jp> wrote:
> >
> >     Hello Guys,
> >
> >     We are running GPU clusters with Slurm and SlurmDBD (version 19.05
> >     series) and some GPUs seemed to run into trouble with the jobs
> >     attached to them. To investigate whether the trouble happened on the
> >     same GPUs, I'd like to get the GPU indices of completed jobs.
> >
> >     In my understanding, `scontrol show job` can show the indices (as IDX
> >     in the gres info) but cannot be used for completed jobs. And
> >     `sacct -j` is available for completed jobs but won't print the indices.
> >
> >     Is there any way (commands, configurations, etc.) to see the
> >     allocated GPU indices of completed jobs?
> >
> >     Best regards,
> >
> >     --------------------------------------------
> >     露崎 浩太 (Kota Tsuyuzaki)
> >     kota.tsuyuzaki...@hco.ntt.co.jp
> >     NTTソフトウェアイノベーションセンタ
> >     分散処理基盤技術プロジェクト
> >     0422-59-2837
> >     ---------------------------------------------
>
> --
> Dipl.-Inf. Marcus Wagner
>
> IT Center
> Gruppe: Systemgruppe Linux
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wag...@itc.rwth-aachen.de
> www.itc.rwth-aachen.de
>
> Social Media Kanäle des IT Centers:
> https://blog.rwth-aachen.de/itc/
> https://www.facebook.com/itcenterrwth
> https://www.linkedin.com/company/itcenterrwth
> https://twitter.com/ITCenterRWTH
> https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ
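Relatedly, a small sketch of how the variables mentioned at the top of the thread (SLURM_JOB_GPUS, GPU_DEVICE_ORDINAL, CUDA_VISIBLE_DEVICES) could be captured from inside a job script for later inspection; which of them are actually set, and whether they hold host-side or cgroup-relative indices, depends on the Slurm version and the gres/cgroup configuration:

```bash
#!/bin/bash
#SBATCH --gres=gpu:1

# Record the GPU-related environment of this job in a per-job file so it can
# be checked after the job completes; variable availability is configuration-
# and version-dependent, so missing values are recorded as "unset".
{
    echo "SLURM_JOB_ID=${SLURM_JOB_ID}"
    echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-unset}"
    echo "GPU_DEVICE_ORDINAL=${GPU_DEVICE_ORDINAL:-unset}"
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
} > "gpu_env_${SLURM_JOB_ID}.log"
```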