Hi Taras,

no, we have set ConstrainDevices to "yes", and that is exactly why CUDA_VISIBLE_DEVICES starts from zero. Otherwise the two jobs mentioned below would have ended up on one GPU. But as nvidia-smi clearly shows (I did not include the output this time, see my earlier post), both GPUs are in use, while the environment of both jobs contains CUDA_VISIBLE_DEVICES=0.

Kota, could it be that you did not configure ConstrainDevices in cgroup.conf? According to the man page the default is "no". In that case a user could set CUDA_VISIBLE_DEVICES in their job and thereby use GPUs they did not request.
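For comparison, device constraining in cgroup.conf looks roughly like this (a minimal sketch; only ConstrainDevices=yes is what we actually discussed here, the other lines are just common settings shown for context, not necessarily your exact configuration):

  CgroupAutomount=yes
  ConstrainCores=yes
  ConstrainRAMSpace=yes
  ConstrainDevices=yes    # jobs only see the GPU device files they requested

Note that the device constraint is enforced by the task/cgroup plugin, so slurm.conf also needs task/cgroup in TaskPlugin for this to take effect.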
Best
Marcus

On 23.06.2020 at 15:41, Taras Shapovalov wrote:
Hi Marcus,

This may depend on ConstrainDevices in cgroup.conf. I guess it is set to "no" in your case.

Best regards,
Taras

On Tue, Jun 23, 2020 at 4:02 PM Marcus Wagner <wag...@itc.rwth-aachen.de> wrote:

Hi Kota,

thanks for the hint. Yet I'm still a little bit astonished: if I remember right, CUDA_VISIBLE_DEVICES in a cgroup always starts from zero. That was already the case years ago, when we still used LSF. But SLURM_JOB_GPUS seems to be the right thing. Same node, two different users (and therefore jobs):

$> xargs --null --max-args=1 echo < /proc/32719/environ | egrep "GPU|CUDA"
SLURM_JOB_GPUS=0
CUDA_VISIBLE_DEVICES=0
GPU_DEVICE_ORDINAL=0

$> xargs --null --max-args=1 echo < /proc/109479/environ | egrep "GPU|CUDA"
SLURM_MEM_PER_GPU=6144
SLURM_JOB_GPUS=1
CUDA_VISIBLE_DEVICES=0
GPU_DEVICE_ORDINAL=0
CUDA_ROOT=/usr/local_rwth/sw/cuda/10.1.243
CUDA_PATH=/usr/local_rwth/sw/cuda/10.1.243
CUDA_VERSION=101

SLURM_JOB_GPUS differs, and so does the GRES index:

$> scontrol show -d job 14658274
...
   Nodes=nrg02 CPU_IDs=24 Mem=8192 GRES_IDX=gpu:volta(IDX:1)

$> scontrol show -d job 14673550
...
   Nodes=nrg02 CPU_IDs=0 Mem=8192 GRES_IDX=gpu:volta(IDX:0)

Is there anyone out there who can confirm this besides me?

Best
Marcus

On 23.06.2020 at 04:51, Kota Tsuyuzaki wrote:
>> if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always
>> starts from zero. So this is NOT the index of the GPU.
>
> Thanks. Just FYI: when I tested the environment variables with Slurm 19.05.2 and a proctrack/cgroup configuration, it looked like CUDA_VISIBLE_DEVICES matches the indices of the host devices (i.e. it does not start from zero). I'm not sure whether the behavior has changed in newer Slurm versions, though.
>
> I also found that SLURM_JOB_GPUS and GPU_DEVICE_ORDINAL are set in the environment, which can be useful. In my current tests, those variables had the same values as CUDA_VISIBLE_DEVICES.
>
> Any advice on what I should look for is always welcome.
>
> Best,
> Kota
>
>> -----Original Message-----
>> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Marcus Wagner
>> Sent: Tuesday, June 16, 2020 9:17 PM
>> To: slurm-users@lists.schedmd.com
>> Subject: Re: [slurm-users] How to view GPU indices of the completed jobs?
>>
>> Hi David,
>>
>> if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always
>> starts from zero. So this is NOT the index of the GPU.
>>
>> Just verified it:
>> $> nvidia-smi
>> Tue Jun 16 13:28:47 2020
>> +-----------------------------------------------------------------------------+
>> | NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
>> ...
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                       GPU Memory |
>> |  GPU       PID   Type   Process name                             Usage      |
>> |=============================================================================|
>> |    0     17269      C   gmx_mpi                                      679MiB |
>> |    1     19246      C   gmx_mpi                                      513MiB |
>> +-----------------------------------------------------------------------------+
>>
>> $> squeue -w nrg04
>>      JOBID PARTITION  NAME     USER ST       TIME  NODES NODELIST(REASON)
>>   14560009  c18g_low  egf5 bk449967  R 1-00:17:48      1 nrg04
>>   14560005  c18g_low  egf1 bk449967  R 1-00:20:23      1 nrg04
>>
>> $> scontrol show job -d 14560005
>> ...
>>    Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
>>      Nodes=nrg04 CPU_IDs=0-23 Mem=93600 GRES_IDX=gpu(IDX:0)
>>
>> $> scontrol show job -d 14560009
>> JobId=14560009 JobName=egf5
>> ...
>>    Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
>>      Nodes=nrg04 CPU_IDs=24-47 Mem=93600 GRES_IDX=gpu(IDX:1)
>>
>> From the PIDs in the nvidia-smi output:
>>
>> $> xargs --null --max-args=1 echo < /proc/17269/environ | grep CUDA_VISIBLE
>> CUDA_VISIBLE_DEVICES=0
>>
>> $> xargs --null --max-args=1 echo < /proc/19246/environ | grep CUDA_VISIBLE
>> CUDA_VISIBLE_DEVICES=0
>>
>> So this is only a way to see how MANY devices were used, not which.
>>
>> Best
>> Marcus
>>
>> On 10.06.2020 at 20:49, David Braun wrote:
>>> Hi Kota,
>>>
>>> This is from the job template that I give to my users:
>>>
>>> # Collect some information about the execution environment that may
>>> # be useful should we need to do some debugging.
>>>
>>> echo "CREATING DEBUG DIRECTORY"
>>> echo
>>>
>>> mkdir .debug_info
>>> module list > .debug_info/environ_modules 2>&1
>>> ulimit -a > .debug_info/limits 2>&1
>>> hostname > .debug_info/environ_hostname 2>&1
>>> env | grep SLURM > .debug_info/environ_slurm 2>&1
>>> env | grep OMP | grep -v OMPI > .debug_info/environ_omp 2>&1
>>> env | grep OMPI > .debug_info/environ_openmpi 2>&1
>>> env > .debug_info/environ 2>&1
>>>
>>> if [ ! -z ${CUDA_VISIBLE_DEVICES+x} ]; then
>>>     echo "SAVING CUDA ENVIRONMENT"
>>>     echo
>>>     env | grep CUDA > .debug_info/environ_cuda 2>&1
>>> fi
>>>
>>> You could add something like this to one of the SLURM prologs to save
>>> the GPU list of jobs.
>>>
>>> Best,
>>>
>>> David
>>>
>>> On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki <kota.tsuyuzaki...@hco.ntt.co.jp> wrote:
>>>
>>> Hello Guys,
>>>
>>> We are running GPU clusters with Slurm and SlurmDBD (version 19.05 series), and some of the GPUs seem to run into trouble with the jobs attached to them. To investigate whether the trouble occurs on the same GPUs, I'd like to get the GPU indices of completed jobs.
>>>
>>> In my understanding, `scontrol show job` can show the indices (as IDX in the gres info) but cannot be used for completed jobs. `sacct -j` works for completed jobs but won't print the indices.
>>>
>>> Is there any way (commands, configurations, etc.) to see the allocated GPU indices for completed jobs?
>>>
>>> Best regards,
>>>
>>> --------------------------------------------
>>> Kota Tsuyuzaki (露崎 浩太)
>>> kota.tsuyuzaki...@hco.ntt.co.jp
>>> NTT Software Innovation Center
>>> Distributed Processing Platform Technology Project
>>> 0422-59-2837
>>> ---------------------------------------------
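Coming back to Kota's original question: along the lines of David's suggestion, a node prolog could simply log the allocated GPU indices per job, so they can still be looked up after the job has completed. A rough, untested sketch (the log path is made up, and whether SLURM_JOB_GPUS is exported to the prolog environment may depend on your Slurm version, so treat that as an assumption):

  #!/bin/bash
  # Sketch of a prolog snippet, not a tested production script.
  # Records which GPU indices a job was allocated on this node, so the
  # job-to-GPU mapping can still be looked up after the job has finished.
  LOGFILE=/var/log/slurm/job_gpus.log   # assumed path, adjust to your site

  if [ -n "$SLURM_JOB_GPUS" ]; then     # assumes slurmd exports this to the prolog
      echo "$(date -Is) job=$SLURM_JOB_ID node=$(hostname -s) gpus=$SLURM_JOB_GPUS" >> "$LOGFILE"
  fi

  exit 0

A grep on that file then gives the indices for a finished job, which sacct alone does not provide.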
--
Dipl.-Inf. Marcus Wagner

IT Center
Group: Systemgruppe Linux
Department: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

Social media channels of the IT Center:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ