Hi Kota,

This is from the job template that I give to my users:
# Collect some information about the execution environment that may
# be useful should we need to do some debugging.
echo "CREATING DEBUG DIRECTORY"
echo
mkdir .debug_info
module list > .debug_info/environ_modules 2>&1
ulimit -a > .debug_info/limits 2>&1
hostname > .debug_info/environ_hostname 2>&1
env | grep SLURM > .debug_info/environ_slurm 2>&1
env | grep OMP | grep -v OMPI > .debug_info/environ_omp 2>&1
env | grep OMPI > .debug_info/environ_openmpi 2>&1
env > .debug_info/environ 2>&1

if [ ! -z ${CUDA_VISIBLE_DEVICES+x} ]; then
    echo "SAVING CUDA ENVIRONMENT"
    echo
    env | grep CUDA > .debug_info/environ_cuda 2>&1
fi

You could add something like this to one of the SLURM prologs to save the GPU list for each job (a rough prolog sketch is appended below the quoted message).

Best,
David

On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki <kota.tsuyuzaki...@hco.ntt.co.jp> wrote:
> Hello Guys,
>
> We are running GPU clusters with Slurm and SlurmDBD (version 19.05 series),
> and some of the GPUs seem to run into trouble with the jobs attached to
> them. To investigate whether the trouble keeps happening on the same GPUs,
> I'd like to get the GPU indices of completed jobs.
>
> In my understanding, `scontrol show job` can show the indices (as IDX in
> the gres info) but cannot be used for completed jobs, and `sacct -j` is
> available for completed jobs but won't print the indices.
>
> Is there any way (commands, configurations, etc.) to see the allocated
> GPU indices for completed jobs?
>
> Best regards,
>
> --------------------------------------------
> 露崎 浩太 (Kota Tsuyuzaki)
> kota.tsuyuzaki...@hco.ntt.co.jp
> NTT Software Innovation Center
> Distributed Computing Technology Project
> 0422-59-2837
> ---------------------------------------------
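
P.S. Here is a minimal sketch of what such a prolog could look like. It assumes the script is installed as the slurmd Prolog (Prolog= in slurm.conf) and that SLURM_JOB_GPUS is set in the prolog environment when the job was allocated GPUs; the log file path is only an example, so adjust it to suit your site.

#!/bin/bash
# Prolog sketch: append the GPU indices assigned to each job to a
# per-node log file.  SLURM_JOB_GPUS is expected to hold the GPU IDs
# allocated to the job (unset or empty if no GPUs were requested).
# The log path below is an example only.

LOGFILE=/var/log/slurm/job_gpus.log

if [ -n "${SLURM_JOB_GPUS}" ]; then
    echo "$(date +%FT%T) $(hostname -s) job=${SLURM_JOB_ID} user=${SLURM_JOB_USER} gpus=${SLURM_JOB_GPUS}" >> "${LOGFILE}"
fi

# A non-zero exit from the prolog will drain the node, so always exit 0.
exit 0

Afterwards, running `grep job=<jobid> /var/log/slurm/job_gpus.log` on the node(s) a job ran on shows which GPU indices it was given, even after the job has completed.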