Hi Kota,

This is from the job template that I give to my users:
# Collect some information about the execution environment that may
# be useful should we need to do some debugging.
echo "CREATING DEBUG DIRECTORY"
echo
mkdir .debug_info
module list > .debug_info/environ_modules 2>&1
ulimit -a > .debug_info/limits 2>&1
hostname > .debug_info/environ_hostname 2>&1
env | grep SLURM > .debug_info/environ_slurm 2>&1
env | grep OMP | grep -v OMPI > .debug_info/environ_omp 2>&1
env | grep OMPI > .debug_info/environ_openmpi 2>&1
env > .debug_info/environ 2>&1

if [ ! -z ${CUDA_VISIBLE_DEVICES+x} ]; then
    echo "SAVING CUDA ENVIRONMENT"
    echo
    env | grep CUDA > .debug_info/environ_cuda 2>&1
fi

You could add something like this to one of the SLURM prologs to save the GPU list for each job (a rough prolog sketch is appended below the quoted message).

Best,
David

On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki <kota.tsuyuzaki...@hco.ntt.co.jp> wrote:
> Hello Guys,
>
> We are running GPU clusters with Slurm and SlurmDBD (version 19.05 series),
> and some of the GPUs seem to run into trouble with the jobs attached to
> them. To investigate whether the trouble keeps happening on the same GPUs,
> I'd like to get the GPU indices of completed jobs.
>
> In my understanding, `scontrol show job` can show the indices (as IDX in
> the gres info) but cannot be used for completed jobs, and `sacct -j` is
> available for completed jobs but won't print the indices.
>
> Is there any way (commands, configurations, etc.) to see the allocated
> GPU indices for completed jobs?
>
> Best regards,
>
> --------------------------------------------
> 露崎 浩太 (Kota Tsuyuzaki)
> kota.tsuyuzaki...@hco.ntt.co.jp
> NTT Software Innovation Center
> Distributed Computing Technology Project
> 0422-59-2837
> ---------------------------------------------
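
P.S. Here is a minimal sketch of what such a prolog could look like. It assumes the script is installed as the slurmd Prolog (Prolog= in slurm.conf) and that SLURM_JOB_GPUS is set in the prolog environment when the job was allocated GPUs; the log file path is only an example, so adjust it to suit your site.

#!/bin/bash
# Prolog sketch: append the GPU indices assigned to each job to a
# per-node log file.  SLURM_JOB_GPUS is expected to hold the GPU IDs
# allocated to the job (unset or empty if no GPUs were requested).
# The log path below is an example only.

LOGFILE=/var/log/slurm/job_gpus.log

if [ -n "${SLURM_JOB_GPUS}" ]; then
    echo "$(date +%FT%T) $(hostname -s) job=${SLURM_JOB_ID} user=${SLURM_JOB_USER} gpus=${SLURM_JOB_GPUS}" >> "${LOGFILE}"
fi

# A non-zero exit from the prolog will drain the node, so always exit 0.
exit 0

Afterwards, running `grep job=<jobid> /var/log/slurm/job_gpus.log` on the node(s) a job ran on shows which GPU indices it was given, even after the job has completed.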