Hi,

If you are using cgroups for task/process management, you should verify that your /etc/slurm/cgroup.conf has the following line:

ConstrainDevices=yes
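For reference, a minimal cgroup.conf that constrains devices might look like the sketch below. Only ConstrainDevices=yes matters for hiding unallocated GPUs; the other lines are just a typical setup and can stay whatever they already are at your site. This also assumes slurm.conf is using the cgroup plugins (e.g. TaskPlugin=task/cgroup), which is what "using cgroups for task/process management" means here.

# /etc/slurm/cgroup.conf -- sketch only, adapt to your site
CgroupAutomount=yes
# constrain CPU cores and memory to the job's allocation (optional for this issue)
ConstrainCores=yes
ConstrainRAMSpace=yes
# constrain access to device files (/dev/nvidia*) to the allocated GRES
ConstrainDevices=yes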
I'm not sure about the missing environment variable, but the absence of the above line in cgroup.conf is one way the GPU devices can end up unconstrained in jobs.

-Sean

On Wed, Mar 23, 2022 at 10:46 AM <taleinterve...@sjtu.edu.cn> wrote:
> Hi, all:
>
> We have found that jobs submitted with an argument such as --gres gpu:1 are not restricted in their GPU usage: the user can still see all GPU cards on the allocated node.
>
> Our GPU nodes have 4 cards each, with the following gres.conf:
>
> cat /etc/slurm/gres.conf
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia0 CPUs=0-15
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia1 CPUs=16-31
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia2 CPUs=32-47
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia3 CPUs=48-63
>
> As a test, we submitted a simple batch job like this:
>
> #!/bin/bash
> #SBATCH --job-name=test
> #SBATCH --partition=a100
> #SBATCH --nodes=1
> #SBATCH --ntasks=6
> #SBATCH --gres=gpu:1
> #SBATCH --reservation="gpu test"
> hostname
> nvidia-smi
> echo end
>
> In the output file, nvidia-smi showed all 4 GPU cards, but we expected to see only the 1 allocated card.
>
> The official Slurm documentation says that the CUDA_VISIBLE_DEVICES environment variable is set to restrict the GPU cards available to the user, but we could not find such a variable in the job environment. We only confirmed that it exists in the prolog environment, by adding the debug command "echo $CUDA_VISIBLE_DEVICES" to the Slurm prolog script.
>
> So how does Slurm cooperate with the NVIDIA tools to make a job's user see only the allocated GPU card? And what is required of the NVIDIA GPU driver, the CUDA toolkit, or any other component for Slurm to correctly restrict GPU usage?
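Once that line is in place and slurmd has been restarted on the node, a quick check is a job along the lines of the sketch below (it reuses the a100 partition from your script; the reservation is omitted). With the device cgroup active, nvidia-smi should list only the single allocated card. The constraint itself is enforced by the kernel's device cgroup, so it does not depend on the NVIDIA driver version or on having a CUDA toolkit installed.

#!/bin/bash
#SBATCH --job-name=gres-check
#SBATCH --partition=a100
#SBATCH --nodes=1
#SBATCH --gres=gpu:1

# GPU-related variables slurmd normally injects into the job environment
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-unset}"

# With ConstrainDevices=yes only the allocated device file is accessible,
# so this should list exactly one GPU
nvidia-smi -L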