Hi,

If you are using cgroups for task/process management, you should verify that your /etc/slurm/cgroup.conf has the following line:

ConstrainDevices=yes
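For reference, a minimal cgroup.conf that constrains devices might look like the sketch below. Only ConstrainDevices=yes matters for hiding unallocated GPUs; the other lines are just a typical setup and can stay whatever they already are at your site. This also assumes slurm.conf is using the cgroup plugins (e.g. TaskPlugin=task/cgroup), which is what "using cgroups for task/process management" means here.

# /etc/slurm/cgroup.conf -- sketch only, adapt to your site
CgroupAutomount=yes
# constrain CPU cores and memory to the job's allocation (optional for this issue)
ConstrainCores=yes
ConstrainRAMSpace=yes
# constrain access to device files (/dev/nvidia*) to the allocated GRES
ConstrainDevices=yes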
I'm not sure about the missing environment variable, but the absence of the above line in cgroup.conf is one way the GPU devices can end up unconstrained in jobs.

-Sean

On Wed, Mar 23, 2022 at 10:46 AM <taleinterve...@sjtu.edu.cn> wrote:
> Hi, all:
>
> We have found that jobs submitted with an argument such as --gres gpu:1 are not restricted in their GPU usage: the user can still see all GPU cards on the allocated node.
>
> Our GPU nodes have 4 cards each, with the following gres.conf:
>
> cat /etc/slurm/gres.conf
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia0 CPUs=0-15
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia1 CPUs=16-31
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia2 CPUs=32-47
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia3 CPUs=48-63
>
> As a test, we submitted a simple batch job like this:
>
> #!/bin/bash
> #SBATCH --job-name=test
> #SBATCH --partition=a100
> #SBATCH --nodes=1
> #SBATCH --ntasks=6
> #SBATCH --gres=gpu:1
> #SBATCH --reservation="gpu test"
> hostname
> nvidia-smi
> echo end
>
> In the output file, nvidia-smi showed all 4 GPU cards, but we expected to see only the 1 allocated card.
>
> The official Slurm documentation says that the CUDA_VISIBLE_DEVICES environment variable is set to restrict the GPU cards available to the user, but we could not find such a variable in the job environment. We only confirmed that it exists in the prolog environment, by adding the debug command "echo $CUDA_VISIBLE_DEVICES" to the Slurm prolog script.
>
> So how does Slurm cooperate with the NVIDIA tools to make a job's user see only the allocated GPU card? And what is required of the NVIDIA GPU driver, the CUDA toolkit, or any other component for Slurm to correctly restrict GPU usage?
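Once that line is in place and slurmd has been restarted on the node, a quick check is a job along the lines of the sketch below (it reuses the a100 partition from your script; the reservation is omitted). With the device cgroup active, nvidia-smi should list only the single allocated card. The constraint itself is enforced by the kernel's device cgroup, so it does not depend on the NVIDIA driver version or on having a CUDA toolkit installed.

#!/bin/bash
#SBATCH --job-name=gres-check
#SBATCH --partition=a100
#SBATCH --nodes=1
#SBATCH --gres=gpu:1

# GPU-related variables slurmd normally injects into the job environment
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-unset}"

# With ConstrainDevices=yes only the allocated device file is accessible,
# so this should list exactly one GPU
nvidia-smi -L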