I assume you mean the sentence about dynamic MIG at https://slurm.schedmd.com/gres.html#MIG_Management. Could it be supported? I think so, but only if one of SchedMD's paying customers (that could be you) asks for it.

On Wed, Nov 22, 2023 at 11:24 AM Aaron Kollmann <aaron.kollm...@student.hpi.de> wrote:

> Hello All,
>
> I am currently working in a research project and we are trying to find out
> whether we can use NVIDIAs multi-instance GPU (MIG) dynamically in SLURM.
>
> For instance:
>
> - a user requests a job and wants a GPU but none is available
>
> - now SLURM will reconfigure a MIG GPU to create a partition (e.g. 1g.5gb)
>   which becomes available and allocated immediately
>
> I can already reconfigure MIG + SLURM within a few seconds to start jobs
> on newly partitioned resources, but Jobs get killed when I restart slurmd
> on nodes with a changed MIG config. (see script example below)
>
> *Do you think it is possible to develop a plugin or change SLURM to the
> extent that dynamic MIG will be supported one day?*
>
> (The website says it is not supported)
>
> Best
>
> - Aaron
>
> #!/usr/bin/bash
>
> # Generate Start Config
> killall slurmd
> killall slurmctld
> nvidia-smi mig -dci
> nvidia-smi mig -dgi
> nvidia-smi mig -cgi 19,14,5 -i 0 -C
> nvidia-smi mig -cgi 0 -i 1 -C
> cp -f ./slurm-19145-0.conf /etc/slurm/slurm.conf
> slurmd -c
> slurmctld -c
> sleep 5
>
> # Start a running and a pending job (the first job gets killed by slurm)
> srun -w gx06 -c 2 --mem 1G --gres=gpu:a100_1g.5gb:1 sleep 300 &
> srun -w gx06 -c 2 --mem 1G --gres=gpu:a100_1g.5gb:1 sleep 300 &
> sleep 5
>
> # Simulate MIG Config Change
> nvidia-smi mig -i 1 -dci
> nvidia-smi mig -i 1 -dgi
> nvidia-smi mig -cgi 19,14,5 -i 1 -C
> cp -f ./slurm-2x19145.conf /etc/slurm/slurm.conf
> killall slurmd
> killall slurmctld
> slurmd
> slurmctld