Hello all,

I am currently working on a research project, and we are trying to find out whether we can use NVIDIA's Multi-Instance GPU (MIG) dynamically in SLURM.
For instance:
- A user submits a job that requests a GPU, but none is available.
- SLURM then reconfigures a MIG GPU to create a partition (e.g. 1g.5gb), which becomes available and is allocated immediately.
I can already reconfigure MIG and SLURM within a few seconds and start jobs on the newly partitioned resources, but jobs get killed when I restart slurmd on nodes whose MIG configuration has changed (see the example script below).
Do you think it is possible to develop a plugin, or to change SLURM to the extent that dynamic MIG will be supported one day?
(The website says it is not supported.)

Best,
Aaron

#!/usr/bin/bash

# Generate Start Config
killall slurmd
killall slurmctld
nvidia-smi mig -dci
nvidia-smi mig -dgi
nvidia-smi mig -cgi 19,14,5 -i 0 -C
nvidia-smi mig -cgi 0 -i 1 -C
cp -f ./slurm-19145-0.conf /etc/slurm/slurm.conf
slurmd -c
slurmctld -c
sleep 5

# Start a running and a pending job (the first job gets killed by slurm)
srun -w gx06 -c 2 --mem 1G --gres=gpu:a100_1g.5gb:1 sleep 300 &
srun -w gx06 -c 2 --mem 1G --gres=gpu:a100_1g.5gb:1 sleep 300 &
sleep 5

# Simulate MIG Config Change
nvidia-smi mig -i 1 -dci
nvidia-smi mig -i 1 -dgi
nvidia-smi mig -cgi 19,14,5 -i 1 -C
cp -f ./slurm-2x19145.conf /etc/slurm/slurm.conf
killall slurmd
killall slurmctld
slurmd
slurmctld
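
For readers who want to follow the repartitioning step of the script without touching a GPU, here is a minimal dry-run sketch. It only prints the three nvidia-smi calls that the script issues for one GPU. The function name mig_reconfig_cmds and its two arguments are hypothetical and exist only for illustration. The nvidia-smi flags are copied from the script and nothing here is executed against the driver.

```shell
#!/usr/bin/bash

# Hypothetical helper (not part of SLURM or nvidia-smi): print, without running,
# the commands the script uses to repartition a single MIG-enabled GPU.
mig_reconfig_cmds() {
    local gpu="$1"       # GPU index, e.g. 1
    local profiles="$2"  # comma-separated GPU instance profile IDs, e.g. 19,14,5

    # Destroy existing compute instances, then GPU instances, on that GPU.
    echo "nvidia-smi mig -i ${gpu} -dci"
    echo "nvidia-smi mig -i ${gpu} -dgi"
    # Create the new GPU instances together with default compute instances (-C).
    echo "nvidia-smi mig -cgi ${profiles} -i ${gpu} -C"
}

# Example: show what the "Simulate MIG Config Change" step does for GPU 1.
mig_reconfig_cmds 1 19,14,5
```

Printing the commands first makes it easy to review the planned layout before the slurm.conf swap and daemon restart, which is the step where the running job is lost.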