Not a solution to your exact problem, but we document partitions for 
interactive, debug, and batch, and have a job_submit.lua [1] that routes 
GPU-reserving jobs to gpu-interactive, gpu-debug, and gpu partitions 
automatically. Since our GPU nodes have extra memory slots, and have tended to 
run at less than 100% CPU usage during GPU jobs, they also serve as our 
large-memory and small interactive job targets.

[1] https://gist.github.com/mikerenfro/df89fac5052a45cc2c1651b9a30978e0

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of 
Ratnasamy, Fritz <fritz.ratnas...@chicagobooth.edu>
Date: Tuesday, August 24, 2021 at 9:59 PM
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] scancel gpu jobs when gpu is not requested

Hello,

I have written a script in my prolog.sh that cancels any Slurm job if the 
parameter gres=gpu is not present. This is the script I added to my prolog.sh:

if [ "$SLURM_JOB_PARTITION" = "gpu" ]; then
        if [ -n "${GPU_DEVICE_ORDINAL}" ]; then
                echo "GPU ID used is ID: $GPU_DEVICE_ORDINAL"
                # Count comma-separated IDs so a multi-digit ID (e.g. 10) counts once
                Ngpu=$(echo "$GPU_DEVICE_ORDINAL" | tr ',' '\n' | wc -l)
        else
                echo "No GPU selected"
                Ngpu=0
        fi

        # If 0 GPUs were allocated, cancel the job
        if [ "$Ngpu" -eq 0 ]; then
                scancel "${SLURM_JOB_ID}"
        fi
fi

The code looks at the number of GPUs allocated and, if it is 0, cancels the 
job. It works fine if a user runs sbatch submit.sh (and submit.sh does not 
contain --gres=gpu:1). However, when a user requests an interactive session 
without GPUs, the job hangs for 5-6 minutes before it is killed.
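
The counting step described above can be sketched as a small bash function. 
This is only an illustration: count_gpus is a hypothetical helper name, not 
something Slurm provides, and it assumes the device list is a comma-separated 
string such as "0,1,10". It splits on commas instead of measuring string 
length, so a two-digit device ID is counted as one GPU.

```shell
#!/bin/bash
# count_gpus: hypothetical helper, not part of Slurm.
# Prints how many comma-separated device IDs are in its first argument.
count_gpus() {
    local ordinals="$1"
    # An empty or unset list means no GPU was allocated
    if [ -z "$ordinals" ]; then
        echo 0
        return
    fi
    local ids
    # Split the list on commas into an array and report its length
    IFS=',' read -r -a ids <<< "$ordinals"
    echo "${#ids[@]}"
}

count_gpus "0,1,10"
count_gpus ""
```

In a prolog, the result could be assigned with 
Ngpu=$(count_gpus "$GPU_DEVICE_ORDINAL") before the comparison against 0.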


jlo@mfe01:~ $ srun --partition=gpu --pty bash --login
srun: job 4631872 queued and waiting for resources
srun: job 4631872 has been allocated resources
srun: Force Terminated job 4631872 ... the killing hangs for 5-6 minutes

Is there anything wrong with my script? Why do I see this hang only when 
scancel is run against an interactive session? I would like to get rid of it.
Thanks
Fritz Ratnasamy
Data Scientist
Information Technology
The University of Chicago
Booth School of Business
5807 S. Woodlawn
Chicago, Illinois 60637
Phone: +(1) 773-834-4556
