Hello Matthew,

You may be aware of this already, but most sites implement these kinds of checks and validations in job_submit.lua. I'm not an expert in it, though plenty of others on this list are, but I'm confident this type of validation logic can be implemented there. I'd like to say that I've come across a good tutorial for job_submit.lua, but I haven't really found one. This is a reasonably good intro:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#job-submit-plugins

You can also find some sample scripts, such as:
https://github.com/SchedMD/slurm/blob/master/contribs/lua/job_submit.lua

Warmest regards,
Jason

On Tue, Feb 27, 2024 at 5:02 PM Matthew R. Baney via slurm-users <slurm-users@lists.schedmd.com> wrote:

> Hello Slurm users,
>
> I'm trying to write a check in our job_submit.lua script that enforces relative resource requirements such as disallowing more than 4 CPUs or 48GB of memory per GPU. The QOS itself has a MaxTRESPerJob of cpu=32,gres/gpu=8,mem=384G (roughly one full node), but we're looking to prevent jobs from "stranding" GPUs, e.g., a 32 CPU/384GB memory job with only 1 GPU.
>
> I might be missing something obvious, but the rabbit hole I'm going down at the moment is trying to check all of the different ways job arguments could be set in the job descriptor.
>
> i.e., the following should all be disallowed:
>
> srun --gres=gpu:1 --mem=49G ... (tres_per_node, mem_per_node set in the descriptor)
>
> srun --gpus=1 --mem-per-gpu=49G ... (tres_per_job, mem_per_tres)
>
> srun --gres=gpu:1 --ntasks-per-gpu=5 ... (tres_per_node, num_tasks, ntasks_per_tres)
>
> srun --gpus=1 --ntasks=2 --mem-per-cpu=25G ... (tres_per_job, num_tasks, mem_per_cpu)
>
> ...
>
> Essentially what I'm looking for is a way to access the ReqTRES string from the job record before it exists, and then run some logic against that i.e., if (CPU count / GPU count) > 4 or (mem count / GPU count) > 48G, error out.
>
> Is something like this possible?
>
> Thanks,
> Matthew
>
> --
> Matthew Baney
> Assistant Director of Computational Systems
> mba...@umd.edu | (301) 405-6756
> University of Maryland Institute for Advanced Computer Studies
> 3154 Brendan Iribe Center
> 8125 Paint Branch Dr.
> College Park, MD 20742
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>

--
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms
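For illustration, the ratio test described in the quoted message (no more than 4 CPUs or 48GB of memory per GPU) might be sketched in plain Lua roughly as follows. This is only a sketch under stated assumptions, not a tested plugin: the helper names gpu_count and check_ratios are hypothetical, the TRES string shapes it parses ("gres/gpu:2", "gres:gpu:a100:2", "gres/gpu=8") are guesses that may not match what a given Slurm version puts in the descriptor, and memory is assumed to be expressed in MB. It deliberately leaves out the harder part raised in the question, which is turning each combination of descriptor fields into total CPU, memory, and GPU counts.

```lua
-- Limits taken from the original post: 4 CPUs and 48 GB of memory per GPU.
local MAX_CPUS_PER_GPU = 4
local MAX_MEM_MB_PER_GPU = 48 * 1024  -- assumption: memory is handled in MB

-- Hypothetical helper: sum the GPU counts found in a comma-separated TRES
-- string. The accepted shapes are assumptions, e.g. "gres/gpu:2",
-- "gres:gpu:a100:2" or "cpu=32,gres/gpu=8,mem=384G". A GPU entry without
-- a trailing count is treated as one GPU.
local function gpu_count(tres)
  if type(tres) ~= "string" then
    return 0
  end
  local total = 0
  for entry in tres:gmatch("[^,]+") do
    if entry:find("gpu", 1, true) then
      local n = entry:match("[:=](%d+)$")
      total = total + (tonumber(n) or 1)
    end
  end
  return total
end

-- Hypothetical helper: returns true when the totals respect the per-GPU
-- limits, or false plus a message otherwise. Jobs without GPUs pass,
-- since the "stranding" concern only applies to GPU jobs.
local function check_ratios(cpus, mem_mb, gpus)
  if not gpus or gpus == 0 then
    return true
  end
  if cpus and cpus > gpus * MAX_CPUS_PER_GPU then
    return false, "too many CPUs per GPU (limit " .. MAX_CPUS_PER_GPU .. ")"
  end
  if mem_mb and mem_mb > gpus * MAX_MEM_MB_PER_GPU then
    return false, "too much memory per GPU (limit " .. MAX_MEM_MB_PER_GPU .. " MB)"
  end
  return true
end
```

Inside slurm_job_submit, one would presumably derive the three totals from the descriptor fields named in the quoted message (tres_per_node, tres_per_job, mem_per_tres, num_tasks, ntasks_per_tres and so on), treat unset fields as unknown, and reject the job with a user-facing message when check_ratios returns false. The exact field names and their unset values should be checked against the job_submit plugin source for the Slurm version in use.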