Hi,

> So here's something funny. One user submitted a job that requested 60 CPUs
> and 400000M of memory. Our largest nodes in that partition have 72 CPUs and
> 256G of memory. So when a user requests 400G of RAM, what would be good
> behavior? I would like to see Slurm reject the job: "job is impossible to
> run." Instead, Slurm keeps slowly growing the priority of that job (because
> of fair-share), and the job effectively disables the nodes that are trying to
> free up memory for it (all the nodes that have enough CPUs, not just one
> node that has enough CPUs). This is a combination of *multiple* bad
> behaviors. Stopping a node that can never satisfy the request is bad...
> Stopping *all* nodes that have enough CPUs, even though none of them can
> ever satisfy the request, is extra bad.
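For anyone reading along, a request of that shape would look roughly like the
hypothetical batch script below (the partition name and application are made
up; the numbers mirror the report above):

    #!/bin/bash
    #SBATCH --partition=bigmem      # hypothetical partition name
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=60      # fits on a 72-CPU node
    #SBATCH --mem=400000M           # ~400G per node; the largest node has 256G
    srun ./my_app                   # placeholder application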
EnforcePartLimits=YES would be your friend :) We also use a submission filter that checks each request and adjusts some things if needed :) (Rough sketches of both are below.)

Regards,
--
Andy
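For the archives, here is a minimal, untested sketch of both pieces, using
made-up site values. First the slurm.conf option mentioned above (newer Slurm
releases spell the values ALL/ANY/NO; check the slurm.conf man page for your
version):

    # slurm.conf (excerpt)
    # Reject jobs at submission time when they exceed a partition's limits,
    # instead of leaving them pending forever.
    EnforcePartLimits=YES

And a submission filter in the same spirit, written as a job_submit/lua
plugin (job_submit.lua). The 256G ceiling is a hard-coded example value, and
the job_desc field names follow the Lua job-submit API as I know it, so
verify them against your Slurm release:

    -- job_submit.lua: reject per-node memory requests that no node can satisfy.
    -- MAX_MEM_MB is a made-up site value (largest node = 256G).
    local MAX_MEM_MB = 256 * 1024

    -- Requests made with --mem-per-cpu (or left unset) show up with the high
    -- bit set; this sketch only checks plain per-node requests (--mem).
    local MEM_PER_CPU_FLAG = 2^63

    function slurm_job_submit(job_desc, part_list, submit_uid)
        local mem = job_desc.pn_min_memory   -- requested memory per node, in MB
        if mem ~= nil and mem < MEM_PER_CPU_FLAG and mem > MAX_MEM_MB then
            slurm.log_user("Rejecting job: no node in this cluster has " ..
                           tostring(mem) .. "M of memory")
            return slurm.ERROR
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, submit_uid)
        return slurm.SUCCESS
    end

    return slurm.SUCCESS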