Hello
    We are having performance issues with slurmctld (delayed sinfo/squeue 
results, socket timeouts for multiple sbatch calls, jobs/nodes sitting in COMP 
state for an extended period of time).

We just fully switched to Slurm from Univa, and I think our problem is users 
putting a lot of "scontrol show" calls (maybe squeue/sinfo as well) in large 
batches of jobs and essentially DoS-ing our scheduler.

Is there a built-in way to throttle "squeue/sinfo/scontrol show" commands in a 
reasonable manner so one user can't do something dumb, like running jobs that 
keep calling these commands in bulk?

If I need to build something myself to verify this, I am thinking about writing 
a wrapper script around these commands that locks a shared temp file on the 
local disk of each machine (to avoid NFS locking issues) and then sleeps for 5 
seconds before calling the real command and releasing the lock. At least this 
way a user will only run 1 copy per machine that their jobs land on instead of 
1 per CPU core per machine.
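
A minimal sketch of the kind of wrapper I have in mind, in Python. Everything 
named here is an assumption for illustration: the real binaries are assumed to 
live in /usr/bin, the lock lives in /tmp (which must be node-local disk), and 
the wrapper is assumed to be installed earlier in PATH under the names squeue, 
sinfo and scontrol. One deliberate variation: it uses one lock file per user 
rather than a single shared file, since a shared file in /tmp created by one 
user may not be openable by another.

```python
#!/usr/bin/env python3
"""Sketch: serialize Slurm query commands per user per node with flock."""
import fcntl
import os
import subprocess
import sys
import time

# Hypothetical settings, not taken from any Slurm documentation.
REAL_BIN_DIR = "/usr/bin"   # assumed location of the real commands
LOCK_DIR = "/tmp"           # must be local disk, not NFS
DELAY_SECONDS = 5
WRAPPED = {"squeue", "sinfo", "scontrol"}


def run_throttled(argv, lock_path, delay=DELAY_SECONDS):
    """Take an exclusive lock, sleep, run argv, release; return exit code."""
    # A read-only descriptor is enough for flock() on Linux.
    fd = os.open(lock_path, os.O_CREAT | os.O_RDONLY, 0o600)
    try:
        # Blocks until every earlier caller holding this lock has finished.
        fcntl.flock(fd, fcntl.LOCK_EX)
        time.sleep(delay)
        return subprocess.call(argv)
    finally:
        # Closing the descriptor drops the lock (the child does not inherit it).
        os.close(fd)


if __name__ == "__main__":
    name = os.path.basename(sys.argv[0])
    # Only act when invoked under the name of a wrapped command.
    if name in WRAPPED:
        real = os.path.join(REAL_BIN_DIR, name)
        lock = os.path.join(LOCK_DIR, "slurm_throttle_%d.lock" % os.getuid())
        sys.exit(run_throttled([real] + sys.argv[1:], lock))
```

Note that this throttles every scontrol call, not just "scontrol show", and 
that callers queue up behind the lock rather than being rejected, so a job 
that issues many calls simply runs them one at a time, 5 seconds apart.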

Thanks.
