Hello, we are having performance issues with slurmctld (delayed sinfo/squeue results, socket timeouts on multiple sbatch calls, and jobs/nodes sitting in the COMP state for extended periods of time).
We just fully switched to Slurm from Univa, and I think our problem is users putting a lot of "scontrol show" calls (maybe squeue/sinfo as well) in large batches of jobs and essentially DoS-ing our scheduler. Is there a built-in way to throttle "squeue/sinfo/scontrol show" commands in a reasonable manner, so that one user can't do something dumb like running jobs that keep calling these commands in bulk?

If I need to put something together myself to verify this, I am thinking about making a wrapper script around these commands that locks a shared temp file on the local disk of each machine (to avoid NFS locking issues), then sleeps for 5 seconds before calling the real command and releasing the lock. At least this way a user will only run 1 copy at a time per machine that their jobs land on, instead of 1 per CPU core per machine.

Thanks.
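
For concreteness, the wrapper idea could be sketched roughly like this in Python. This is only a sketch under assumptions: the lock file location (`/tmp`, one file per user), the real binaries living in `/usr/bin`, and the helper names `run_throttled`, `real_command_path` and `main` are all made up for illustration and are not anything Slurm provides.

```python
#!/usr/bin/env python3
"""Hypothetical throttling wrapper for squeue/sinfo/scontrol (sketch only)."""
import fcntl
import os
import subprocess
import sys
import time

# Assumed site-specific settings; adjust as needed.
REAL_BIN_DIR = "/usr/bin"  # assumed location of the real Slurm commands
LOCK_PATH = "/tmp/slurm_query_%d.lock" % os.getuid()  # local disk, one per user
DELAY_SECONDS = 5.0


def real_command_path(invoked_as, bin_dir=REAL_BIN_DIR):
    """Map the name the wrapper was invoked as (e.g. 'squeue') to the real binary."""
    return os.path.join(bin_dir, os.path.basename(invoked_as))


def run_throttled(argv, lock_path=LOCK_PATH, delay=DELAY_SECONDS):
    """Take an exclusive local lock, sleep, run argv, then release the lock."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Blocks until no other wrapper instance of this user holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        time.sleep(delay)
        # The lock is held for the duration of the real command.
        return subprocess.call(argv)
    finally:
        # Closing the descriptor releases the flock.
        os.close(fd)


def main(argv):
    """Entry point: run the real command of the same name with the same args."""
    return run_throttled([real_command_path(argv[0])] + argv[1:])
```

To try it, the file would end with `sys.exit(main(sys.argv))` and be installed under the command's name (for example as `squeue`) in a directory that comes before the real binaries in users' PATH. Because the lock file is per user and on local disk, each user runs at most one of these queries at a time per node, each delayed by 5 seconds.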