Hoping someone can help point me towards some tweaks to help prevent 
denial-of-service issues.
> sbatch: error: Batch job submission failed: Socket timed out on send/recv 
> operation

The root cause is understood: the shared storage for the slurmctld hosts was 
impacted, which increased write latency to the StateSaveLocation.
Then, with a large enough avalanche of job submissions, the RPCs would 
stack up and stop responding.

I’ve been running well with some tweaks sourced from the “high-throughput” 
guide <https://slurm.schedmd.com/high_throughput.html>. 

> SchedulerParameters=max_rpc_cnt=400,\
> sched_min_interval=50000,\
> sched_max_job_start=300,\
> batch_sched_delay=6
> KillWait=30
> MessageTimeout=30

I’m assuming that I was running into batch_sched_delay because, looking at sdiag 
after the fact, REQUEST_SUBMIT_BATCH_JOB was averaging 0.2s, and its total time 
was 5.5h out of 16h8m18s at the time of the sdiag sample.
> *******************************************************
> sdiag output at Thu Jan 25 11:08:18 2024 (1706198898)
> Data since      Wed Jan 24 19:00:00 2024 (1706140800)
> *******************************************************
>         REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:98400  
> ave_time:201442 total_time:19821991013
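
For anyone checking my arithmetic, here is a minimal Python sketch of how those figures fall out of the sdiag sample. It assumes the sdiag RPC times are in microseconds; the function name and dictionary keys are my own, not anything from Slurm.

```python
from datetime import timedelta

# Values copied from the sdiag sample above.
# Assumption: sdiag reports ave_time and total_time in microseconds.
COUNT = 98400
AVE_TIME_US = 201442
TOTAL_TIME_US = 19821991013
SINCE_EPOCH = 1706140800   # "Data since" timestamp
SAMPLE_EPOCH = 1706198898  # "sdiag output at" timestamp


def summarize(count, ave_time_us, total_time_us, since_epoch, sample_epoch):
    """Convert raw sdiag RPC counters into human-friendly numbers."""
    window_s = sample_epoch - since_epoch
    total_s = total_time_us / 1_000_000
    return {
        "ave_s": ave_time_us / 1_000_000,          # mean seconds per submit RPC
        "total_h": total_s / 3600,                 # cumulative hours in this RPC
        "window": str(timedelta(seconds=window_s)),
        "busy_fraction": total_s / window_s,       # share of the sample window
        "submits_per_s": count / window_s,         # average submission rate
    }


summary = summarize(COUNT, AVE_TIME_US, TOTAL_TIME_US, SINCE_EPOCH, SAMPLE_EPOCH)
for key, value in summary.items():
    print(f"{key}: {value}")
```

So roughly a third of the sample window was spent in submit RPCs, at an average of well under two submissions per second over the whole window.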

Currently on 22.05.8, but hoping to get to 23.02.7 soon™, and I think this 
could possibly resolve the issue well enough, if I’m reading it correctly from 
the release notes 
<https://slurm.schedmd.com/archive/slurm-23.02-latest/news.html>?

> HIGHLIGHTS
> ==========
>  -- slurmctld - Add new RPC rate limiting feature. This is enabled through
>     SlurmctldParameters=rl_enable, otherwise disabled by default.

> rl_enable <https://slurm.schedmd.com/slurm.conf.html#OPT_rl_enable> Enable 
> per-user RPC rate-limiting support. Client-commands will be told to back off 
> and sleep for a second once the limit has been reached. This is implemented 
> as a "token bucket", which permits a certain degree of "bursty" RPC load from 
> an individual user before holding them to a steady-state RPC load established 
> by the refill period and rate.
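
To make sure I understand the "token bucket" idea, here is a minimal Python sketch of the general technique. This is not Slurm's implementation: the class, its parameter names, and the numbers are all hypothetical, and time is passed in explicitly so the behaviour is easy to reason about.

```python
class TokenBucket:
    """Generic token bucket: allows bursts up to `capacity`, then a steady rate.

    Hypothetical illustration only; names and defaults are not Slurm's.
    """

    def __init__(self, capacity, refill_rate, refill_period=1.0, now=0.0):
        self.capacity = capacity            # maximum burst size, in tokens
        self.refill_rate = refill_rate      # tokens added per refill period
        self.refill_period = refill_period  # length of a refill period, seconds
        self.tokens = capacity              # start full, so a burst is allowed
        self.last_refill = now

    def allow(self, now):
        """Return True if a request at time `now` may proceed."""
        # Credit tokens for every whole refill period that has elapsed.
        periods = int((now - self.last_refill) // self.refill_period)
        if periods > 0:
            self.tokens = min(self.capacity,
                              self.tokens + periods * self.refill_rate)
            self.last_refill += periods * self.refill_period
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False  # caller would be told to back off and retry


# One bucket per user: a burst of 3 passes, the 4th request is refused.
bucket = TokenBucket(capacity=3, refill_rate=1, refill_period=1.0)
burst = [bucket.allow(0.0) for _ in range(4)]
print(burst)
```

If I have this right, a user hammering sbatch in a loop would drain their own bucket and get slowed down, without affecting other users' buckets.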

But given that the hardware seems to be well overprovisioned (CPU never drops 
below 5% idle), it feels like there is more room to squeeze some optimization 
out of this in the interim that I’m missing, and I’m hoping to get a better 
overall understanding in the process.
I scrape the DBD Agent queue size from sdiag every 30s and the largest value I 
saw was 115, which is much higher than normal, but should be well below 
MaxDBDMsgs, where the minimum value is 10000.
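
For reference, the scrape is roughly the following Python sketch. It assumes sdiag prints a line of the form "DBD Agent queue size: N"; the function names and the sample text are made up for illustration, and in practice the text would come from running sdiag every 30s.

```python
import re

# Minimum MaxDBDMsgs value mentioned above.
MAX_DBD_MSGS_FLOOR = 10000

# Assumption: sdiag emits a line like "DBD Agent queue size: 115".
QUEUE_RE = re.compile(r"^\s*DBD Agent queue size:\s*(\d+)\s*$", re.MULTILINE)


def parse_dbd_agent_queue_size(sdiag_text):
    """Return the DBD Agent queue size from sdiag output, or None if absent."""
    match = QUEUE_RE.search(sdiag_text)
    return int(match.group(1)) if match else None


def queue_headroom(queue_size, max_dbd_msgs=MAX_DBD_MSGS_FLOOR):
    """Fraction of MaxDBDMsgs still unused at the given queue size."""
    return 1.0 - queue_size / max_dbd_msgs


# Hypothetical sample text, trimmed to the lines that matter here.
sample = """\
Server thread count:  3
DBD Agent queue size: 115
"""

size = parse_dbd_agent_queue_size(sample)
print(size, queue_headroom(size))
```

At a peak of 115 the queue was using a little over 1% of the minimum MaxDBDMsgs, which is why I don't think the DBD agent queue was the bottleneck.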

I would really hope that I didn’t hit the 30s MessageTimeout value, but I guess 
that’s on the table as well; I don’t know whether that would trigger an sbatch 
submission failure like that.

Just moving the max_rpc_cnt value up seems like an easy button, but also seems 
like it could have some adverse effects for backfill scheduling, and may be 
diminishing returns for actually keeping RPCs flowing?
> Setting max_rpc_cnt to more than 256 will be only useful to let backfill 
> continue scheduling work after locks have been yielded (i.e. each 2 seconds) 
> if there are a maximum of MAX(max_rpc_cnt/10, 20) RPCs in the queue. i.e. 
> max_rpc_cnt=1000, the scheduler will be allowed to continue after yielding 
> locks only when there are less than or equal to 100 pending RPCs. 

Obviously, fixing the storage is the real solution, but I’m hoping that there 
may be more goodness to unlock, even if it is as simple as “upgrade to 23.02”.

Appreciate any insight,
Reed
