[slurm-users] Two jobs each with a different partition running on same node?

2024-01-29 Thread Loris Bennett
Hi,

I seem to remember that in the past, if a node was configured to be in
two partitions, the actual partition of the node was determined by the
partition associated with the jobs running on it.  Moreover, at any
instance where the node was running one or more jobs, the node could
only actually be in a single partition.

Was this indeed the case and is it still the case with version Slurm
23.02.7?

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
FUB-IT (ex-ZEDAT), Freie Universität Berlin



Re: [slurm-users] Two jobs each with a different partition running on same node?

2024-01-29 Thread Paul Edmon
That certainly isn't the case in our configuration. We have multiple 
overlapping partitions, and our nodes have a mix of jobs from all the 
different partitions.  So the default behavior is to have a mix of 
partitions on a node, governed by the PriorityTier of each partition: 
jobs from the highest priority tier always go first, but jobs from the 
lower tiers can fill in the gaps on a node.
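
For illustration, the sort of overlapping setup I mean looks roughly like 
this in slurm.conf (node and partition names invented, just a sketch):

NodeName=node[01-04] CPUs=64 RealMemory=256000 State=UNKNOWN
# Both partitions contain the same nodes; jobs in "prio" are considered
# first, and jobs in "general" backfill into whatever is left over.
PartitionName=prio    Nodes=node[01-04] PriorityTier=10 MaxTime=7-00:00:00 State=UP
PartitionName=general Nodes=node[01-04] PriorityTier=1  Default=YES MaxTime=7-00:00:00 State=UP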


Having multiple partitions and then having only one of them own a node 
whenever it happens to have a job running isn't a standard option to my 
knowledge. You can accomplish something like this with MCS, which I know 
can lock down nodes to specific users and groups. But what you describe 
sounds more like locking down based on partition rather than on user or 
group, which I'm not sure how to accomplish in the current version of Slurm.
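
For reference, the MCS route would be something along these lines in 
slurm.conf. This is sketched from memory of the mcs/group docs, with 
invented group names, so double-check the man page before trying it:

MCSPlugin=mcs/group
# 'enforced' tags every job with an MCS label (here taken from the user's
# group), and 'select' ties node selection to that label so a node is not
# shared across different labels.
MCSParameters=enforced,select,privatedata:groupA|groupB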


That doesn't mean it's not possible; I just don't know how, unless it is 
some obscure option.


-Paul Edmon-

On 1/29/2024 9:25 AM, Loris Bennett wrote:

> Hi,
>
> I seem to remember that in the past, if a node was configured to be in
> two partitions, the actual partition of the node was determined by the
> partition associated with the jobs running on it.  Moreover, at any
> instance where the node was running one or more jobs, the node could
> only actually be in a single partition.
>
> Was this indeed the case and is it still the case with version Slurm
> 23.02.7?
>
> Cheers,
>
> Loris





[slurm-users] Why is Slurm 20 the latest RPM in RHEL 8/Fedora repo?

2024-01-29 Thread Robert Kudyba
According to these links, Slurm 20 is still the latest version packaged:
https://rpmfind.net/linux/rpm2html/search.php?query=slurm
https://src.fedoraproject.org/rpms/slurm

Why doesn't RHEL 8 get a newer version? Can someone get the repo
maintainer, Philip Kovacs <pk...@fedoraproject.org>, to update it? There
was a ticket at https://bugzilla.redhat.com/show_bug.cgi?id=1912491 but no
movement on RHEL 8.
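
For anyone double-checking, the version the repos currently offer on a
RHEL 8 box with EPEL enabled can be seen with something like this (the
output will of course depend on which repos you have enabled):

dnf info slurm    # version the enabled repos would install
rpm -q slurm      # version currently installed, if any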


[slurm-users] Socket timed out - tuning

2024-01-29 Thread Reed Dier
Hoping someone can point me towards some tweaks to help prevent 
denial-of-service issues like this one:
> sbatch: error: Batch job submission failed: Socket timed out on send/recv 
> operation

The root cause is understood: the shared storage for the slurmctlds was 
impacted, leading to an increase in write latency to the StateSaveLocation. 
Then, with a large enough avalanche of job submissions, the RPCs would 
stack up and the controller would stop responding.

I’ve been running well with some tweaks sourced from the “high-throughput” 
guide <https://slurm.schedmd.com/high_throughput.html>.

> SchedulerParameters=max_rpc_cnt=400,\
> sched_min_interval=5,\
> sched_max_job_start=300,\
> batch_sched_delay=6
> KillWait=30
> MessageTimeout=30
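
Written out flat, in case the backslash wrapping above is confusing, that
amounts to:

SchedulerParameters=max_rpc_cnt=400,sched_min_interval=5,sched_max_job_start=300,batch_sched_delay=6
KillWait=30
MessageTimeout=30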

I’m assuming that I was running into batch_sched_delay, because looking at 
sdiag after the fact, the batch submission RPC was averaging 0.2s, and its 
total time was 5.5h out of the 16h8m18s covered at the time of the sdiag 
sample.
> ***
> sdiag output at Thu Jan 25 11:08:18 2024 (1706198898)
> Data since  Wed Jan 24 19:00:00 2024 (1706140800)
> ***
> REQUEST_SUBMIT_BATCH_JOB( 4003) count:98400  
> ave_time:201442 total_time:19821991013
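
(For reference, sdiag reports ave_time and total_time in microseconds, so
201442 is roughly 0.2 s per submit RPC, and 19821991013 is roughly 19822 s,
i.e. about 5.5 h, against the 16h8m18s between the two timestamps above;
that is where my numbers come from.)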

I’m currently on 22.05.8, but hoping to get to 23.02.7 soon™, and I think 
this could possibly resolve the issue well enough, if I’m reading the 
release notes correctly?

> HIGHLIGHTS
> ==
>  -- slurmctld - Add new RPC rate limiting feature. This is enabled through
> SlurmctldParameters=rl_enable, otherwise disabled by default.

> rl_enable Enable 
> per-user RPC rate-limiting support. Client-commands will be told to back off 
> and sleep for a second once the limit has been reached. This is implemented 
> as a "token bucket", which permits a certain degree of "bursty" RPC load from 
> an individual user before holding them to a steady-state RPC load established 
> by the refill period and rate.
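
If I’m reading that right, turning it on after the upgrade would just be a
one-liner in slurm.conf, per the highlight above (the bucket size and
refill rate presumably have their own rl_* knobs I’d still need to read up
on):

SlurmctldParameters=rl_enable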

But given that the hardware seems to be well over-provisioned (CPU never 
drops below 5% idle), it feels like there is more optimization to squeeze 
out of this in the interim that I’m missing, and I’m hoping to get a better 
overall understanding in the process.
I scrape the DBD Agent queue size from sdiag every 30s, and the largest 
value I saw was 115, which is much higher than normal, but should still be 
well below MaxDBDMsgs, whose minimum value is 10,000.
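
The scrape itself is nothing fancy; roughly this, sketching the idea rather
than the actual collector:

while true; do
    sdiag | grep -i 'DBD Agent queue size'
    sleep 30
done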

I would really hope that I didn’t hit a 30s MessageTimeout, but I guess 
that’s on the table as well; I don’t know whether that would trigger an 
sbatch submission failure like this.

Just moving the max_rpc_cnt value up seems like an easy button, but it also 
seems like it could have some adverse effects on backfill scheduling, and 
may offer diminishing returns for actually keeping RPCs flowing?
> Setting max_rpc_cnt to more than 256 will be only useful to let backfill 
> continue scheduling work after locks have been yielded (i.e. each 2 seconds) 
> if there are a maximum of MAX(max_rpc_cnt/10, 20) RPCs in the queue. i.e. 
> max_rpc_cnt=1000, the scheduler will be allowed to continue after yielding 
> locks only when there are less than or equal to 100 pending RPCs. 
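
(Plugging in my current setting: with max_rpc_cnt=400 that threshold is
MAX(400/10, 20) = 40 pending RPCs before backfill is allowed to continue
after yielding locks, if I’m reading that right.)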

Obviously, fixing the storage is the real solution, but I’m hoping that 
there may be more goodness to unlock here, even if it is as simple as 
“upgrade to 23.02”.

Appreciate any insight,
Reed
