Hi Guillaume,
The performance of the slurmctld server depends strongly on the server
hardware on which it is running! This should be taken into account when
considering your question.
SchedMD recommends that the slurmctld server have only a few, but very
fast, CPU cores in order to ensure the best responsiveness.
The file system for /var/spool/slurmctld/ should be mounted on the
fastest possible disks (SSD or NVMe if possible).
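For reference, /var/spool/slurmctld/ is the directory configured as
StateSaveLocation in slurm.conf.  As a quick sanity check (the path below
is just the common example; your site may use a different location), you
can verify which filesystem and block device actually back it:

  # slurm.conf on the slurmctld host
  StateSaveLocation=/var/spool/slurmctld

  # show the filesystem and device backing the state save directory
  findmnt -T /var/spool/slurmctld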
You should also read the Large Cluster Administration Guide at
https://slurm.schedmd.com/big_sys.html
Furthermore, it may be a good idea to run the MySQL database server on a
separate machine so that it doesn't slow down the slurmctld.
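In case it helps, a minimal sketch of that setup (hostnames and
credentials below are only placeholders): slurm.conf points at the
slurmdbd host, and slurmdbd.conf points at the separate MySQL server:

  # slurm.conf (slurmctld host)
  AccountingStorageType=accounting_storage/slurmdbd
  AccountingStorageHost=dbd-host

  # slurmdbd.conf (slurmdbd host)
  DbdHost=dbd-host
  StorageType=accounting_storage/mysql
  StorageHost=mysql-host     # MySQL running on its own server
  StorageUser=slurm
  StoragePass=changeme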
Best regards,
Ole
On 8/27/19 9:45 AM, Guillaume Perrault Archambault wrote:
Hi Paul,
Thanks a lot for your suggestion.
The cluster I'm using has thousands of users, so I'm doubtful the admins
will change this setting just for me. But I'll mention it to the support
team I'm working with.
I was hoping more for something that can be done on the user end.
Is there some way for the user to measure whether the scheduler is in
RPC saturation? If it is, I could then make sure my script doesn't
launch too many jobs in parallel.
Sorry if my question is too vague; I don't understand the backend of the
SLURM scheduler very well, so my questions use the limited terminology
of a user.
My concern is just to make sure that my scripts don't send out more
commands (simultaneously) than the scheduler can handle.
For example, as an extreme scenario, suppose a user forks off 1000
sbatch commands in parallel: is that more than the scheduler can handle?
As a user, how can I know whether it is?