Hi Guillaume,
The performance of the slurmctld server depends strongly on the server
hardware on which it is running! This should be taken into account when
considering your question.
SchedMD recommends that the slurmctld server have only a few, but very
fast, CPU cores in order to ensure the best responsiveness.
The file system for /var/spool/slurmctld/ should be mounted on the
fastest possible disks (SSD or NVMe if possible).
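For reference, /var/spool/slurmctld/ is the directory configured as
StateSaveLocation in slurm.conf.  As a quick sanity check (the path below
is just the common example; your site may use a different location), you
can verify which filesystem and block device actually back it:

  # slurm.conf on the slurmctld host
  StateSaveLocation=/var/spool/slurmctld

  # show the filesystem and device backing the state save directory
  findmnt -T /var/spool/slurmctld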
You should also read the Large Cluster Administration Guide at
https://slurm.schedmd.com/big_sys.html
Furthermore, it may be a good idea to run the MySQL database server on a
separate machine so that it doesn't slow down the slurmctld.
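In case it helps, a minimal sketch of that setup (hostnames and
credentials below are only placeholders): slurm.conf points at the
slurmdbd host, and slurmdbd.conf points at the separate MySQL server:

  # slurm.conf (slurmctld host)
  AccountingStorageType=accounting_storage/slurmdbd
  AccountingStorageHost=dbd-host

  # slurmdbd.conf (slurmdbd host)
  DbdHost=dbd-host
  StorageType=accounting_storage/mysql
  StorageHost=mysql-host     # MySQL running on its own server
  StorageUser=slurm
  StoragePass=changeme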
Best regards,
Ole
On 8/27/19 9:45 AM, Guillaume Perrault Archambault wrote:
Hi Paul,
Thanks a lot for your suggestion.
The cluster I'm using has thousands of users, so I'm doubtful the admins
will change this setting just for me. But I'll mention it to the support
team I'm working with.
I was hoping more for something that can be done on the user end.
Is there some way for the user to measure whether the scheduler is in
RPC saturation? If it is, I could then make sure my script doesn't
launch too many jobs in parallel.
Sorry if my question is too vague; I don't understand the backend of the
SLURM scheduler very well, so my questions use the limited terminology
of a user.
My concern is just to make sure that my scripts don't send out more
commands (simultaneously) than the scheduler can handle.
For example, as an extreme scenario, suppose a user forks off 1000
sbatch commands in parallel: is that more than the scheduler can handle?
As a user, how can I know whether it is?