All, We have a cluster running in Azure, where nodes are started up as needed.
I have encountered an interesting situation where a user ran a loop to launch 100 jobs using srun. Each was a simple test job that just runs the 'id' command.
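For anyone wanting to reproduce this, a loop of that kind might look roughly like the sketch below. This is not the user's actual script: the partition name `cloud`, the `launch_jobs` helper, and the `LAUNCH` dry-run variable are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: "cloud" is a hypothetical partition name, not our real one.
PARTITION="${PARTITION:-cloud}"
# Hypothetical dry-run hook: set LAUNCH=echo to print the commands
# instead of submitting anything.
LAUNCH="${LAUNCH:-}"

# launch_jobs N: start N single-node srun jobs, each running 'id'.
launch_jobs() {
    count="$1"
    i=1
    while [ "$i" -le "$count" ]; do
        # Background each srun so the loop does not block on allocation.
        $LAUNCH srun --partition="$PARTITION" --nodes=1 --ntasks=1 id &
        i=$((i + 1))
    done
    # Wait for all backgrounded srun commands to finish.
    wait
}
```

Calling `launch_jobs 100` would submit the 100 jobs; running with `LAUNCH=echo` first only prints the srun commands.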
The intention was to have 100 jobs on 100 machines. The partition has 125 nodes configured for it. There are no limits, QOS settings, etc. to constrain them.
However, Slurm only starts up 50 of the nodes, puts one job in Pending (Resources), and leaves the others in Pending (Priority).
I am unable to find the cause. Is there a limit on how many nodes are passed to the ResumeProgram to bring up at once? ResumeRate is at the default of 300.
Brian Andrus