We are pleased to announce the availability of Slurm release candidate
version 22.05rc1.
To highlight some new features coming in 22.05:
- Support for dynamic node addition and removal
- Support for native cgroup/v2 operation
- Newly added plugins to support HPE Slingshot 11 networks (switch/hpe_slingshot)
Have you looked at the High Throughput Computing Administration Guide? https://slurm.schedmd.com/high_throughput.html
In particular, for this problem it may help to look at SchedulerParameters. I believe the scheduler defaults are very conservative, and it will stop looking for jobs to run pretty quickly.
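For example, a starting point in slurm.conf might look like the sketch below; every value is purely illustrative and should be tuned against the guide above rather than copied:

    # Illustrative only -- tune per the high-throughput guide.
    # default_queue_depth: jobs the main scheduling loop examines per cycle
    #                      (the default of 100 is very conservative)
    # partition_job_depth: per-partition cap on that depth (0 = no cap)
    # bf_max_job_test:     jobs the backfill scheduler will consider
    # sched_min_interval:  minimum microseconds between main scheduler runs
    SchedulerParameters=default_queue_depth=10000,partition_job_depth=0,bf_max_job_test=5000,bf_continue,sched_min_interval=2000000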
Thanks Brian. We have it set to 100k, which has really improved our
performance on the A partition. We queue up 50k+ jobs nightly, and see
really good node utilization, so deep jobs are being considered.
Could be that we have the scheduler so busy doing certain things that it
takes a while for
I suspect you have too low a setting for MaxJobCount:

MaxJobCount
    The maximum number of jobs SLURM can have in its active database
    at one time. Set the values of MaxJobCount and MinJobAge to
    insure the slurmctld daemon does not exhaust its memory...
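As a concrete sketch (numbers invented for illustration, not a recommendation -- size them to your job turnover and the memory available to slurmctld):

    # slurm.conf
    MaxJobCount=200000   # jobs slurmctld keeps in its active database
    MinJobAge=300        # seconds a completed job is kept before purging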
Don’t forget about munge. You need to have munged running with the same key as the rest of the cluster in order to authenticate.

Mike Robbert
Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing
Information and Technology Solutions (ITS)
303-273-3786 | mrobb...@mines.edu
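A quick way to sanity-check that, assuming munge came from distribution packages and the key sits in the default /etc/munge/munge.key location (the host name node01 below is just a placeholder):

    # is munged running on this host?
    systemctl status munge
    # can this host decode its own credential?
    munge -n | unmunge
    # can another cluster node decode a credential from this host?
    munge -n | ssh node01 unmunge
    # compare key fingerprints across hosts -- they must match
    md5sum /etc/munge/munge.key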
Question for the braintrust:
I have 3 partitions:
- Partition A_highpri: 80 nodes
- Partition A_lowpri: same 80 nodes
- Partition B_lowpri: 10 different nodes
There is no overlap between A and B partitions.
Here is what I'm observing. If I fill the queue with ~20-30k jobs for
partiti
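For reference, a slurm.conf partition layout matching that description might look roughly like the sketch below; the node ranges and PriorityTier values are invented for illustration and are not taken from the original post:

    # node names are hypothetical placeholders
    PartitionName=A_highpri Nodes=a[001-080] PriorityTier=10
    PartitionName=A_lowpri  Nodes=a[001-080] PriorityTier=1
    PartitionName=B_lowpri  Nodes=b[001-010] PriorityTier=1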
They fixed this in newer versions of Slurm. We had the same issue with
older versions, so we had to run with the config_override option on to
keep the logs quiet. They changed the way logging was done in the more
recent releases and it's not as chatty.
-Paul Edmon-
On 5/12/22 7:35 AM, Per Lönnborg wrote:
Per Lönnborg writes:
> I "forgot" to tell our version because it´s a bit embarrising - 19.05.8...
Haha! :D
--
B/H
Per Lönnborg writes:
> Greetings,
Good day!
> is there a way to lower the log rate on error messages in slurmctld for nodes
> with hardware errors?
You don't say which version of Slurm you are running, but I think this
was changed in 21.08, so the node will only try to register once if it
has
Greetings,
is there a way to lower the log rate on error messages in slurmctld for nodes
with hardware errors?
We see for example this for a node that has DIMM errors:
[2022-05-12T07:07:34.757] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:35.760] error:
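One common stopgap (not suggested in this thread, and no substitute for replacing the failed DIMM) is to lower the configured memory for that node so it registers cleanly; the node name and value below simply mirror the example log line:

    # slurm.conf -- keep the node's other attributes as they are and only
    # drop RealMemory to what the degraded node actually reports
    NodeName=node37 RealMemory=257642   # plus the existing CPUs/Sockets/etc.
    # then restart or reconfigure slurmctld to pick up the change
    # (how node definition changes are applied depends on your version)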
Hi Richard,
I was about to say, they need to have access to the configuration
(slurm.conf) and the binaries. And not run slurmd; starting slurmd is
what makes an execution host :)
There is nothing you need to do to allow job submission from them.
I build rpms; on the login nodes I install th
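A sketch of that approach, assuming rpms built with rpmbuild from the SchedMD spec file (exact package names and config paths vary by version and build prefix), with "controller" as a placeholder host name:

    # on the login/submit node: client commands, but no slurmd
    dnf install ./slurm-22.05*.rpm ./slurm-perlapi-*.rpm
    # same munge key and slurm.conf as the rest of the cluster
    scp controller:/etc/munge/munge.key /etc/munge/munge.key
    chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key
    scp controller:/etc/slurm/slurm.conf /etc/slurm/slurm.conf
    systemctl enable --now munge
    # deliberately do NOT install or enable the slurm-slurmd package/service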
Hello,
All you need to set up is the path to the Slurm binaries (srun, sbatch,
sinfo, sacct, etc.), whether they are available via a shared file system or
locally on the submit nodes, and possibly the man pages.
You probably want to do this somewhere in /etc/profile.d or equivalent.
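A minimal sketch of that, assuming the binaries live under a shared /opt/slurm prefix (adjust to wherever they are actually installed):

    # /etc/profile.d/slurm.sh
    export PATH=/opt/slurm/bin:$PATH
    export MANPATH=/opt/slurm/share/man:$MANPATH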
Hi,
I am new to SLURM and I am still trying to understand stuff. There is
ample documentation available that teaches you how to set it up quickly.
Pardon me if this was asked before; I was not able to find anything
pointing to this.
I am trying to figure out if there is something like PBS-