Hi,
I am new to SLURM and I am still trying to understand stuff. There is
ample documentation available that teaches you how to set it up quickly.
Pardon me if this was asked before; I was not able to find anything
pointing to this.
I am trying to figure out if there is something like PBS-
Hello,
All you need to set up is the path to the Slurm binaries (srun, sbatch,
sinfo, sacct, etc.), whether they are available via a shared file system or
installed locally on the submit nodes, and possibly the man pages.
You probably want to do this somewhere in /etc/profile.d or equivalent.
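For example, a minimal profile script might look like this (a sketch only;
the /opt/slurm prefix is an assumption, use wherever your site installs Slurm):

    # /etc/profile.d/slurm.sh -- hypothetical example, adjust the prefix to your install
    # Put the Slurm client commands (srun, sbatch, sinfo, sacct, ...) on PATH
    export PATH=/opt/slurm/bin:$PATH
    # Make the Slurm man pages available as well
    export MANPATH=/opt/slurm/share/man:$MANPATH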
Hi Richard,
I was about to say: they need access to the configuration (slurm.conf) and
the binaries, and they should not run slurmd; starting slurmd is what makes
an execution host :)
There is nothing you need to do to allow job submission from them.
I build rpms; on the login nodes I install th
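As a quick sanity check on a submit-only node, something like this should
work (a sketch, assuming a systemd-based host; the hostname job is just a
trivial test):

    # slurmd should not be installed or enabled on a pure login/submit node
    systemctl is-enabled slurmd || echo "slurmd not enabled here (expected on a login node)"
    # The client commands only need slurm.conf, munge and network access to slurmctld
    sinfo -s                    # can we reach the controller?
    sbatch --wrap="hostname"    # submit a trivial test job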
Greetings,
is there a way to lower the log rate on error messages in slurmctld for nodes
with hardware errors?
We see for example this for a node that has DIMM errors:
[2022-05-12T07:07:34.757] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:35.760] error:
Per Lönnborg writes:
> Greetings,
Good day!
> is there a way to lower the log rate on error messages in slurmctld for nodes
> with hardware errors?
You don't say which version of Slurm you are running, but I think this
was changed in 21.08, so the node will only try to register once if it
has
Per Lönnborg writes:
> I "forgot" to tell our version because it's a bit embarrassing - 19.05.8...
Haha! :D
--
B/H
They fixed this in newer versions of Slurm. We had the same issue with
older versions, so we had to run with the config_override option on to
keep the logs quiet. They changed the way logging was done in the more
recent releases and it's not as chatty.
-Paul Edmon-
On 5/12/22 7:35 AM, Per Lönn
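For reference, the workaround mentioned above corresponds to slurm.conf
settings roughly like the following (a sketch only; recent slurm.conf man
pages spell the option config_overrides under SlurmdParameters, and the node
line simply reuses the values from the log above):

    # Option 1: lower the configured memory so it matches what node37 actually reports
    NodeName=node37 RealMemory=257642
    # Option 2: trust the slurm.conf values over what slurmd reports at registration
    SlurmdParameters=config_overrides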
Question for the braintrust:
I have 3 partitions:
- Partition A_highpri: 80 nodes
- Partition A_lowpri: same 80 nodes
- Partition B_lowpri: 10 different nodes
There is no overlap between A and B partitions.
Here is what I'm observing. If I fill the queue with ~20-30k jobs for
partiti
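For concreteness, the layout described above might be expressed in slurm.conf
roughly as follows (a sketch; node names and PriorityTier values are
assumptions, and the NodeName definitions are omitted):

    # Hypothetical partition definitions matching the description above
    PartitionName=A_highpri Nodes=a[001-080] PriorityTier=10 State=UP
    PartitionName=A_lowpri  Nodes=a[001-080] PriorityTier=1  State=UP
    PartitionName=B_lowpri  Nodes=b[001-010] PriorityTier=1  State=UP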
Don't forget about munge. You need to have munged running with the same key
as the rest of the cluster in order to authenticate.
Mike Robbert
Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing
Information and Technology Solutions (ITS)
303-273-3786 | mrobb...@mines.edu
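A quick way to check this with the standard munge tools (node01 is just a
placeholder for any other host in the cluster):

    # Local round trip: munged is running and can encode/decode a credential
    munge -n | unmunge
    # Cross-host check: the key matches the rest of the cluster
    munge -n | ssh node01 unmunge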
I suspect you have too low of a setting for "MaxJobCount"
*MaxJobCount*
The maximum number of jobs SLURM can have in its active database
at one time. Set the values of *MaxJobCount* and *MinJobAge* to
insure the slurmctld daemon does not exhaust its memory
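For illustration, these are set in slurm.conf; the values below are only
examples and should be sized for your workload:

    MaxJobCount=100000   # upper bound on jobs slurmctld keeps in its active database
    MinJobAge=300        # seconds a completed job record is retained before it can be purged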
Thanks Brian. We have it set to 100k, which has really improved our
performance on the A partition. We queue up 50k+ jobs nightly, and see
really good node utilization, so deep jobs are being considered.
It could be that we have the scheduler too busy doing certain things, so that
it takes a while for
Have you looked at the High Throughput Computing Administration Guide:
https://slurm.schedmd.com/high_throughput.html
In particular, for this problem it may help to look at the SchedulerParameters.
I believe the scheduler defaults are very conservative and it will stop looking
for jobs to run pretty
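For example, options along these lines are the kind that guide discusses (a
sketch; the exact values are assumptions and need tuning, but
default_queue_depth and partition_job_depth control how far down the queue the
main scheduler looks, and the bf_* options control the backfill pass):

    SchedulerParameters=default_queue_depth=1000,partition_job_depth=500,bf_continue,bf_max_job_test=1000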
We are pleased to announce the availability of Slurm release candidate
version 22.05rc1.
To highlight some new features coming in 22.05:
- Support for dynamic node addition and removal
- Support for native cgroup/v2 operation
- Newly added plugins to support HPE Slingshot 11 networks
(switch/h