[slurm-users] Slurm release candidate version 22.05rc1 available for testing

2022-05-12 Thread Tim Wickberg
We are pleased to announce the availability of Slurm release candidate version 22.05rc1. To highlight some new features coming in 22.05:
- Support for dynamic node addition and removal
- Support for native cgroup/v2 operation
- Newly added plugins to support HPE Slingshot 11 networks (switch/h
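
For anyone wanting to try the native cgroup/v2 support, a minimal sketch of what opting in might look like in cgroup.conf (the parameters below are only illustrative; check the 22.05 cgroup.conf man page for your site):

  # cgroup.conf - assumes the nodes boot with the unified cgroup v2 hierarchy
  CgroupPlugin=cgroup/v2
  ConstrainCores=yes
  ConstrainRAMSpace=yes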

Re: [slurm-users] [External] Re: Question about having 2 partitions that are mutually exclusive, but have unexpected interactions

2022-05-12 Thread Michael Robbert
Have you looked at the High Throughput Computing Administration Guide: https://slurm.schedmd.com/high_throughput.html In particular, the answer for this problem may be to look at the SchedulerParameters. I believe that the scheduler defaults are very conservative and it will stop looking for jobs to run pretty
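
As a rough illustration of the kind of SchedulerParameters tuning that guide discusses (the values below are made up, not recommendations; pick them based on your own workload):

  # slurm.conf - example only
  SchedulerParameters=default_queue_depth=1000,partition_job_depth=500,bf_max_job_test=1000,bf_continue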

Re: [slurm-users] Question about having 2 partitions that are mutually exclusive, but have unexpected interactions

2022-05-12 Thread David Henkemeyer
Thanks Brian. We have it set to 100k, which has really improved our performance on the A partition. We queue up 50k+ jobs nightly, and see really good node utilization, so deep jobs are being considered. It could be that we have the scheduler so busy doing certain things that it takes a while for
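
One way to see whether the scheduler is spending too long per cycle is sdiag, for example:

  sdiag      # look at main scheduler and backfill cycle times and queue depths
  sdiag -r   # reset the counters, e.g. just before the nightly submission burst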

Re: [slurm-users] Question about having 2 partitions that are mutually exclusive, but have unexpected interactions

2022-05-12 Thread Brian Andrus
I suspect you have too low a setting for "MaxJobCount". *MaxJobCount*: The maximum number of jobs SLURM can have in its active database at one time. Set the values of *MaxJobCount* and *MinJobAge* to ensure the slurmctld daemon does not exhaust its me
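
A hedged sketch of what that might look like in slurm.conf (the numbers are illustrative, not recommendations):

  MaxJobCount=100000   # room well above the nightly 50k+ submission burst
  MinJobAge=300        # seconds a completed job stays in slurmctld memory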

Re: [slurm-users] [External] Re: [EXT] Software and Config for Job submission host only

2022-05-12 Thread Michael Robbert
Don’t forget about munge. You need to have munged running with the same key as the rest of the cluster in order to authenticate. Mike Robbert, Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing, Information and Technology Solutions (ITS), 303-273-3786 | mrobb...@mines.edu
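
A minimal sketch of setting that up on a submit host, assuming the key already exists on the controller (the hostname "slurmctl" and the paths are just placeholders):

  # copy the cluster's existing munge key to the submit host
  scp slurmctl:/etc/munge/munge.key /etc/munge/munge.key
  chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key
  systemctl enable --now munge
  # quick round-trip test against the controller
  munge -n | ssh slurmctl unmunge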

[slurm-users] Question about having 2 partitions that are mutually exclusive, but have unexpected interactions

2022-05-12 Thread David Henkemeyer
Question for the braintrust: I have 3 partitions:
- Partition A_highpri: 80 nodes
- Partition A_lowpri: same 80 nodes
- Partition B_lowpri: 10 different nodes
There is no overlap between the A and B partitions. Here is what I'm observing. If I fill the queue with ~20-30k jobs for partiti
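
For reference, a hedged sketch of slurm.conf partition definitions matching that description (node names and PriorityTier values are assumptions, not taken from the original post):

  PartitionName=A_highpri Nodes=a[001-080] PriorityTier=10
  PartitionName=A_lowpri  Nodes=a[001-080] PriorityTier=1
  PartitionName=B_lowpri  Nodes=b[001-010] PriorityTier=1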

Re: [slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"

2022-05-12 Thread Paul Edmon
They fixed this in newer versions of Slurm. We had the same issue with older versions, so we had to run with the config_override option on to keep the logs quiet. They changed the way logging is done in the more recent releases and it's not as chatty. -Paul Edmon- On 5/12/22 7:35 AM, Per Lönn
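
For reference, in current slurm.conf syntax the option being referred to is spelled config_overrides (in 19.05 the rough equivalent was FastSchedule=2); a one-line sketch:

  SlurmdParameters=config_overrides   # trust the slurm.conf node definitions; don't drain on mismatch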

Re: [slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"

2022-05-12 Thread Bjørn-Helge Mevik
Per Lönnborg writes: > I "forgot" to tell our version because it's a bit embarrassing - 19.05.8... Haha! :D -- B/H

Re: [slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"

2022-05-12 Thread Bjørn-Helge Mevik
Per Lönnborg writes: > Greetings, Good day! > is there a way to lower the log rate on error messages in slurmctld for nodes > with hardware errors? You don't say which version of Slurm you are running, but I think this was changed in 21.08, so the node will only try to register once if it has

[slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"

2022-05-12 Thread Per Lönnborg
Greetings, is there a way to lower the log rate on error messages in slurmctld for nodes with hardware errors? We see for example this for a node that has DIMM errors:
[2022-05-12T07:07:34.757] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:35.760] error:
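
Aside from quieting the log, one common workaround until the DIMM is replaced is to lower the node's configured RealMemory in slurm.conf to at or below what the degraded node actually reports, e.g. (the value is taken from the log line above; the rest of the node definition is omitted):

  NodeName=node37 RealMemory=257642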

Re: [slurm-users] [EXT] Software and Config for Job submission host only

2022-05-12 Thread Tina Friedrich
Hi Richard, I was about to say, they need to have access to the configuration (slurm.conf) and the binaries. And not run slurmd; starting slurmd is what makes an execution host :) There is nothing you need to do to allow job submission from them. I build rpms; on the login nodes I install th
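
As a sketch of that split (the package names below are the ones the stock slurm.spec produces; locally built rpms may be named differently), a login/submit node would get something like:

  dnf install slurm slurm-perlapi slurm-contribs
  # ...but not slurm-slurmd or slurm-slurmctld, so no daemons run there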

Re: [slurm-users] [EXT] Software and Config for Job submission host only

2022-05-12 Thread Ozeryan, Vladimir
Hello, All you need to set up is the path to the Slurm binaries, whether they are available via a shared file system or locally on the submit nodes (srun, sbatch, sinfo, sacct, etc.), and possibly man pages. You probably want to do this somewhere in /etc/profile.d or equivalent.
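
A minimal sketch of such a profile.d snippet, assuming a shared-filesystem install under /opt/slurm (adjust the prefix for your site):

  # /etc/profile.d/slurm.sh
  export PATH=/opt/slurm/bin:$PATH
  export MANPATH=/opt/slurm/share/man:$MANPATH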

[slurm-users] Software and Config for Job submission host only

2022-05-12 Thread Richard Chang
Hi, I am new to SLURM and I am still trying to understand stuff. There is ample documentation available that teaches you how to set it up quickly. Pardon me if this was asked before; I was not able to find anything pointing to this. I am trying to figure out if there is something like PBS-