Daniel,
One way to set up true HA is to configure master-master SQL instances
on both head nodes. Then have each slurmdbd point to the other SQL
instance as the backup host.
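For example (a rough sketch; head1/head2 are made-up hostnames, but
StorageHost and StorageBackupHost are the relevant slurmdbd.conf settings):

    # slurmdbd.conf on head1
    StorageHost=head1
    StorageBackupHost=head2

    # slurmdbd.conf on head2
    StorageHost=head2
    StorageBackupHost=head1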
This is likely not necessary, as all data going to slurmdbd is cached if
slurmdbd is unavailable. In the real world,
Steven,
Looks like you may have had a secondary controller that took over and
changed your StateSave files.
IF you don't need the job info AND no jobs are running, you can just
rename/delete your StateSaveLocation directory and things will be
recreated. Job numbers will start over (unless you have FirstJobId set).
Run 'sinfo -R' to see the reason any nodes may be down.
It may be as simple as running 'scontrol update state=resume
nodename=' to bring them back, if they are down. It depends on the
reason they went down (if that is the issue).
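A made-up session to illustrate (node01 and the reason are examples):

    $ sinfo -R
    REASON               USER      TIMESTAMP           NODELIST
    Not responding       slurm     2025-01-03T09:12:01 node01
    $ scontrol update state=resume nodename=node01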
Otherwise, check the job requirements to see what it is asking for.
Ensure cgroups are working and configured to limit access to devices
(which includes GPUs).
Check your cgroup.conf to see that there is an entry for:
ConstrainDevices=yes
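A minimal sketch of what that might look like (the other Constrain
settings are optional and site-specific):

    # cgroup.conf
    ConstrainDevices=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes

Note you also need TaskPlugin=task/cgroup (optionally alongside
task/affinity) in slurm.conf for these constraints to take effect.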
Brian Andrus
On 1/3/2025 10:49 AM, Groner, Rob via slurm-users wrote:
I'm not entirely sure, and I can't vouch for di
Not sure anyone would know, but...
If you are running slurm in HA mode (multiple SlurmctldHost entries), is
it possible to point sackd to more than one using the --conf-server option?
So either specify --conf-server more than once, or have a
comma-delimited list of them?
The docs are a little
You only have one partition, named 'default'.
You are not allowed to name it that. Name it something else and you
should be good.
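For example, in slurm.conf (node and partition names here are made up):

    # PartitionName=default Nodes=node[01-04] Default=YES   <- not allowed
    PartitionName=general Nodes=node[01-04] Default=YES

Default=YES is what marks the default partition; the name itself can be
anything else.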
Brian Andrus
On 11/28/2024 6:52 AM, Patrick Begou via slurm-users wrote:
Hi Kent,
on your management node, could you run:
systemctl status slurmctld
and check your
On Friday, November 1, 2024, 1:12 AM, Brian Andrus via slurm-users
wrote:
Likely many ways to do this, but if you have some code that is dependent
on something, that check could be in the code itself.
So instead of process 0 being the required process to run, it would be
whichever process meets the requirements.
eg:

    case "$(hostname -s)" in
        harold)
            # run harold's stuff here
            ;;
    esac
Looks like there is a step you would need to do to create the required
job mapping files:
The DCGM-exporter can include High-Performance Computing (HPC) job
information into its metric labels. To achieve this, HPC environment
administrators must configure their HPC environment to generate fil
IIRC, you need to ensure the DNS reverse lookup matches your nodename.
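A quick sanity check, assuming a node named node01 at 10.0.0.11 (both
made up):

    $ host 10.0.0.11                      # reverse lookup should return node01
    $ scontrol show node node01 | grep -i addr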
Brian Andrus
On 9/20/2024 4:55 PM, Jakub Szarlat via slurm-users wrote:
Hi
I'm using dynamic nodes with "slurmd -Z" with slurm 23.11.1.
Firstly, I find that when you do "scontrol show node" it shows the NODEADDR as
ip rather
Folks have addressed the obvious config settings, but also check your
prolog/epilog scripts/settings as well as the .bashrc/.bash_profile and
stuff in /etc/profile.d/
That may be hanging it up.
Brian Andrus
On 9/5/2024 5:17 AM, Loris Bennett via slurm-users wrote:
Hi,
With
$ salloc --ver
Angel,
Unless you are using cgroups and constraints, there is no limit imposed.
The numbers are used by slurm to track what is available, not what you
may/may not use. So you could tell slurm the node only has 1GB and it
will not let you request more than that, but if you do request only 1GB,
to "--nodes=1,5,9,13".
...
The job will be allocated as many nodes as possible within the range specified
and without delaying the
initiation of the job.
____
From: Brian Andrus via slurm-users
Sent: Thursday, August 29, 2024 7:27:44 PM
To:slurm-users@lists.sche
there is this example for <size_string>:
--nodes=1,5,9,13
so either one specifies <minnodes>[-maxnodes] OR <size_string>.
I checked the logs, and there are no reported errors about wrong or ignored
options.
MG
From: Brian Andrus via slurm-users
Sent: Thursday, August 29, 2024 4:11:25 PM
To:
Your --nodes line is incorrect:

-N, --nodes=<minnodes>[-maxnodes]
    Request that a minimum of minnodes nodes be allocated to this job. A
    maximum node count may also be specified with maxnodes.
Looks like it ignored that and used ntasks with ntasks-per-node as 1,
giving you 3 nodes. Check your
IIRC, slurm parses the batch file for options until it hits the first
non-comment line, and that includes blank lines.
You may want to double-check some of the gaps in the option section of
your batch script.
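For example (a made-up script), anything after the first executable line
is ignored, so keep the directives together at the top:

    #!/bin/bash
    #SBATCH --job-name=test
    #SBATCH --ntasks=4
    hostname                # first executable line
    #SBATCH --mem=8G        # too late: ignored by sbatch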
That being said, you say you removed the '&' at the end of the
command, which would
It sounds like the new version was built with different options, and/or
an install was not done via packages.
If you do use rpms, you could try:
dnf provides /usr/lib64/slurm/mpi_none.so
If that shows a package that is installed, remove it. If it shows
nothing, move the file elsewhere and
If you need it, you could add it to either the prolog or epilog to store
the info somewhere.
I do that for the scripts themselves and keep the past two weeks backed
up so we can debug if/when there is an issue.
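One way to do that (an untested sketch; the destination path is made up,
though 'scontrol write batch_script' is a real subcommand):

    #!/bin/bash
    # epilog fragment: keep a copy of the job's batch script for debugging
    scontrol write batch_script "$SLURM_JOB_ID" \
        /var/spool/slurm/saved_scripts/"$SLURM_JOB_ID".sh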
Brian Andrus
On 8/7/2024 6:29 AM, Steffen Grunewald via slurm-users wrote:
On We
Felix,
Finished jobs roll off the list shown in squeue, so that may be no
surprise (depending on settings). If there was a power failure that
caused the nodes to restart, it could also be that the job had not been
written to slurmdbd, making it unavailable to sacct as well.
Your logs look to
Generally speaking, when the batch script exits, slurm will clean up (i.e.
kill) any stray processes.
So, I would expect that executable to be killed at cleanup.
Brian Andrus
On 7/26/2024 2:45 AM, Steffen Grunewald via slurm-users wrote:
On Fri, 2024-07-26 at 10:42:45 +0300, Slurm users wrote:
Martin,
In a nutshell, when slurmd starts, it tells that info to slurmctld. That
is the "registration" event mentioned.
Brian Andrus
On 7/19/2024 5:44 AM, Martin Lee via slurm-users wrote:
I've read the following in the slurm power saving docs:
https://slurm.schedmd.com/power_save.html
*cl
You probably want to look at scontrol show node and scontrol show job
for that node and the jobs on it.
Compare those and you may find someone requested almost all the resources
but is not actually using them. Look at the job itself to see what it
is trying to do.
Brian Andrus
On 7/11/202
Jack,
To make sure things are set right, run 'slurmd -C' on the node and use
that output in your config.
It can also give you insight as to what is being seen on the node versus
what you may expect.
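The output is a ready-to-paste node definition, something like (numbers
made up):

    $ slurmd -C
    NodeName=node01 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=128000
    UpTime=12-04:33:12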
Brian Andrus
On 7/10/2024 1:25 AM, jack.mellor--- via slurm-users wrote:
Hi,
We are runni
PU=16 DefMemPerGPU=128985
And on the compute node, /etc/slurm/gres.conf is:
Name=gpu File=/dev/nvidia[0-7]
Name=shard Count=32
Thank you!
--
Ricardo Cruz - https://rpmcruz.github.io
Brian Andrus via slurm-users wrote
(Thursday, 4/07/2024 at 17:16):
To help dig into it, can you paste the full output of scontrol show node
compute01 while the job is pending? Also 'sinfo' would be good.
It is basically telling you there aren't enough resources in the
partition to run the job. Often this is because all the nodes are in use
at that moment.
Brian Andrus
Carl,
You cannot tell from the binary alone.
It looks like you just did an apt-get install slurm or such under
Ubuntu. Would that be right?
You may be able to look at the package and see info about the build
environment.
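For example, on Ubuntu (assuming the distro slurm-wlm packages; names may
vary by release):

    $ dpkg -l | grep slurm        # installed slurm packages and versions
    $ apt-cache show slurm-wlm    # package metadata, maintainer, origin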
Generally, it is best to build slurm yourself for the environment it is
Well, if I am reading this right, it makes sense.
Every job will need at least 1 core just to run, and if there are only 4
cores on the machine, one would expect a max of 4 jobs to run.
Brian Andrus
On 6/20/2024 5:24 AM, Arnuld via slurm-users wrote:
I have a machine with a quad-core CPU and a
That SIGTERM message means something is telling slurmdbd to quit.
Check your cron jobs, maintenance scripts, etc. Slurmdbd is being told
to shut down. If you are running in the foreground, a ^C does that. If
you run a kill or killall on it, you will get that same message.
Brian Andrus
On 5/30
Oh, to address the train that already passed:
Restore the archive data with "sacctmgr archive load", then you can do
as you need.
From man sacctmgr:

archive {dump|load}
    Write database information to a flat file or load information that
    has previously been written to a file.
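Usage is along these lines (the file path is made up):

    sacctmgr archive load file=/var/spool/slurm/archive/job_archive_2023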
Brian Andrus
Setup
Instead of using the archive files, couldn't you query the db directly
for the info you need?
I would recommend sacct/sreport if they can provide what you need.
Brian Andrus
On 5/28/2024 9:59 AM, O'Neal, Doug (NIH/NCI) [C] via slurm-users wrote:
My organization needs to access historic job
On 5/23/2024 6:16 AM, Christopher Samuel via slurm-users wrote:
On 5/22/24 3:33 pm, Brian Andrus via slurm-users wrote:
A simple example is when you have nodes with and without GPUs.
You can build slurmd packages without for those nodes and with for
the ones that have them.
FWIW we have bot
Not that I recommend it much, but you can build them for each
environment and install the ones needed in each.
A simple example is when you have nodes with and without GPUs.
You can build slurmd packages without for those nodes and with for the
ones that have them.
Generally, so long as versi
Rike,
Assuming the data, scripts and other dependencies are already on the
cluster, you could just ssh and execute the sbatch command in a single
shot: ssh submitnode sbatch some_script.sh
It will ask for a password if appropriate and could use ssh keys to
bypass that need.
Brian Andrus
O
/...). Wouldn't Slurm pick up that one?
Thanks!
Jeff
On Fri, Apr 19, 2024 at 1:11 PM Brian Andrus via slurm-users
wrote:
This is because you have no slurm.conf in /etc/slurm, so it is
trying 'configless' which queries DNS to find out where to get the
config. It is
This is because you have no slurm.conf in /etc/slurm, so it is trying
'configless' which queries DNS to find out where to get the config. It
is failing because you do not have DNS configured to tell nodes where to
ask about the config.
Simple solution: put a copy of slurm.conf in /etc/slurm
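For reference, the configless lookup expects a DNS SRV record, roughly
like this (zone and host names made up):

    _slurmctld._tcp.cluster.example.com. 3600 IN SRV 10 0 6817 ctld.cluster.example.com.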
Xaver,
If you look at your slurmctld log, you will likely see messages
about each node's slurm.conf not being the same as that on the master.
So, yes, it can work temporarily, but unless there are some very
specific settings done, issues will arise. The state you are in now, you
will wa
Yes. You can build the EL8 rpms on EL9. Look at 'mock' to do so. I did
something similar when I still had to support EL7.
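Roughly like this (the chroot name is a standard mock config; the srpm
version is just an example):

    mock -r rocky-8-x86_64 --rebuild slurm-23.11.6-1.el8.src.rpm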
A fairly generic plan; the devil is in the details and in verifying each
step, but those are the basic bases you need to touch.
Brian Andrus
On 4/10/2024 1:48 PM, Steve Berg via slurm-users
Xaver,
You may want to look at the ResumeRate option in slurm.conf:
ResumeRate
The rate at which nodes in power save mode are returned to normal
operation by ResumeProgram. The value is a number of nodes per
minute and it can be used to prevent power surges if a large number
of no
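In slurm.conf that would be something like (the value is only an example;
zero means no limit):

    ResumeRate=10    # wake at most 10 nodes per minute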
, Brian Andrus via slurm-users
wrote:
Quick correction, it is StateSaveLocation, not SlurmSaveState.
Brian Andrus
On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote:
Dear all, I am having trouble finalizing the configuration of
the backup controller for my slurm
Quick correction, it is StateSaveLocation, not SlurmSaveState.
Brian Andrus
On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote:
Dear all,
I am having trouble finalizing the configuration of the backup
controller for my slurm cluster.
In principle, if no job is running everything seems f
Miriam,
You need to ensure the SlurmSaveState directory is the same for both.
And by 'the same', I mean all contents are exactly the same.
This is usually achieved by using a shared drive or replication.
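In other words, both controllers' slurm.conf should point at one shared
directory, e.g. (path made up):

    StateSaveLocation=/shared/slurm/statesave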
Brian Andrus
On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote:
Dear all,
I am hav
Wow, snazzy!
Looks very good. My compliments.
Brian Andrus
On 3/12/2024 11:24 AM, Victoria Hobson via slurm-users wrote:
Our website has gone through some much needed change and we'd love for
you to explore it!
The new SchedMD.com is equipped with the latest information about
Slurm, your favo
Chip,
I use 'sacct' rather than sreport and get individual job data. That is
ingested into a db and PowerBI, which can then aggregate as needed.
sreport is pretty general and likely not the best for accurate
chargeback data.
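Something like this monthly pull is what feeds the db (dates and field
list are just examples):

    sacct -a -P -S 2024-03-01 -E 2024-03-31 \
        --format=JobID,User,Account,Partition,AllocCPUS,Elapsed,CPUTimeRAW,State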
Brian Andrus
On 3/4/2024 6:09 AM, Chip Seraphine via slurm-users
Joseph,
You will likely get many perspectives on this. I disable swap completely
on our compute nodes. I can be draconian that way. For the workflow
supported, this works and is a good thing.
Other workflows may benefit from swap.
Brian Andrus
On 3/3/2024 11:04 PM, John Joseph via slurm-user
Brian Andrus
On 2/28/2024 12:54 PM, Dan Healy wrote:
Are most of us using HAProxy or something else?
On Wed, Feb 28, 2024 at 3:38 PM Brian Andrus via slurm-users
wrote:
Magnus,
That is a feature of the load balancer. Most of them have that
these days.
Brian Andrus
Magnus,
That is a feature of the load balancer. Most of them have that these days.
Brian Andrus
On 2/28/2024 12:10 AM, Hagdorn, Magnus Karl Moritz via slurm-users wrote:
On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users wrote:
for us, we put a load balancer in front of the
Josef,
for us, we put a load balancer in front of the login nodes with session
affinity enabled. This makes them land on the same backend node each time.
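With HAProxy, for instance, that is roughly (IPs and names made up):

    frontend ssh_in
        bind *:22
        mode tcp
        default_backend login_nodes

    backend login_nodes
        mode tcp
        balance source        # same client IP -> same login node
        server login1 10.0.0.21:22 check
        server login2 10.0.0.22:22 check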
Also, for interactive X sessions, users start a desktop session on the
node and then use vnc to connect there. This accommodates disconnect
I imagine you could create a reservation for the node and then when you
are completely done, remove the reservation.
Each helper could then target the reservation for the job.
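Roughly (reservation name, node, and user are made up):

    scontrol create reservation ReservationName=maint_res \
        Nodes=node01 StartTime=now Duration=7-00:00:00 Users=alan
    # helpers then submit with: sbatch --reservation=maint_res job.sh
    # and when done:
    scontrol delete ReservationName=maint_res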
Brian Andrus
On 2/9/2024 5:52 PM, Alan Stange via slurm-users wrote:
Chip,
Thank you for your prompt response. We c