Thank you, Brian,
while ResumeRate might be able to keep the CPU usage within an
acceptable margin, it's not really a fix but a workaround. I would
prefer a solution that groups resume requests and therefore uses a
single Ansible playbook run per second instead of up to ResumeRate runs.
As we com
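For what it's worth, a minimal sketch of the grouping I have in mind: a
ResumeProgram wrapper that hands the whole batch of nodes to a single
ansible-playbook run (the playbook path below is made up; Slurm passes the
node names to ResumeProgram as one hostlist expression):

  #!/bin/bash
  # Slurm invokes ResumeProgram with a hostlist expression, e.g. "node[01-16]".
  # Expand it and run one ansible-playbook for the whole batch,
  # rather than one playbook run per node.
  NODES=$(scontrol show hostnames "$1" | paste -sd, -)
  # /etc/slurm/resume.yml is a hypothetical playbook path.
  exec ansible-playbook /etc/slurm/resume.yml --limit "$NODES"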
Agree with that. Plus, of course, even if the jobs run a bit slower by not
having all the cores on a single node, they will be scheduled sooner, so the
overall turnaround time for the user will be better, and ultimately that's what
they care about. I've always been of the view, for any schedu
Hello List,
does anyone have experience with DefCpuPerGPU and jobs requesting
multiple partitions? I would expect Slurm to select a partition from
those requested by the job, then assign CPUs based on that
partition's DefCpuPerGPU. But according to my observations, it appears
that (at least someti
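To make the scenario concrete, this is the kind of setup I mean (partition
and node names below are only examples):

  # slurm.conf
  PartitionName=gpu_a Nodes=gpunode[01-02] DefCpuPerGPU=4
  PartitionName=gpu_b Nodes=gpunode[03-04] DefCpuPerGPU=8

  # job submitted to both partitions
  sbatch --partition=gpu_a,gpu_b --gres=gpu:2 job.sh

I would expect the job to get 8 CPUs if it lands in gpu_a and 16 if it
lands in gpu_b.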
I wrote a little blog post on this topic a few years back:
https://www.rc.fas.harvard.edu/blog/cluster-fragmentation/
It's a vexing problem, but as noted by the other responders it is
something that depends on your cluster policy and job performance needs.
Well written MPI code should be able
Hi,
We have a plugin in Lua that mostly does what we want, but there are
features available in the C extension that are not available to Lua. For
that reason, we are attempting to convert to C using the guidance found
here: https://slurm.schedmd.com/job_submit_plugins.html#building. We
arrived here
Hi Gerhard,
I am not sure if this counts as an administrative measure, but we do
highly encourage our users to always explicitly specify --nodes=n
together with --ntasks-per-node=m (rather than just --ntasks=n*m and
omitting the --nodes option, which may lead to cores allocated here and
there and eve
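As a quick illustration of the difference (the numbers are only an example):

  # cores may end up scattered across several nodes:
  sbatch --ntasks=32 job.sh

  # explicit, compact placement on two nodes:
  sbatch --nodes=2 --ntasks-per-node=16 job.sh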
Hi everyone, I'm conducting some tests. I've just set up SLURM on the head
node and haven't added any compute nodes yet. I'm trying to test it to
ensure it's working, but I'm encountering an error: 'Nodes required for the
job are DOWN, DRAINED, or reserved for jobs in higher priority partitions.
*
Alison
The error message indicates that there are no resources to execute jobs.
Since you haven't defined any compute nodes, you will get this error.
I would suggest that you create at least one compute node. Once you do that,
this error should go away.
Jeff
Hi Jeffrey,
I'm sorry, I did add the head node in the compute nodes configuration; this
is the slurm.conf:
# COMPUTE NODES
NodeName=head CPUs=24 RealMemory=184000 Sockets=2 CoresPerSocket=6
ThreadsPerCore=2 State=UNKNOWN
PartitionName=lab Nodes=ALL Default=YES MaxTime=INFINITE State=UP
OverSubscr
Alison
Can you provide the output of the following commands:
* sinfo
* scontrol show node name=head
and the job command that you're trying to run?
Yes! here is the information:
[stsadmin@head ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lab* up infinite 1 down* head
[stsadmin@head ~]$ scontrol show node name=head
Node name=head not found
[stsadmin@head ~]$ sbatch ~/Downloads/test.sh
Submitted batch job 7
[st
Alison
The sinfo output shows that your head node is down due to some configuration error.
Are you running slurmd on the head node? If slurmd is running, find the log
file for it and pass along the entries from it.
Can you also redo the scontrol command? “node name” should be “nodename”, one word.
Aha! That is probably the issue: slurmd! I know slurmd runs on the compute
nodes. I need to deploy this for a lab, but I only have one of the servers
with me. I will be adding them one by one after the first one is set up, so as
not to disrupt their current setup. I want to be able to use the resources from
t
Glen,
I don't think I see it in your message, but are you pointing to the
plugin in slurm.conf with JobSubmitPlugins=? I assume you are but it's
worth checking.
Ryan
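For example, if the C plugin were built and installed as job_submit_my_site.so
(the name is made up here), the slurm.conf entry would look roughly like:

  JobSubmitPlugins=my_site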
Alison
In your case, since you are using head as both a Slurm management node and a
compute node, you'll need to set up slurmd on the head node.
Once slurmd is running, use “sinfo” to see what the status of the node is.
Most likely down, hopefully without an asterisk. If that's the case, then
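In outline, and assuming a systemd-based install (package and service names
vary by distribution), the steps look something like:

  systemctl enable --now slurmd      # start slurmd on head
  sinfo                              # node will likely show as "down"
  scontrol update nodename=head state=resume
  sinfo                              # should now show "idle"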
Thank you so much!!! I have installed slurmd on the head node. Started and
enabled the service, restarted slurmctld. I sent 2 jobs and they are
running!
[stsadmin@head ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10 lab test_
Alison
I’m glad I was able to help. Good luck.
Jeff
We are running a Slurm cluster with version `slurm 22.05.8`. One of our users
has reported that their jobs have been stuck in the completing state for a long
time. Referring to the Slurm Troubleshooting Guide, we
found that indeed the batch host for the job was removed from th
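The remedy sketched in that guide for jobs stuck completing on an unresponsive
node is roughly the following (it assumes the node is still defined in
slurm.conf; the node name below is an example):

  squeue -t CG                      # list jobs stuck in the completing state
  scontrol show job <jobid>         # check the job's BatchHost
  # marking the dead node down and then resuming it clears its completing jobs
  scontrol update nodename=node042 state=down reason="stuck completing"
  scontrol update nodename=node042 state=resume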