Hello,
I'm new to Slurm (coming from PBS), and so I will likely have a few
questions over the next several weeks, as I work to transition my
infrastructure from PBS to Slurm.
My first question has to do with *adding nodes to Slurm*. According to the
FAQ (and other articles I've read), you need t
I'm working on populating slurm.conf on my nodes, and I noticed that slurmd
-C doesn't always agree with lscpu, and I'm not sure why. Here is what lscpu
reports:
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
And here is what slurmd -C is reporting:
NodeName=devops2
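For reference, on hardware matching the lscpu output above, slurmd -C would
typically print a single line along these lines (RealMemory value hypothetical):
NodeName=devops2 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=2 RealMemory=16000
Here CPUs is Sockets x CoresPerSocket x ThreadsPerCore, i.e. 1 x 2 x 2 = 4 for
the node above.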
Hello all. My team is enabling Slurm (version 20.11.5) in our environment,
and we have a controller up and running, along with 2 nodes. Everything was
working fine. However, when we tried to enable configless mode, I ran into a
problem. The node that has a GPU is coming up in "drained" state, and
s
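For context, a minimal configless setup looks roughly like this (controller
hostname hypothetical); in this mode the controller also serves gres.conf, so
the GPU node's gres definition has to live on the controller side:
# slurm.conf on the controller
SlurmctldParameters=enable_configless
# on each compute node, instead of shipping a local slurm.conf
slurmd --conf-server ctld-host:6817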
We are transitioning from PBS to Slurm. In PBS, we use the following
syntax to add or remove properties on a node:
qmgr -c "set node <nodename> properties += <property>"
qmgr -c "set node <nodename> properties -= <property>"
Is there a similar way to do this for Slurm? Or is it expected that the
administrator will manually edit slurm.co
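The closest Slurm analogue to PBS node properties is node features; a hedged
sketch (node and feature names hypothetical):
# runtime change, not persistent across an slurmctld restart
scontrol update NodeName=node01 AvailableFeatures=ssd,bigmem ActiveFeatures=ssd,bigmem
# persistent form: edit the node's feature list in slurm.conf, then
scontrol reconfigure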
Hello,
I just added a 3rd node to my Slurm partition (called "hsw5"), as we
continue to enable Slurm in our environment. But the new node is not
accepting jobs that require a GPU, despite the fact that it has 3 GPUs.
The other node that has a GPU ("devops3") is accepting GPU jobs as
expected. A
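A hedged checklist for this kind of problem (node name and device paths
hypothetical): check what the controller thinks the node has, and make sure
slurm.conf and gres.conf agree with each other:
scontrol show node newnode | grep -iE 'gres|reason'
# slurm.conf:  NodeName=newnode Gres=gpu:3 ...
# gres.conf on the node:  NodeName=newnode Name=gpu File=/dev/nvidia[0-2]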
Hello,
I am investigating Slurm's ability to requeue jobs. I like the fact that I
can set RequeueExit=<exit codes> in the slurm.conf file, since this will
automatically requeue jobs that exit with the specified exit codes. But is
there a way to limit the number of requeues?
Thanks
David
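I am not aware of a built-in cap on RequeueExit requeues; one hedged
workaround is to have the batch script check SLURM_RESTART_COUNT (which Slurm
sets on requeued jobs) and stop retriggering once a limit is reached:
# near the top of the batch script; the limit of 3 is arbitrary
if [ "${SLURM_RESTART_COUNT:-0}" -ge 3 ]; then
    echo "requeue limit reached, giving up" >&2
    exit 0    # exit with a code that is NOT listed in RequeueExit
fi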
If I execute a bunch of sbatch commands, can I use sacct (or something
else) to show me the original sbatch command line for a given job ID?
Thanks
David
Hello,
I just noticed today that when I run "sinfo --states=idle", I get all the
idle nodes, plus an additional node that is in the "DRAIN" state (notice
how xavier6 is showing up below, even though it's not in the idle state):
(! 807)-> sinfo --states=idle
PARTITION AVAIL TIMELIMIT NODES STATE
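A way to see what is going on here: the node is most likely in a combined
state such as IDLE+DRAIN, which scontrol reports in full, and drained nodes
can also be listed on their own:
scontrol show node xavier6 | grep -i state
sinfo --states=drain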
All,
When my team used PBS, we had several nodes that had a TON of CPUs, so
many, in fact, that we ended up setting np to a smaller value, in order to
not starve the system of memory.
What is the best way to do this with Slurm? I tried modifying # of CPUs in
the slurm.conf file, but I noticed th
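One hedged alternative to understating the CPU count is to reserve cores and
memory for the OS with the specialized-resource options on the NodeName line
in slurm.conf (names and values hypothetical):
NodeName=bignode[01-04] CPUs=256 RealMemory=512000 CoreSpecCount=8 MemSpecLimit=16384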
Hello,
A few weeks ago, we tested Slurm against about 50K jobs, and observed at
least one instance where a node went idle, while there were jobs on the
queue that could have run on the idle node. The best guess as to why this
occurred, at this point, is that the default_queue_depth was set to the
Assuming -N is 1 (meaning, this job needs only one node), then is there a
difference between any of these 3 flag combinations:
-n 64 (leaving cpus-per-task to be the default of 1)
--cpus-per-task 64 (leaving -n to be the default of 1)
--cpus-per-task 32 -n 2
As far as I can tell, there is no fun
is significant.
>
> > On Mar 24, 2022, at 12:32 PM, David Henkemeyer <
> david.henkeme...@gmail.com> wrote:
> >
> > Assuming -N is 1 (meaning, this job needs only one node), then is there
> a difference between any of these 3 flag combinations:
> >
> > -n
; will likely bite you in the end. E.g., the 64 thread case should do
> "--cpus-per-task 64", and the launching processes in the loop should
> _probably_ do "-n 64" (assuming it can handle the tasks being assigned to
> different nodes).
>
> On Thu, Mar 24, 2022 at 3:
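To make the distinction concrete, a hedged sketch (program names hypothetical):
# one task that may use 64 CPUs on a single node (one multithreaded process)
sbatch -N 1 --cpus-per-task=64 --wrap='./threaded_app'
# 64 one-CPU tasks; an srun inside the job would launch 64 processes
sbatch -N 1 -n 64 --wrap='srun ./per_task_app'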
We noticed that we can pass --cpu_bind on an srun command line, but not to
sbatch. Why is that?
Thanks
David
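As far as I understand it, CPU binding is applied when tasks are launched,
and sbatch never launches tasks itself; the usual pattern is to pass the
option to the srun inside the batch script (program name hypothetical):
#!/bin/bash
#SBATCH -n 16
srun --cpu-bind=cores ./my_app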
If I have a large number of heterogeneously named nodes in my cluster, and
several partitions that include the same large subset of those nodes, I
would love to be able to define an env var, and reference that in each
partition specification. For instance, say we have the following:
PartitionName
> ...included as part of this nodeset.
> Nodes
>     List of nodes in this set.
> NodeSet
>     Unique name for a set of nodes. Must not overlap with any NodeName
>     definitions.
>
> Brian Andrus
>
>
> On 4/4/2022 1:08 PM, David Henkemeyer wrote:
>
> If I have a large nu
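For the archives, a hedged slurm.conf sketch of the NodeSet approach Brian
describes (all names hypothetical):
NodeSet=compute Nodes=hsw[1-40],icx[01-60],gpu[01-08]
PartitionName=short Nodes=compute MaxTime=01:00:00 State=UP
PartitionName=long Nodes=compute MaxTime=7-00:00:00 State=UP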
All,
I want to improve our daily Slurm job reports. Can anyone point me to
some good examples? Currently we are reporting on several things, such as
# of jobs that failed to schedule, # of jobs that failed during execution,
node utilization, etc, but the report itself is pretty basic and not
I have found that the "reason" field doesn't get updated after you correct
the issue. For me, it's only when I move the node back to the idle state
that the reason field is reset. So, assuming /dev/nvidia[0-3] is
correct (I've never seen otherwise with nvidia GPUs), then try taking them
back
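For completeness, returning a drained node to service (which also clears the
reason field) is typically done with:
scontrol update NodeName=<nodename> State=RESUME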
I am seeing what I think might be a bug with sacct. When I do the
following:
> sbatch --export=NONE --wrap='uname -a' --exclusive
Submitted batch job 2869585
Then, I ask sacct for the SubmitLine, as such:
> sacct -j 2869586 -o "SubmitLine%-70"
SubmitLine
----------------------------------------------------------------------
sbatch --export=NONE --wrap=uname -a --exclusive
So, it's storing properly; now I need to see if I can figure out how to
preserve/add the quotes on the way out of the DB...
David
On Wed, May 4, 2022 at 11:15 AM Michael Jennings wrote:
> On Wednesday, 04 May 2022, at 10:00:57 (-0700),
> Davi
The Epilog is a feature whereby I can run something after a single job
finishes. Is there a best practice for running a job after a set of jobs?
We submit a bunch of jobs to a bunch of nodes, and after all the jobs are
done, we would like to submit a "utility job" on each node, but it has to
be the last job
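One hedged way to do this with job dependencies (script and node names
hypothetical): collect the job IDs with --parsable and make the utility job
wait on all of them, pinning it to the node with -w:
jid1=$(sbatch --parsable -w node01 work1.sh)
jid2=$(sbatch --parsable -w node01 work2.sh)
sbatch -w node01 --dependency=afterany:${jid1}:${jid2} utility.sh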
Question for the braintrust:
I have 3 partitions:
- Partition A_highpri: 80 nodes
- Partition A_lowpri: same 80 nodes
- Partition B_lowpri: 10 different nodes
There is no overlap between A and B partitions.
Here is what I'm observing. If I fill the queue with ~20-30k jobs for
partiti
5000 jobs being considered, the
> remaining aren't even looked at.
>
> Brian Andrus
> On 5/12/2022 7:34 AM, David Henkemeyer wrote:
>
> Question for the braintrust:
>
> I have 3 partitions:
>
>- Partition A_highpri: 80 nodes
>- Partition A_lowpri: same 80 n
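If the job-depth limit Brian mentions is the culprit, the relevant knobs are
in SchedulerParameters in slurm.conf; a hedged example with arbitrary values:
SchedulerParameters=default_queue_depth=10000,bf_max_job_test=5000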
I would like to remove the restriction that users must be at least operator
level to do "scontrol create reservation". So, either I could:
- Change the default AdminLevel to operator. Is that possible?
- Remove the restriction that a user has to be operator to create a
reservation. Is
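I don't believe there is a way to change the default AdminLevel globally, but
granting Operator to specific users through the accounting database is
straightforward (username hypothetical):
sacctmgr modify user name=alice set AdminLevel=Operator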