Hello Patrick,
Yeah, I'd recommend upgrading, and I imagine most others will, too. I have
found with Slurm that upgrades are nearly mandatory, at least annually or
so, mostly because upgrading from much older versions is more challenging
and requires bootstrapping through intermediate releases. Not sure about the
minus sign; ours works fine, however, without the InteractiveStepOptions
parameter.
JLS
On Thu, Sep 5, 2024 at 9:53 AM Carsten Beyer via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> Hi Loris,
>
> We use Slurm 23.02.7 (production) and 23.11.1 (test system). Our config
> contains a second parameter In
I know this doesn't particularly help you, but for me on 23.11.6 it works
as expected and immediately drops me onto the allocated node. In answer to
your question, yes, as I understand it the default/expected behavior is to
return the shell directly.
Jason
On Thu, Sep 5, 2024 at 8:18 AM Loris Ben
On the one hand, you say you want "to *allocate a whole node* for a single
multi-threaded process," but on the other you say you want to allow it to
"*share nodes* with other running jobs." Those seem like mutually exclusive
requirements.
Jason
On Thu, Aug 1, 2024 at 1:32 PM Henrique Almeida via
Hello all,
The Slurm docs have me a bit confused... I want to enable job
preemption on certain partitions but not others. I *presume* I would set
PreemptType=preempt/partition_prio globally, but then on the partitions
where I don't want jobs to be preempted, I would set
PreemptMode
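A minimal slurm.conf sketch of that layout (partition and node names here
are hypothetical; per the preemption docs, PreemptMode=OFF set on an
individual partition exempts that partition's jobs from being preempted):

  PreemptType=preempt/partition_prio
  PreemptMode=REQUEUE
  # jobs here can be preempted by jobs from higher-PriorityTier partitions
  PartitionName=scavenger Nodes=node[01-10] PriorityTier=1
  # higher tier, and PreemptMode=OFF keeps its own jobs from being preempted
  PartitionName=protected Nodes=node[01-10] PriorityTier=100 PreemptMode=OFF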
r user root in place?
>
> sreport accounts for resources reserved for a user as well (even if not
> used by jobs), while sacct reports job accounting only.
>
> Best regards
> Jürgen
>
>
> * Jason Simms via slurm-users [240429
> 10:47]:
> > Hello all,
> >
> >
Hello all,
Each week, I generate an automated report of the top users by CPU hours.
This week, for whatever reason, the user root accounted for a massive
number of hours:
Login Proper Name Used A
As a related point, for this reason I mount /var/log separately from /. Ask
me how I learned that lesson...
Jason
On Tue, Apr 16, 2024 at 8:43 AM Jeffrey T Frey via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> is p
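A quick way to compare the two limits on a node (per-process limits are
normally raised in /etc/security/limits.conf or a slurmd systemd override,
and slurm.conf's PropagateResourceLimits controls what carries over from the
submit host):

  sysctl fs.file-max   # kernel-wide limit on open file handles
  ulimit -Sn           # per-process soft limit for the current shell
  ulimit -Hn           # per-process hard limit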
Hello Matthew,
You may be aware of this already, but most sites would perform these kinds of
checks/validations using job_submit.lua. I'm not an expert in that - though
plenty of others on this list are - but I'm positive you could implement
this type of validation logic. I'd like to say that I've co
Hello Thomas,
I know I'm a few days late to this, so I'm wondering whether you've made
any progress. We experience this, too, but in a different way.
First, though, you may already be aware of this, but you should use salloc
rather than srun --pty for an interactive session. That's been the preferred
method for
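A minimal example, assuming a reasonably recent Slurm with
LaunchParameters=use_interactive_step set in slurm.conf (the resource flags
below are only placeholders):

  salloc --nodes=1 --ntasks=1 --cpus-per-task=4 --time=01:00:00
  # with use_interactive_step, salloc drops you into a shell on the allocated node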
Hello Daniel,
In my experience, if you have a high-speed interconnect such as IB, you
would do IPoIB. You would likely still have a "regular" Ethernet connection
for management purposes, and yes, that means both an IB switch and an
Ethernet switch, but the Ethernet switch doesn't have to be anything specia
Hello all,
I've used the "scontrol write batch_script" command to output the job
submission script from completed jobs in the past, but for some reason, no
matter which job I specify, it tells me it is invalid. Any way to
troubleshoot this? Alternatively, is there another way - even if a manual
da
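For what it's worth, scontrol can only return scripts for jobs still held in
slurmctld's memory (completed jobs are purged after MinJobAge), which is a
common cause of the "invalid" error. A sketch of both approaches, with a
hypothetical job ID; the sacct route needs a newer release with
AccountingStoreFlags=job_script enabled:

  scontrol write batch_script 12345        # writes slurm-12345.sh
  sacct -j 12345 --batch-script            # pulls the script from slurmdbd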
Hello all,
At least at one point, I understood that it was not particularly possible,
or at least not elegant, to provide priority preempt access to a specific
GPU card. So, if a node has 4 GPUs, a researcher could preempt one or more
of them as needed.
Is this still the case? Or is there a reasona
Hello Michael,
I don't have an elegant solution, but I'm writing mostly to +1 this. I
didn't catch this in the release notes but am concerned if it is indeed the
new behavior. Researchers use scripts that rely on --cpus-per-task (or -c)
as part of, e.g., SBATCH directives. I suppose you could simp
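If this is the 22.05 change in which srun stopped inheriting --cpus-per-task
from the batch allocation, one workaround is to export SRUN_CPUS_PER_TASK
inside the script (the application name below is a placeholder):

  #SBATCH --cpus-per-task=8
  export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
  srun ./my_threaded_app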
Hello all,
Our template scripts for Slurm include a workflow that copies files to a
scratch space prior to running a job, then copies any output files, etc.,
back to the original submit directory on job completion, and finally cleans
up (deletes) the scratch space before exiting. This work
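A rough sketch of such a template (paths and the program name are
placeholders):

  #!/bin/bash
  #SBATCH --job-name=scratch-template
  SCRATCHDIR=/scratch/$USER/$SLURM_JOB_ID
  mkdir -p "$SCRATCHDIR"
  cp -r "$SLURM_SUBMIT_DIR"/input/ "$SCRATCHDIR"/     # stage in
  cd "$SCRATCHDIR" || exit 1
  srun ./my_program                                   # run from scratch
  cp -r output/ "$SLURM_SUBMIT_DIR"/                  # stage out
  cd "$SLURM_SUBMIT_DIR" && rm -rf "$SCRATCHDIR"      # clean up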
I personally don't think that we should assume users will always know which
partitions are available to them. Ideally, of course, they would, but I
think it's reasonable that users should be able to submit a list of
partitions they would be fine running their jobs on, and if one is
forbidden fo
Hello John,
I also am keen to follow your progress, as this is something we would find
extremely useful as well.
Regards,
Jason
On Fri, Sep 8, 2023 at 4:47 AM John Snowdon
wrote:
> I've been needing to do this as part of some analysis work we are
> undertaking to determine requirements for a r
a running job?
>
>
>
>
>
> On Thu, 6 Jul 2023, 18:16 Jason Simms, wrote:
>
>> An unfortunate example of the “with great power comes great
>> responsibility” maxim. Linux will gleefully let you rm -fr your entire
>> system, drop production databases, etc., p
r than the time the
>> job had already run, so it killed it immediately?
>>
>> On Jul 6, 2023, at 12:04 PM, Jason Simms wrote:
>>
>> No, not a bug, I would say. When the time limit is reached, that's it,
>> job dies. I wouldn't be aware of a way to ma
No, not a bug, I would say. When the time limit is reached, that's it; the
job dies. I'm not aware of a way to manage that. Once the time limit is
reached, it wouldn't be a hard limit if you then had to notify the user and
then... what? How long would you give them to extend the time? Wouldn't be
Hello Purvesh,
I'm not an expert in this, but I expect a common question would be: why do
you want to do this? More information would be helpful. On the surface,
it seems like you could just allocate two full nodes to each partition. You
must have a reason why that is unacceptable, however.
M
Hello Victoria,
Sorry to hear that remote attendance is not possible. Is it safe to assume,
however, that it will be archived and viewable after the event?
Warmest regards,
Jason
On Tue, May 30, 2023 at 3:00 PM Victoria Hobson
wrote:
> Hi Sean,
>
> That is correct. There will be no remote or h
Hello all,
A user received an email from Slurm that one of his jobs was preempted.
Normally when a job is preempted, the logs will show something like this:
[2023-03-30T08:19:16.535] [25538.batch] error: *** JOB 25538 ON node07
CANCELLED AT 2023-03-30T08:19:16 DUE TO PREEMPTION ***
[2023-03-30T08
Hello Ole and Hoot,
First, Hoot, thank you for your question. I've managed Slurm for a few
years now and still feel like I don't have a great understanding of
managing or limiting resources.
Ole, thanks for your continued support of the user community with your
documentation. I do wish not onl
So ensure:
>> 1) /opt/slurm/prolog.sh exists on the node(s)
>> 2) the slurmd user is able to execute it
>>
>> I would connect to the node and try to run the command as the slurmd user.
>> Also, ensure the user exists on the node, however you are propagating the
>&g
mmand as the slurmd user.
> Also, ensure the user exists on the node, however you are propagating the
> uids.
>
> Brian Andrus
>
> On 4/11/2023 9:48 AM, Jason Simms wrote:
>
> Hello all,
>
> Regularly I'm seeing array jobs fail, and the only log info from the
>
Hello all,
Regularly I'm seeing array jobs fail, and the only log info from the
compute node is this:
[2023-04-11T11:41:12.336] error: /opt/slurm/prolog.sh: exited with status
0x0100
[2023-04-11T11:41:12.336] error: [job 26090] prolog failed status=1:0
[2023-04-11T11:41:12.336] Job 26090 already
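Status 0x0100 is simply an exit code of 1 from the script, so a first step
is to confirm on the affected node that the prolog exists, is executable, and
runs cleanly as the user slurmd runs as (typically root; "slurm" below is
only a guess):

  ls -l /opt/slurm/prolog.sh
  sudo -u slurm /opt/slurm/prolog.sh; echo "exit status: $?"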
e
> will open a support case.
> And if time permits we will check if it can be triggered with a vanilla
> kernel.
>
> Regards,
> Hermann
>
> On 3/17/23 21:34, Jason Simms wrote:
> > Hello,
> >
> > This isn't precisely related, but I can say that we
Hello,
This isn't precisely related, but I can say that we were having strange
issues with system load spiking to the point that the nodes became
unresponsive and likewise needed a hard reboot. After several tests and
working with our vendor, on nodes where we entirely disabled swap, the
problems c
Hello all,
I haven't found any guidance that seems to be the current "better
practice," but this does seem to be a common use case. I imagine there are
multiple ways to accomplish this goal. For example, you could assuredly do
it with QoS, but you can likely also accomplish this with some other
we
Oh hey this is fun, thanks for sharing. I hadn't seen this, but it works as
advertised.
Jason
On Thu, Nov 3, 2022 at 12:31 AM Christopher Samuel
wrote:
> On 11/2/22 4:45 pm, Juergen Salk wrote:
>
> > However, instead of using `srun --pty bash´ for launching interactive
> jobs, it
> > is now rec
The oversight is perhaps understandable, since for most software, a given
version XX.YY would be major version XX, minor version YY. But with Slurm,
the major version is XX.YY, and the minor version is XX.YY.zz.
On Thu, Sep 8, 2022 at 2:43 PM Ole Holm Nielsen
wrote:
> Paul is right! You may upgrade 18.08 to 2
Hello all,
Slightly OT, but I'm hoping the hive mind here can share some advice.
We have a GPU node with three RTX8000 GPUs installed. The node has a
capacity of 8 cards in total. I have a researcher who possibly wants to add
an A100. I recall asking our vendor a while back whether it's possible
e a NodeName line.
> "scontrol reconfigure" doesn't do the truck.
>
> On Mon, Jul 26, 2021 at 12:49 PM Fulcomer, Samuel <
> samuel_fulco...@brown.edu> wrote:
>
>> If you have a dual-root PCIe system you may need to specify the CPU/core
>> affinity in gres
Hello all,
I have a GPU node with 3 identical GPUs (we started with two and recently
added the third). Running nvidia-smi correctly shows that all three are
recognized. My gres.conf file has only this line:
NodeName=gpu01 File=/dev/nvidia[0-2] Type=quadro_8000 Name=gpu Count=3
And the relevant l
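The matching slurm.conf side would normally look something like the sketch
below (the rest of the NodeName line and the other settings are omitted):

  GresTypes=gpu
  NodeName=gpu01 Gres=gpu:quadro_8000:3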
Dear all,
I feel like I've attempted to track this down before but have never fully
understood how to accomplish this.
I have a GPU node with three GPU cards, one of which was purchased by a
user. I want to provide that user with priority access to the card, while
still allowing it to be used by t
still want to make use of the
> cluster. Let's keep the discussion on how to get slurm to do it, if that's
> possible.
>
> On Fri, Jun 4, 2021 at 11:13 AM Jason Simms wrote:
>
>> Unpopular opinion: remove the failing GPU.
>>
>> JLS
>>
>> On Fri,
Unpopular opinion: remove the failing GPU.
JLS
On Fri, Jun 4, 2021 at 2:07 PM Ahmad Khalifa wrote:
> Because there are failing GPUs that I'm trying to avoid.
>
> On Fri, Jun 4, 2021 at 5:04 AM Stephan Roth
> wrote:
>
>> On 03.06.21 07:11, Ahmad Khalifa wrote:
>> > How to send a job to a partic
Hello all,
As usual, I have a super basic question, so thank you for your patience. I
want to verify the correct syntax to configure a GPU for priority preempt
access via a QOS, much like we are currently doing for a specified number
of cores. When I have created a QOS in the past, I've so far onl
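One pattern, sketched here with placeholder names, is a QOS that caps the
owner at a GPU count (not a specific physical card) and is allowed to
preempt the normal QOS; QOS-based preemption assumes PreemptType=preempt/qos
in slurm.conf:

  sacctmgr add qos gpu_owner
  sacctmgr modify qos gpu_owner set GrpTRES=gres/gpu=1 Priority=1000 Preempt=normal
  sacctmgr modify user someuser set qos+=gpu_owner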
Snakemake to manage
> complex dependencies in workflows in other contexts. Snakemake should
> support slurm.
>
> HTH,
> Jan
>
>
> On 02-03-2021 20:16, Jason Simms wrote:
> > Hello all,
> >
> > I am relatively new to the nuances of handling complex dependencies in
Hello all,
I am relatively new to the nuances of handling complex dependencies in
Slurm, so I'm hoping the hive mind can help. I have a user wanting to
accomplish the following:
- submit one job
- submit multiple jobs that are dependent on the output from the first
job (so they just need
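The usual pattern is to capture the first job's ID with --parsable and chain
the rest with --dependency (script names here are placeholders):

  first=$(sbatch --parsable first_step.sh)
  sbatch --dependency=afterok:$first analysis_a.sh
  sbatch --dependency=afterok:$first analysis_b.sh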
We’re in the same boat. Extremely small cluster. $10k for support. We don’t
need nearly that level of engagement, but there ya go. We’ve passed for
now, but I’d like to have a support contract ideally.
Jason
On Tue, Jan 26, 2021 at 2:49 PM Robert Kudyba wrote:
>
>
> On Mon, Jan 25, 2021 at 6:36
Dear all,
I have two users on our cluster who "bought into" it, much like a condo
model, by purchasing a single physical node each. For those users, I have
attempted to configure two QOS levels, such that when they submit jobs and
invoke the QOS, they will have preempt, priority access to resour
Hello all,
Thanks to several helpful members on this list, I think I have a much
better handle on how to upgrade Slurm. Now my question is, do most of you
upgrade with each major release?
I recognize that, normally, if something is working well, then don't
upgrade it! In our case, we're running 2
Building RPMs is described in this page as well:
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms
>
> I hope this helps.
>
> /Ole
>
>
> On 04-12-2020 20:36, Jason Simms wrote:
> > Thank you for being such a helpful resource for All Things Slurm; I
> > si
Hello all,
Thank you for being such a helpful resource for All Things Slurm; I
sincerely appreciate the feedback. Right now, we are running 20.02
and considering upgrading to 20.11 during our next maintenance window in
January. This will be the first time we have upgraded Slurm, so
underst
it doesn't for everyone, and
I can't figure out why.
Warmest regards,
Jason
On Wed, Nov 18, 2020 at 12:09 PM Peter Kjellström wrote:
> On Wed, 18 Nov 2020 09:15:59 -0500
> Jason Simms wrote:
>
> > Dear Diego,
> >
> > A while back, I attempted to make some e
Dear Diego,
A while back, I attempted to make some edits locally to see whether I could
produce "better" results. Here is a comparison of the output of your latest
version, and then mine:
[root@hpc bin]# seff 24567
Use of uninitialized value $hash{"2"} in division (/) at /bin/seff line
108, line
Hello all,
I am going to reveal the degree of my inexperience here, but am I perhaps
the only one who thinks that Slurm's upgrade procedure is too complex? Or,
at least, maybe not explained in enough detail?
I'm running a CentOS 8 cluster, and it seems to me that I should be able
simply to update the Slurm pac
Hello David,
I'm still relatively new at Slurm, but one way we handle this is that for
users/groups who have "bought in" to the cluster, we use a QOS to provide
them preemptible access to the set of resources provided by, e.g., a set
number of nodes, but not the nodes themselves. That is, in one e
FWIW, I define the DefaultTime as 5 minutes, which effectively means that
for any "real" job, users must actually define a time limit. It helps users
get into that habit, because in the absence of a DefaultTime, most will not
even bother to think critically and carefully about what time limit is
actually
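For reference, that is a single partition-level setting in slurm.conf
(partition and node names here are placeholders):

  PartitionName=normal Nodes=node[01-20] DefaultTime=00:05:00 MaxTime=7-00:00:00 State=UP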
Hello all,
I've found that when I run seff, it fails to report calculated values, e.g.:
Nodes: 1
Cores per node: 20
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 1-11:49:40 core-walltime
Job Wall-clock time: 01:47:29
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 180.0
Hello all,
I have a couple of users, each of whom has contributed funds to purchase a
node for the cluster, much like a condo system. Each node has 52 cores, so
I'd like to provide each user with preempt access for up to 52 cores. I can
configure that easily enough with a QOS for each user with Gr
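Something along those lines, sketched with placeholder names (QOS-based
preemption assumes PreemptType=preempt/qos in slurm.conf):

  sacctmgr add qos condo_user1
  sacctmgr modify qos condo_user1 set GrpTRES=cpu=52 Priority=1000 Preempt=normal
  sacctmgr modify user user1 set qos+=condo_user1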
Hello everyone! We have a script that queries our LDAP server for any users
that have an entitlement to use the cluster, and if they don't already have
an account on the cluster, one is created for them. In addition, they need
to be added to the Slurm database (in order to track usage, FairShare,
e
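The Slurm side of that script usually boils down to one sacctmgr call per
new user (the user and account names are placeholders; -i skips the
confirmation prompt):

  sacctmgr -i add user name=jdoe account=research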
Hello all,
Later this month, I will have to bring down, patch, and reboot all nodes in
our cluster for maintenance. The two options available to set nodes into a
maintenance mode seem to be either: 1) creating a system-wide reservation,
or 2) setting all nodes into a DRAIN state.
I'm not sure it
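Both are one-liners with scontrol; a maintenance reservation is sketched
first (times, names, and node lists are placeholders, and
flags=maint,ignore_jobs lets it overlap running jobs), then the drain
alternative:

  scontrol create reservation reservationname=maint_window users=root \
      starttime=2020-08-24T08:00:00 duration=08:00:00 nodes=ALL flags=maint,ignore_jobs
  scontrol update nodename=node[01-20] state=DRAIN reason="maintenance"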
The only value currently supported is 0 (zero). This
> is a settable specification only - it cannot be used as a filter to list
> accounts.
>
> See:
>
> https://slurm.schedmd.com/sacctmgr.html
>
> -Paul Edmon-
> On 7/27/2020 2:17 PM, Jason Simms wrote:
>
> Dear
Dear all,
Apologies for the basic question. I've looked around online for an answer
to this, and I haven't found anything that has helped accomplish exactly
what I want. That said, it is also probable that what I am asking isn't a
best practice, or isn't actually necessary, etc. I'd welcome any ad
f Melbourne, Victoria 3010 Australia
>
>
>
> On Wed, 8 Jul 2020 at 01:14, Jason Simms wrote:
>
>> *UoM notice: External email. Be cautious of links, attachments, or
>> impersonation attempts.*
>> --
>> Hello all,
>>
>&
Hello all,
Two users on my system experience job failures every time they submit a job
via sbatch. When I run their exact submission script, or when I create a
local system user and launch from there, the jobs run fine. Here is an
example of what I see in the slurmd log:
[2020-07-06T15:02:41.284]