Hi -
Very delayed response to this, as I'm working my way through a backlog
of slurm-user posts. If this error is intermittent, it's likely a
hardware issue. Recently I ran into a problem where a host with 8 GPUs
was spontaneously rebooting a couple of minutes after a user would start
an 8 GPU job that requested the GPUs as '--gres=gpu:a100:X'.
Tina
On 24/08/2023 23:17, Patrick Goetz wrote:
Hi Mick -
Thanks for these suggestions. I read over both release notes, but
didn't find anything helpful.
Note that I didn't include gres.conf in my original post. That would
be this:
|Name| in your |gres.conf|?
Kind regards
--
Mick Timony
Senior DevOps Engineer
Harvard Medical School
--
--------
*From:* slurm-users on behalf of
Patrick Goetz
*Sent:* Thursday, August 24, 2023 11:27 AM
*To:* Slurm User Community List
...the node definition to whatever slurmd -C says, or set
config_overrides in slurm.conf.
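For example (rough sketch; the hostname and numbers below are made up,
yours will differ), run slurmd -C on the node and copy its output into
slurm.conf:

    # On the compute node: print the hardware as slurmd sees it
    $ slurmd -C
    NodeName=node01 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64402

    # Use that NodeName line (minus the UpTime line it also prints) as the
    # node definition in slurm.conf, or, if your Slurm version supports it,
    # relax the strict check with:
    SlurmdParameters=config_overrides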
Rob
*From:* slurm-users on behalf of
Patrick Goetz
*Sent:* Thursday, August 24, 2023 11:27 AM
*To:* Slurm User Community List
*Subject:* [slurm-us
Master/Nodes: Ubuntu 20.04, Slurm 19.05.5 (as packaged by Debian)
This is an upgrade from a working Ubuntu 18.04/Slurm 17.x system where I
re-used the original slurm.conf (fearing this might cause issues). The
hardware is the same. The Master and nodes all use the same slurm.conf,
gres.conf.
Hi -
This was a known bug: https://bugs.schedmd.com/show_bug.cgi?id=3941
However, the bug report says this was fixed in version 17.02.7.
The problem is we're running version 17.11.2, but we still appear to be
hitting this bug:
[2023-04-18T17:09:42.482] _slurm_rpc_kill_job: REQUEST_KILL_JOB
Reading the documentation is just making me more confused; maybe this
has to do with version changes. My current Slurm cluster is running
version 17.x.
Looking at the man page for gres.conf
(https://slurm.schedmd.com/gres.conf.html) I see this:
NOTE: Slurm support for gres/[mps|shard] requ
I'm working on an inherited Slurm cluster, and was reading through the
Slurm documentation when I found this in the Easy Configurator section
(https://slurm.schedmd.com/configurator.easy.html)
- cons_tres: Allocate individual processors, memory, GPUs, and other
trackable resources
- Cons_re
On 5/25/21 11:07 AM, Loris Bennett wrote:
PS Am I wrong to be surprised that this is something one needs to roll
oneself? It seems to me that most clusters would want to implement
something similar. Is that incorrect? If not, are people doing
something else? Or did some vendor setting things
Could this be a function of the R script you're trying to run, or are
you saying you get this error running the same script that works at
other times?
On 3/29/21 7:47 AM, Simon Andrews wrote:
I've got a weird problem on our slurm cluster. If I submit lots of R
jobs to the queue then as soon
That sounds like a Linux issue. You probably need to raise the limit on
open file descriptors somewhere.
Maybe start here:
https://rtcamp.com/tutorials/linux/increase-open-files-limit/
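Untested sketch (65536 is just an example value), but on a systemd box
it usually comes down to something like:

    # Check the current limits for your shell and for the slurmd service
    ulimit -n
    systemctl show slurmd -p LimitNOFILE

    # Raise the per-user limit in /etc/security/limits.conf
    *    soft    nofile    65536
    *    hard    nofile    65536

    # Or raise it for the slurmd service itself with a drop-in, e.g.
    # /etc/systemd/system/slurmd.service.d/override.conf:
    [Service]
    LimitNOFILE=65536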
On 2/2/21 11:50 AM, Prentice Bisbal wrote:
Has anyone seen this error message before? A user just reported it
This bug report appears to address the issue you're seeing:
https://bugs.schedmd.com/show_bug.cgi?id=5868
On 2/24/20 4:46 AM, Pär Lundö wrote:
Dear all,
I started testing and evaluating Slurm roughly a year ago and used it
successfully with MPI programs.
I have now identified that I need
On 9/8/18 5:11 AM, John Hearns wrote:
Not an answer to your question - a good diagnostic for cgroups is the
utility 'lscgroups'
Where does one find this utility?
On 08/22/2018 10:58 AM, Kilian Cavalotti wrote:
My guess is that you're experiencing first-hand the awesomeness of systemd.
Yes, systemd uses cgroups. I'm trying to understand whether Slurm's use
of cgroups is incompatible with systemd, or whether there is another way
to resolve this issue.
Look
On 05/25/2018 11:19 AM, Will Dennis wrote:
Not yet time for us... There are problems with U18.04 that render it unusable for
our environment.
What problems have you run in to with 18.04?
Does your SMS have a dedicated interface for node traffic?
On 05/16/2018 04:00 PM, Sean Caron wrote:
I see some chatter on 6818/TCP from the compute node to the SLURM
controller, and from the SLURM controller to the compute node.
The policy is to permit all packets inbound from the SLURM controller.
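For reference, with the default ports (check SlurmctldPort/SlurmdPort in
your slurm.conf; the addresses below are placeholders) the iptables rules
would look roughly like:

    # On the controller: allow slurmd -> slurmctld traffic (default port 6817)
    iptables -A INPUT -p tcp -s <compute-subnet> --dport 6817 -j ACCEPT

    # On each compute node: allow slurmctld -> slurmd traffic (default port 6818)
    iptables -A INPUT -p tcp -s <controller-ip> --dport 6818 -j ACCEPT

Note that srun also opens ephemeral ports back to the submitting host;
SrunPortRange in slurm.conf is what you'd use to pin those down.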
On 05/09/2018 04:14 PM, Nathan Harper wrote:
Yep, exactly the same issue. Our dirty workaround is to ssh -X back into the
same host and it will work.
Hi -
Since I'm having this problem too, can you elaborate? You're ssh -X ing
into a machine and then ssh -X ing back to the original host?
I concur with this. Make sure your nodes are in the /etc/hosts file on
the SMS. Also, if you name them by base + numerical sequence, you can
configure them with a single line in Slurm (using the example below):
NodeName=radonc[01-04] CPUs=32 RealMemory=64402 Sockets=2
CoresPerSocket=8 ThreadsPerCore=2
Why wouldn't slurm.conf just go into /etc/slurm?
On 05/03/2018 10:33 AM, Raymond Wan wrote:
Hi Eric,
On Thu, May 3, 2018 at 11:21 PM, Eric F. Alemany wrote:
I will follow your advice. It doesn't hurt to try, right?
Thank you for your quick reply
No, it doesn't hurt to try. If this was
I don't think the problem Chris is referring to (a SQL injection attack)
is going to apply to you because you're way too small to need to worry
about Slurm accounting, but if it is a concern, install the distro
packages; confirm that things are roughly working and then just take
note of how thi
Hi Chris -
He has 4 nodes and one master. I'm pretty sure he's not going to be
using slurmdbd. Of course, it's something to keep in mind if things work
out so well that his organization is commanding him to order an
additional thousand nodes in 6 months.
On 04/25/2018 07:03 PM, Christopher Samuel wrote:
Hi Eric -
Did you follow my suggestion (on 18.04, mind you; the packages on
16.04 are too old) to:
- Install the slurmctld package on the SMS (the master)
- Install the slurmd package on the nodes?
You'll still need to do some configuration, but my guess is this will
pull in the necessary dependencies.
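On 18.04 that should just be the distro packages, something like
(package names from memory, so double-check with apt search slurm):

    # On the SMS / master:
    sudo apt install slurmctld

    # On each compute node:
    sudo apt install slurmd

    # Client commands (sbatch, squeue, sinfo, ...) if they aren't pulled in already:
    sudo apt install slurm-client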
On 04/11/2018 02:35 PM, Sean Caron wrote:
As a protest to asking questions on this list and getting solicitations
for pay-for support, let me give you some advice for free :)
Now, now. Paid support is how they keep the project going. You like
using Slurm, right?
I've been using Slurm on a traditional CPU compute cluster, but am now
looking at a somewhat different issue. We recently purchased a single
machine with 10 high-end graphics cards to be used for CUDA calculations
and which will be shared among a couple of different user groups.
Does it make sense
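Roughly what I'm picturing is GRES scheduling on that one node; a sketch
(the hostname, CPU count, and memory below are made up):

    # gres.conf on the GPU node
    Name=gpu File=/dev/nvidia[0-9]

    # slurm.conf
    GresTypes=gpu
    NodeName=gpunode01 Gres=gpu:10 CPUs=40 RealMemory=256000 State=UNKNOWN

so each group would just request GPUs per job, e.g. sbatch --gres=gpu:2
job.sh, and Slurm would set CUDA_VISIBLE_DEVICES so jobs don't step on
each other's devices.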
I forgot to add that you will need to reload the daemon after doing this
(and systemd will probably prompt you to do so).
On 03/22/2018 08:10 AM, Patrick Goetz wrote:
Or even better, don't think about it. If you type
sudo systemctl edit slurmd
this will open an editor. Type your ch
Or even better, don't think about it. If you type
sudo systemctl edit slurmd
this will open an editor. Type your changes into it, save, and
systemd will set up the snippet file for you automatically (in
/etc/systemd/system/slurmd.service.d/).
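For example, if the change you wanted was to raise the memlock limit for
slurmd (just an illustration; substitute whatever setting you were
actually after), the snippet ends up as:

    # /etc/systemd/system/slurmd.service.d/override.conf
    [Service]
    LimitMEMLOCK=infinity

followed by a 'systemctl daemon-reload' and a restart of slurmd.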
On 03/21/2018 02:14 PM, Ole Holm Nielsen wrote:
On 02/22/2018 07:50 AM, Christopher Benjamin Coffey wrote:
It's a big deal if folks use -n when it's not an MPI program. This is because
the non-MPI program is launched n times (instead of once with internal threads)
and will stomp over logs and output files (uncoordinated), leading to poor
performance.
The simple solution is to tell people not to do this -- that's what I
do. And if that doesn't work, threaten to kick them off the system.
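To make the difference concrete (hypothetical program name):

    # Launches ./myprog 8 times -- only what you want for MPI programs:
    srun -n 8 ./myprog

    # Launches ./myprog once, with 8 CPUs for its internal threads:
    srun -n 1 -c 8 ./myprog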
On 02/15/2018 09:11 AM, Manuel Rodríguez Pascual wrote:
Hi all,
Although this is not strictly related to Slurm, maybe you can recommend
some actions to d
What is TRES?
On 02/06/2018 06:03 AM, Christopher Samuel wrote:
On 06/02/18 21:40, Matteo F wrote:
I've tried to limit the number of running jobs using QOS ->
MaxJobsPerAccount, but this wouldn't stop a user from just filling up the
cluster with fewer (but bigger) jobs.
You probably want to look
On 01/31/2018 03:52 AM, Christopher Samuel wrote:
Short version, add this to slurm.conf:
PropagateResourceLimits NONE
I'm surprised that this isn't the default setting?
On newer systemd-based systems you can just use timedatectl -- I find
this does everything I need it to do. Although I think on RHEL/CentOS
systems timedatectl is just set up to start chrony, or something like that.
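e.g. (standard commands, nothing exotic):

    timedatectl status                        # show clock and NTP sync state
    timedatectl set-ntp true                  # enable NTP synchronization
    timedatectl set-timezone America/Chicago  # example timezone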
On 01/14/2018 08:11 PM, Lachlan Musicman wrote:
Hi all,
As part of both Munge and S
On 01/17/2018 08:12 AM, Ole Holm Nielsen wrote:
John: I would refrain from installing the old default package
"environment-modules" from the Linux distribution, since it doesn't seem
to be maintained any more.
Lmod, on the other hand, is actively maintained and solves some problems
with the o