Hi Steven,
yes, you have the syntax a bit wrong. If you consult the documentation
(or the man-page) of slurm.conf you find this in the "NODE
CONFIGURATION" section (in the paragraph about "NodeName"):
Note that if the short form of the hostname is not used, it may prevent
use of hostlist ex
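For illustration (node names and values below are made up): with short
hostnames a whole range can be written as one hostlist expression, e.g.

NodeName=node[01-10] CPUs=64 RealMemory=250000

whereas with fully qualified names such as node01.example.com the numeric
part is no longer at the end of the string, so the names cannot be
compressed into such an expression.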
Hi Paul,
On 8/9/24 18:45, Paul Edmon via slurm-users wrote:
As I recall, OpenMPI needs a list that has one entry on each line,
rather than entries separated by spaces. See:
[root@holy7c26401 ~]# echo $SLURM_JOB_NODELIST
holy7c[26401-26405]
[root@holy7c26401 ~]# scontrol show hostnames $SLUR
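A sketch of how that per-line list is typically passed on to OpenMPI
(file name and program are made up):

scontrol show hostnames $SLURM_JOB_NODELIST > hosts.txt
mpirun --hostfile hosts.txt ./my_mpi_app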
Dear Xaver,
we have a similar setup and yes, we have set the node to "state=DRAIN".
Slurm keeps it this way until you manually change it to e.g. "state=RESUME".
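For reference, draining and resuming a node by hand looks roughly like
this (the node name is made up):

scontrol update NodeName=node01 State=DRAIN Reason="maintenance"
scontrol update NodeName=node01 State=RESUME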
Regards,
Hermann
On 6/24/24 13:54, Xaver Stiensmeier via slurm-users wrote:
Dear Slurm users,
in our project we exclude the master f
Hi Michael,
if you submit a job-array, all resource-related options (number of
nodes, tasks, cpus per task, memory, time, ...) are meant *per array-task*.
So in your case you start 100 array-tasks (you could also call them
"sub-jobs"), *each* of which (not your job as a whole) is limited to one node, on
d65-09f0-9897ba057a0c
GPU-2d971e69-8147-8221-a055-e26573950f91
GPU-22ee3c89-fed1-891f-96bb-6bbf27a2cc4b
0,1,2,3
0,1,2,3
Task completed.
Whereas for the command echo $CUDA_VISIBLE_DEVICES I should get:
0,1,2,3
0,1,2,3,4,5,6,7
Is this for the same reason that I had problems with hostname?
Thank you,
Mihai
18ae1af5d
GPU-dfec21c4-e30d-5a36-599d-eef2fd354809
GPU-15a11fe2-33f2-cd65-09f0-9897ba057a0c
GPU-2d971e69-8147-8221-a055-e26573950f91
GPU-22ee3c89-fed1-891f-96bb-6bbf27a2cc4b
Job finished at: Tue May 28 13:03:20 EEST 2024
...I'm not interested in the output of the other 'echo' co
Hi Mihai,
this is not a Slurm-related problem. It's rather about:
"when does command substitution happen?"
When you write
srun echo Running on host: $(hostname)
$(hostname) is replaced by the output of the hostname-command *before*
the line is "submitted" to srun. Which means that s
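One way around this, assuming the goal is to print the compute node's
hostname, is to defer the substitution to a shell running on the node:

srun bash -c 'echo Running on host: $(hostname)'

With the single quotes, $(hostname) is only expanded on the node the
step actually runs on.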
Hi everybody,
On 5/26/24 08:40, Ole Holm Nielsen via slurm-users wrote:
[...]
Whether or not to enable Hyper-Threading (HT) on your compute nodes
depends entirely on the properties of applications that you wish to run
on the nodes. Some applications are faster without HT, others are
faster wit
Hi Zhao,
my guess is that in your faster case you are using hyperthreading
whereas in the Slurm case you are not.
Can you check what performance you get when you add
#SBATCH --hint=multithread
to your Slurm script?
Another difference between the two might be
a) the communication channel/interf
Hi Dj,
this could be a problem related to memory limits. What is the output of
ulimit -l -m -v -s
in both interactive job-shells?
You are using cgroups-v1 now, right?
In that case what is the respective content of
/sys/fs/cgroup/memory/slurm_*/uid_$(id -u)/job_*/memory.limit_in_bytes
in both shell
Hi Dietmar,
what do you find in the output-file of this job
sbatch --time 5 --cpus-per-task=1 --wrap 'grep Cpus /proc/$$/status'
On our 64 cores machines with enabled hyperthreading I see e.g.
Cpus_allowed: 04000000,00000000,04000000,00000000
Cpus_allowed_list: 58,122
Greetings
Hermann
Hi Christine,
yes, you can either set the environment variable SLURM_CONF to the full
path of the configuration-file you want to use and then run any program.
Or you can do it like this
SLURM_CONF=/your/path/to/slurm.conf sinfo|sbatch|srun|...
But I am not quite sure if this is really the be
ave one - just for testing purposes. Could this be the issue?
Best regards,
Xaver Stiensmeier
On 17.07.23 14:11, Hermann Schwärzler wrote:
> Hi Xaver,
>
> what kind of SelectType are you using in your slurm.conf?
>
> Per
https://slurm.schedmd.com/gres.html
Hi Xaver,
what kind of SelectType are you using in your slurm.conf?
Per https://slurm.schedmd.com/gres.html you have to consider:
"As for the --gpu* option, these options are only supported by Slurm's
select/cons_tres plugin."
So you can use "--gpus ..." only when you state
SelectType
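In slurm.conf that would read, for example (the parameter value is just
an assumption):

SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory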
/fs/cgroup type cgroup2
(rw,nosuid,nodev,noexec,relatime,nsdelegate)
Distribution and kernel
RedHat 8.7
4.18.0-348.2.1.el8_5.x86_64
-----Original Message-----
From: slurm-users On Behalf Of Hermann
Schwärzler
Sent: Wednesday, July 12, 2023 4:36 AM
To: slurm-users@lists.schedmd.com
Hi Jenny,
I *guess* you have a system that has both cgroup/v1 and cgroup/v2 enabled.
Which Linux distribution are you using? And which kernel version?
What is the output of
mount | grep cgroup
What if you do not restrict the cgroup-version Slurm can use to
cgroup/v2 but omit "CgroupPlugin=...
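A quick, generic way to check which cgroup version a node is on:

stat -fc %T /sys/fs/cgroup

cgroup2fs means a pure cgroup/v2 setup, while tmpfs usually indicates
cgroup/v1 or a hybrid mount.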
Hi everybody,
I would like to give you a quick update on this problem (systems hanging
when swapping happens due to cgroup memory limits):
We had opened a case with RedHat's customer support. After some to and
fro they could reproduce the problem. Last week they told us to upgrade
to ve
Hi Ángel,
which version of cgroups does Ubuntu 22.04 use?
What is the output of
mount | grep cgroup
on your system?
Regards,
Hermann
On 4/21/23 14:33, Angel de Vicente wrote:
Hello,
I've installed Slurm in a workstation (this is a single-node install)
with Ubuntu 22.04, and have installed Sl
Hi Marcus,
I am not sure if this is helpful but from looking at the source code of
Slurm (line 276 of src/slurmd/slurmstepd/ulimits.c in version 22.05) it
looks like you are explicitly using
"--propagate..."
to set resource limits (the ones you see when running
"ulimit -a") on the workers the s
Hi everybody,
in our new cluster we have configured Slurm with
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
which I think is quite a usual setup.
After installing Intel MPI (using Spack v0.19) we saw that th
ediate consequences.
Warmest regards,
Jason
On Thu, Mar 16, 2023 at 10:59 AM Hermann Schwärzler
<mailto:hermann.schwaerz...@uibk.ac.at>
wrote:
Dear Slurm users,
after opening our new cluster (62 nodes - 250 GB RAM, 64 cores each -
Rocky Linux 8.6 - Kernel 4.18.0-372.16.1.el8_6.
Dear Slurm users,
after opening our new cluster (62 nodes - 250 GB RAM, 64 cores each -
Rocky Linux 8.6 - Kernel 4.18.0-372.16.1.el8_6.0.1 - Slurm 22.05) for
"friendly user" test operation about 6 weeks ago we were soon facing
serious problems with nodes that suddenly become unresponsive (so m
omplete nonsense, please let me know!
Best wishes,
Sebastian
On 11.02.23 11:13, Hermann Schwärzler wrote:
Hi Sebastian,
we did a similar thing just recently.
We changed our node settings from
NodeName=DEFAULT CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=32
ThreadsPerCore=2
to
NodeName=DE
Hi Sebastian,
we did a similar thing just recently.
We changed our node settings from
NodeName=DEFAULT CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=32
ThreadsPerCore=2
to
NodeName=DEFAULT Boards=1 SocketsPerBoard=2 CoresPerSocket=32
ThreadsPerCore=2
in order to make use of individua
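For context, per the slurm.conf documentation: if CPUs is omitted, its
default is the product of Boards, Sockets, Cores and Threads, i.e. 128
here instead of 64, which is what allows scheduling individual
hyperthreads. A job could then request a single hardware thread, e.g.
(just an illustration):

srun --cpus-per-task=1 --hint=multithread ./my_program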
therwise cause the
node to drain.
Maybe this helps.
Kind regards
Sebastian
PS: goslmailer looks quite nice with its recommendations! Will definitely look
into it.
--
Westfälische Wilhelms-Universität (WWU) Münster
WWU IT
Sebastian Potthoff (eScience / HPC)
On 15.09.2022 at 18:07, the following was written by
Hi Ole,
On 9/15/22 5:21 PM, Ole Holm Nielsen wrote:
On 15-09-2022 16:08, Hermann Schwärzler wrote:
Just out of curiosity: how do you insert the output of seff into the
out-file of a job?
Use the "smail" tool from the slurm-contribs RPM and set this in
slurm.conf:
MailProg=/usr
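seff itself can also be run by hand once a job has finished, e.g. (the
job id is made up; seff is part of the same slurm-contribs RPM):

seff 1234567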
Hi Loris,
we try to achieve the same (I guess) - which is nudging the users in the
direction of using scarce resources carefully - by using goslmailer
(https://github.com/CLIP-HPC/goslmailer) and a (not yet published - see
https://github.com/CLIP-HPC/goslmailer/issues/20) custom connector to
Hi Eg.
if you are using cgroups (as you do if I read your other post correctly)
these two lines in your cgroup.conf should do the trick:
ConstrainSwapSpace=yes
AllowedSwapSpace=0
Regards,
Hermann
PS: BTW we are planning to *not* use this setting as right now we are
looking into allowing job
Hi Purvesh,
which version of Slurm are you using?
In which OS-environment?
The epilog script is run *on every node when a user's job completes*.
So:
* Have you copied your epilog script to all of your nodes?
* Did you look at /tmp/ on nodes where a job ran recently to see if there is
any output o
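For illustration, a minimal epilog sketch (path and log file are
assumptions):

in slurm.conf:
Epilog=/etc/slurm/epilog.sh

/etc/slurm/epilog.sh (executable, present on every compute node):
#!/bin/bash
echo "$(date) job $SLURM_JOB_ID finished on $(hostname)" >> /tmp/slurm_epilog.log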
Hi Petar,
thanks for letting us know!
We will definitely look into this and will get back to you on GitHub
when technical questions/problems arise.
Just one quick question: we are neither using Telegram nor MS-Teams
here, but Matrix. In case we would like to deliver messages through
that: wh
Hi GHui,
fyi: I am not a podman-expert so my questions might be stupid. :-)
From what you told us so far you are running the podman-command as
non-root but you are root inside the container, right?
What is the output of "podman info | grep root" in your case?
How are you submitting a job fro
Hi GHui,
I have a few questions regarding your mail:
* What kind of container are you using?
* How exactly do you switch to a different user inside the container?
Regards,
Hermann
On 5/16/22 7:53 AM, GHui wrote:
I found a serious problem. If I run a container as a common user, e.g. tom. In
c
Hi Bjørn-Helge,
hi everyone,
ok, I see. I also just re-read the documentation to find this in the
description of the "CPUs" option:
"This can be useful when you want to schedule only the cores on a
hyper-threaded node. If CPUs is omitted, its default will be set equal
to the product of Boards
Hi Durai,
I see the same thing as you on our test-cluster that has
ThreadsPerCore=2
configured in slurm.conf
The double-foo goes away with this:
srun --cpus-per-task=1 --hint=nomultithread echo foo
Having multithreading enabled leads, imho, to surprising behaviour in
Slurm. My impression is that
Hi everybody,
to force a run of your config management, as Tina suggested, you might
just add an
ExecStartPre=
line to your slurmd.service file?
This is somewhat unrelated to your problem but we are very successfully
using
ExecStartPre=-/usr/bin/nvidia-smi -L
in our slurmd.service file t
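If you would rather not touch the unit file itself, the same line can go
into a systemd drop-in (the file name is an assumption):

/etc/systemd/system/slurmd.service.d/override.conf:
[Service]
ExecStartPre=-/usr/bin/nvidia-smi -L

followed by "systemctl daemon-reload" and a restart of slurmd.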
Dear Nousheen,
I guess there is something missing in your installation - probably your
slurm.conf?
Do you have logging enabled for slurmctld? If yes what do you see in
that log?
Or what do you get if you run slurmctld manually like this:
/usr/local/sbin/slurmctld -D
Regards,
Hermann
On 1/3
Hi Adrian,
ConstrainRAMSpace=yes
has the effect that when the memory the job requested is exhausted the
processes of the job will start paging/swapping.
If you want to stop jobs that use more memory (RSS to be precise) than
they requested, you have to add this to your cgroup.conf:
Constrai
Hi Michał,
hi everyone,
we have similar issues looming on the horizon (sensitive medical
and human genetic data). :-)
We are currently looking into telling our users to use EncFS
(https://en.wikipedia.org/wiki/EncFS) for this. As it is a filesystem in
user-space unprivileged users can
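Basic EncFS usage from an unprivileged account looks roughly like this
(directory names are made up):

encfs ~/encrypted_raw ~/cleartext_view
fusermount -u ~/cleartext_view

The first command mounts (and on first use creates) the encrypted
directory; the second unmounts it again.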
Hi Rodrigo,
a possible solution is using
VSizeFactor=100
in slurm.conf.
With this setting, programs that try to allocate more memory than
was requested for the job will fail.
Be aware that this puts a limit on *virtual* memory, not on RSS. This
might or might not be what you want as