On 2025/02/20 21:55, Daniel Letai via slurm-users wrote:
...
Adding AccountingStorageBackupHost pointing to the other node is of course
possible, but will mean different slurm.conf files, which Slurm will complain
about.
Just thought to note that, in general, it is useful to be aware
that one w
On 2024/12/05 05:37, Daniel Miliate via slurm-users wrote:
I'm trying to send myself an email notification with a custom body/subject
for notification of important events. Is this possible? I've tried a few
things in the script below.
The bits you may have missed:
The emails that Slurm sends o
On 2024/08/19 15:11, Ward Poelmans via slurm-users wrote:
Have a look if you can spot them in:
function slurm_cli_pre_submit(options, pack_offset)
    env_json = slurm.json_env()
    slurm.log_info("ENV: %s", env_json)
    opt_json = slurm.json_cli_options(options)
    slurm.log_info("OPTIONS: %s", opt_json)
If I supply a
--constraint=
option to an sbatch/salloc/srun, does the arg appear inside
any object that a Lua CLI Filter could access?
I've tried this basic check
if is_unset(options['constraint']) then
    slurm_errorf('constraint is unset ')
end
and seen that that
On 2024/07/10 16:25, jack.mellor--- via slurm-users wrote:
We are running slurm 23.02.6.
Our nodes have hyperthreading disabled and we have slurm.conf
set to CPUs=32 for each node (each node has 2 processors with 16 cores each).
When we allocate a job, such as salloc -n 32, it will allocate
a whole no
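For a layout like that, one common approach is to spell the topology out in slurm.conf rather than give a bare CPU count; a rough sketch, with placeholder node names and memory:
NodeName=node[01-04] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=190000
Whether an salloc -n 32 then packs tasks or takes a whole node still comes down to SelectType/SelectTypeParameters, e.g. select/cons_tres with CR_Core.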
On 2023/07/21 00:24, Arsene Marian Alain wrote:
I would like to see the following information for my nodes
"hostname, total mem, free mem and cpus".
So, I used
'sinfo -o "%8n %8m %8e %C"'
but in the output it shows me the memory in MB like "190560" and I
need it in GB (without decimals if possible).
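The %m and %e fields only come out in megabytes, so the usual trick is to post-process; a rough one-liner sketch (untested, widths adjusted from the original format string):
sinfo -N -h -o "%n %m %e %C" | awk '{printf "%-10s %6d %6d %s\n", $1, $2/1024, $3/1024, $4}'
awk's %d conversion drops the decimals, leaving whole GB (strictly GiB).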
ould set a LD_LIBRARY_PATH or similar EnvVar that
might expose your local PMIx library to Slurm jobs, so maybe the
daemons need similar.
Maybe try running the daemon startup with a lookup path set?
Kevin Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
On 2022/09/27 23:26, Groner, Rob wrote:
I have 2 nodes that offer a "gc" feature. Node t-gc-1202 is "normal", and node
t-gc-1201 is dynamic.
I can successfully remove t-gc-1201 and bring it back dynamically. Once I
bring it back, that node
appears JUST LIKE the "normal" node in the sinfo outp
t; or on an external login node.
Rather than maintain two copies of slurm.conf in the SimpleSync
hierarchies, we just maintained one, which has a
Include /etc/opt/slurm/slurm-ctldhost.conf
line in it.
Kevin Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
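For completeness, a rough sketch of the sort of thing such an included file might carry, with placeholder hostnames, so that the main slurm.conf can stay byte-identical everywhere:
# /etc/opt/slurm/slurm-ctldhost.conf
SlurmctldHost=ctld01
SlurmctldHost=ctld02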
On 2022/05/25 03:08, Marko Markoc wrote:
I have been successfully building slurm RPMs for our environment using
rpmbuild and .rpmmacros to set up our custom sysconfdir. This has stopped
working for me with versions 21.08.7 and 21.08.8. RPMs are created but
sysconfdir is set to the default /etc/sl
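A rough sketch of the .rpmmacros approach that used to work; the macro name is the one honoured by earlier slurm.spec files, so treat it as an assumption, and the path is a placeholder:
cat > ~/.rpmmacros <<'EOF'
%_slurm_sysconfdir /etc/opt/slurm
EOF
rpmbuild -ta slurm-21.08.8.tar.bz2
The same definition can also be passed directly, bypassing ~/.rpmmacros:
rpmbuild -ta --define '_slurm_sysconfdir /etc/opt/slurm' slurm-21.08.8.tar.bz2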
We have a group of users who occasionally report seeing jobs start without,
for example, $HOME being set.
Looking at the slurmd logs (info level) on our 20.11.8 node shows the first
instance of an afflicted JobID appearing as
[2022-03-11T00:19:35.117] task/affinity: task_p_slurmd_batch_request:
tch script for this JobID" options but maybe
there are plans to have some kind of
sbatch/srun --rerun_the_script_from_jobid
functionality in the future, in which case pulling the batch script rug
out from underneath the users might not be a good thing.
Any thoughts/insights welcome,
Kevin
On 2021/07/05 11:39, Kevin Buckley wrote:
Upgraded our Cray TDS from 20.11.7 to 20.11.8, without making any
changes to the configuration, but am now not seeing jobs start to
run, whilst seeing messages in the slurmd log akin to these four
Submitted federated JobId=67122494 to tdsname(self
Upgraded our Cray TDS from 20.11.7 to 20.11.8, without making any
changes to the configuration, but am now not seeing jobs start to
run, whilst seeing messages in the slurmd log akin to these four
Submitted federated JobId=67122494 to tdsname(self)
_slurm_rpc_submit_batch_job: JobId=67122494 InitP
Our Cray XCs have been running with a node definition of
Gres=craynetwork:4
in the config for a good while now, even though we have not
activated the "use" of the Gres via the
JobSubmitPlugins=job_submit/cray_aries
setting.
On our TDS, however, the JobSubmitPlugin was active, and the
upgrade t
Slurm 20.02.5
We have a user who is submitting a job script containing
three heterogeneous srun invocations
#SBATCH --nodes=15
#SBATCH --cpus-per-task=20 --ntasks=1 --ntasks-per-node=1
#SBATCH hetjob
#SBATCH --cpus-per-task=1 --ntasks=54 --ntasks-per-node=4
#SBATCH hetjob
#SBATCH --cpus-per-task
On 2021/02/17 14:02, Kota Tsuyuzaki wrote:
.. so that I'm realizing that sacct may exhaust the resources more rapidly
than squeue, from a MySQL point of view, because, if I understand correctly,
squeue doesn't affect MySQL DB query performance in the same way.
Any thoughts?
Best,
Kota
Pulling together so
On 2021/02/10 09:33, Christopher Samuel wrote:
Also getting users to use `sacct` rather than `squeue` to check what
state a job is in can help a lot too, as it reduces the load on slurmctld.
That raises an interesting take on the two utilities, Chris,
in that
1) It should be possible to write a
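As a concrete example of the sacct route, something like
sacct -j 12345678 -X --format=JobID,JobName,State,Elapsed
answers the "what state is my job in" question from the accounting database rather than from slurmctld (the job ID is a placeholder; -X restricts the output to the allocation rather than every step).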
On 2020/12/17 11:34, Chris Samuel wrote:
On 16/12/20 6:21 pm, Kevin Buckley wrote:
The skip is occurring, in src/lua/slurm_lua.c, because of this trap
That looks right to me, that's Doug's code which is checking whether the
file has been updated since slurmctld last read it in.
Probably not specific to 20.11.1, nor to a Cray, but has anyone out there seen
anything like this?
As the slurmctld restarts, after upping the debug level, it all looks hunky-dory,
[2020-12-17T09:23:46.204] debug3: Trying to load plugin
/opt/slurm/20.11.1/lib64/slurm/job_submit_cray_aries.so
[2020-
On 2020/11/05 17:15, Zacarias Benta wrote:
On 05/11/2020 02:00, Kevin Buckley wrote:
We have had a couple of nodes enter a DRAINED state where scontrol
gives the reason as
Reason=slurm.conf
Hi Kevin,
I have no experience with version 20 of slurm, but probably you have
some misconfiguration
We have had a couple of nodes enter a DRAINED state where scontrol
gives the reason as
Reason=slurm.conf
In looking at the SlurmCtlD log we see pairs of lines as follows
update_node: node nid00245 reason set to: slurm.conf
update_node: node nid00245 state set to DRAINED
A search of the int
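Not an answer to why the reason gets set, but once the underlying configuration mismatch is sorted out, the usual way to put such a node back into service would be along these lines:
scontrol show node nid00245 | grep -i reason
scontrol update NodeName=nid00245 State=RESUME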
On 2020/10/22 16:32, Christian Goll wrote:
The problem comes from the fact that libmunge is part of the SLE-HPC
module and not part of SLE itself. As soon as you activate this module
with SUSEConnect you should be able to build it.
The best way to create rpms for SUSE systems is not to use rpmb
ions using the MUNGE authentication service.
libmunge.so.2()(64bit)
libmunge2 = 0.5.13-4.3.1
libmunge2(x86-64) = 0.5.13-4.3.1
#
Kevin Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
that a BZip header or lib is there, it would just be
"nice" if the various distros could just acknowledge each other.
Kevin Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
On 2020/10/20 11:50, Christopher Samuel wrote:
I forgot I do have access to a SLES15 SP1 system, that has:
# rpm -q libmunge2 --provides
libmunge.so.2()(64bit)
libmunge2 = 0.5.14-4.9.1
libmunge2(x86-64) = 0.5.14-4.9.1
munge-libs = 0.5.14
# rpm -q libmunge2
libmunge2-0.5.14-4.9.1.x86_64
So tha
On 2020/10/19 13:58, Chris Samuel wrote:
I've not had problems building Slurm 20.02.x on SLES15 SP0 (CLE7.0 UP01), so
I'm wondering if something big happened with munge in SP1?
According to Chris Dunlap, the vanilla Munge tarball no longer contains
a SUSE-specific SPEC-file, because he has a h
On 2020/10/16 21:38, Alexander Block wrote:
Hi Kevin,
yes, I have build Slurm packages under SLES15 SP1. The trick that worked
for me was to uninstall the SUSE munge packages and build and install
munge from source.
Interestingly enough, Cray's Munge RPMs (at CLE release 6.0.UP07) are built
against an earlier Munge version than that found in SLES15 SP1, but also
provide the libs package, vis:
cray-munge-0.5.11-6.0.7.1_1.1__g1b18658.x86_64
cray-munge-libs-0.5.11-6.0.7.1_1.1__g1b18658.x86_64
Kevin Buckley
--
Supercomputing Sy
although it seems odd
that the SPEC file inside the Slurm tarball can't recognise that it's
on a SLES 15 OS.
Kevin Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
ed!).
Might help build a groundswell for the feature to be included, as well
as exposing the design approach, its code, and any effects, to a wider
audience.
Apologies if I've missed the patch reference though,
Kevin Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
On 2020/07/20 20:26, Riebs, Andy wrote:
Ummm... unless I'm missing something obvious, though the choice of the term "defunct" might not be my choice
(I would have expected "deprecated"), it seems quite clear that the new "SlurmctldHost" parameter
has subsumed the 4 that you've listed. I wasn't
At https://slurm.schedmd.com/slurm.conf.html
we read
BackupAddr
Defunct option, see SlurmctldHost.
BackupController
Defunct option, see SlurmctldHost. ...
ControlAddr
Defunct option, see SlurmctldHost.
ControlMachine
Defunct option, see SlurmctldHost.
but what does "Defunct"
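For reference, the mapping onto the new parameter looks roughly like this, with placeholder hostnames and addresses; the first SlurmctldHost line is the primary and any later lines are backups:
# old, defunct style
#ControlMachine=ctl1
#ControlAddr=10.0.0.1
#BackupController=ctl2
#BackupAddr=10.0.0.2
# new style
SlurmctldHost=ctl1(10.0.0.1)
SlurmctldHost=ctl2(10.0.0.2)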
Just wondering what Slurm's _handle_stray_script messages are telling
us, for example should we be looking to find a home for the strays,
or is Slurm just letting us know that it's got everything under
control ?
Kevin
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
On 2020/05/21 12:14, Christopher Samuel wrote:
On 5/20/20 7:23 pm, Kevin Buckley wrote:
Are they set as part of the job payload creation, and so would ignore
any node-local lookup, or set as the job gets allocated to the various
nodes it will run on?
Looking at git, it's a bit of both:
cript, and had always assumed
the former, as in, a job arrives on a node with the variables already
populated.
Kevin Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
llJobArray?)
config value, along with an associated
--task-failure-action=[0|1|2(|3)]
command-line option, in it, as that would seem to offer a clearer
"this overrides that" mapping?
Then again, as this wasn't what I was originally looking for/at,
maybe I've missed something.
Kevin Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
Hi there,
in SchedMD issue 6787 (https://bugs.schedmd.com/show_bug.cgi?id=6787),
there was a patch, supplied by Doug Jacobsen, that altered the output
of `scontrol completing` to be akin to the following (have cut-and-pasted
Chris Samuel's example from the issue ticket) when run from the command
On 2019/10/07 05:24, Eliot Moss wrote:
On 10/6/2019 9:23 AM, George Wm Turner wrote:
I stumbled across CRIU (Checkpoint/Restore In Userspace) https://criu.org/Main_Page a couple of
weeks ago. I have not utilized it yet; it's on my ToDo list. They claim that it’s packaged with
most distros;
On 2019/10/04 03:26, David Rhey wrote:
Whilst we're not looking to provide succour to meta-scheduler writers,
we can see a need for some way to present, and/or make use of, a
"job has been in state S for time T"
or
"job entered current state at time T"
info.
Hi there,
we're hoping to overcome an issue where some of our users are keen
on writing their own meta-schedulers, so as to try and beat the
actual scheduler, but can't seemingly do as good a job as a scheduler
that's been developed by people who understand scheduling (no real
surprises there!),
On 2019/08/22 04:51, Jarno van der Kolk wrote:
Hi,
I am helping a researcher who encountered an unexpected behaviour with dependencies. He uses both
"singleton" and "after". The minimal working example is as follows:
$ sbatch --hold fakejob.sh
Submitted batch job 25909273
$ sbatch --hold fakej
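A rough sketch of combining the two dependency types on a single submission (the job name is a placeholder, and a comma between dependencies means both must be satisfied):
jid=$(sbatch --parsable --hold --job-name=fakejob fakejob.sh)
sbatch --job-name=fakejob --dependency=singleton,after:${jid} fakejob.sh
scontrol release ${jid}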
On 2019/08/14 03:11, Tim Wickberg wrote:
Slurm version 19.05.2 is now available, and includes a series of minor
bug fixes since 19.05.1 was released over a month ago.
Downloads are available at https://www.schedmd.com/downloads.php .
Release notes follow below.
Looking at
src/common/read_c
etting in slurm.conf
PriorityFlags=FAIR_TREE
is now redundant, because it's the default ?
Kevin Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
On 2019/05/09 23:37, Christopher Benjamin Coffey wrote:
Feel free to try it out and let us know how it works for you!
https://github.com/nauhpc/job_archive
So Chris,
testing it out quickly, and dirtily, using an sbatch with a here document, vis:
$ sbatch -p testq <
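For anyone wanting to reproduce that sort of quick-and-dirty test, sbatch will read the script from stdin, along these lines (the script body is a placeholder):
sbatch -p testq <<'EOF'
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:01:00
hostname
EOF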
This actually just tripped me up on a Cray, but I believe the observation
is still worthy of discussion.
If I take the
slurm-19.05.0.tar.bz2
tarball from the SchedMD download site, and then do a direct RPM
build on it, so
rpmbuild -ta slurm-19.05.0.tar.bz2
what I end up generating are the f
I happened to be reading the NERSC website's news article
https://www.nersc.gov/news-publications/nersc-news/nersc-center-news/2017/nersc-co-hosts-2017-slurm-user-group-meeting/
while searching for a particular talk.
The NERSC news article contains a link to the SchedMD website behind
the "xl
kind; does SLURM use, internally, negative
step IDs that don't usually enter the public consciousness via its logging,
or is this telling us something else ?
Kevin Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
Eml: kevin.buck...@pawsey.org.au
at,
you can get ALL of the messages on SOME of the channels, SOME of the time
sadly it would appear that,
you can't get ALL of the messages on ALL of the channels, ALL of the time
Am I a fool for thinking that you should be able to ?
Kevin Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
It has been pointed out by someone here, who clearly reads the
documentation a lot closer than most, that there appears to
be an inconsistency in the way that the SLURM documentation
explains the link between CompleteWait and KillWait.
The default values for CompleteWait and KillWait are
(cut and p
Is there a way to rename a Reservation ?
Looking at the scontrol docs, it would seem that the UPDATE sub-command
keys off of the reservation name, but what if one wanted to rename the
reservation itself ?
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
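Given that the name is the key the update sub-command works from, the only obvious workaround would be to re-create the reservation under the new name and delete the old one; a rough sketch, where every value shown is a placeholder to be copied from the existing reservation:
scontrol show reservation old_resv
scontrol create reservation ReservationName=new_resv StartTime=2019-03-01T09:00:00 \
  Duration=7-00:00:00 Users=alice,bob Nodes=nid000[01-04]
scontrol delete ReservationName=old_resv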