[slurm-users] Re: how to set slurmdbd.conf if using two slurmdb node with HA database?

2025-02-25 Thread Kevin Buckley via slurm-users
On 2025/02/20 21:55, Daniel Letai via slurm-users wrote: ... Adding AccountingStorageBackupHost pointing to the other node is of course possible, but will mean different slurm.conf files which slurm will complain about. Just thought to note that, in general, it is useful to be aware that one w

[slurm-users] Re: Non-Standard Mail Notification in Job

2024-12-04 Thread Kevin Buckley via slurm-users
On 2024/12/05 05:37, Daniel Miliate via slurm-users wrote: I'm trying to send myself an email notification with a custom body/subject for notification of important events. Is this possible? I've tried a few things in the script below. The bits you may have missed: The emails that Slurm sends o

[slurm-users] Re: Access to --constraint= in Lua cli_filter?

2024-08-19 Thread Kevin Buckley via slurm-users
On 2024/08/19 15:11, Ward Poelmans via slurm-users wrote: Have a look if you can spot them in: function slurm_cli_pre_submit(options, pack_offset) env_json = slurm.json_env() slurm.log_info("ENV: %s", env_json) opt_json = slurm.json_cli_options(options) slurm.log_info("OPTIONS: %

[slurm-users] Access to --constraint= in Lua cli_filter?

2024-08-18 Thread Kevin Buckley via slurm-users
If I supply a --constraint= option to an sbatch/salloc/srun, does the arg appear inside any object that a Lua CLI Filter could access? I've tried this basic check if is_unset(options['constraint']) then slurm_errorf('constraint is unset ') end and seen that that

[slurm-users] Re: Nodes TRES double what is requested

2024-07-11 Thread Kevin Buckley via slurm-users
On 2024/07/10 16:25, jack.mellor--- via slurm-users wrote: We are running slurm 23.02.6. Our nodes have hyperthreading disabled and we have slurm.conf set to CPU=32 for each node (each node has 2 processors with 16 cores). When we allocate a job, such as salloc -n 32, it will allocate a whole no

Re: [slurm-users] slurm sinfo format memory

2023-07-21 Thread Kevin Buckley
On 2023/07/21 00:24, Arsene Marian Alain wrote: I would like to see the following information of my nodes "hostname, total mem, free mem and cpus". So, I used 'sinfo -o "%8n %8m %8e %C"' but in the output it shows me the memory in MB like "190560" and I need it in GB (without decimals if poss
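One way to approach the request above is to post-process the sinfo output rather than look for a format option. The following is only a minimal sketch: it assumes the four whitespace-separated columns produced by the '%8n %8m %8e %C' format quoted in the question, uses integer division by 1024 to drop the decimals, and the function name and sample line are hypothetical.

```python
def mb_to_gb_line(line: str) -> str:
    """Convert the two memory columns (in MB) of one sinfo line to whole GB."""
    host, total_mb, free_mb, cpus = line.split()

    def to_gb(value: str) -> str:
        # Pass non-numeric values (for example N/A) through unchanged.
        return str(int(value) // 1024) if value.isdigit() else value

    return f"{host:<8} {to_gb(total_mb):>8} {to_gb(free_mb):>8} {cpus}"


# Hypothetical sample line, not taken from a real cluster.
sample = "node01   190560   120000   0/32/0/32"
print(mb_to_gb_line(sample))
```

Any header line printed by sinfo would need to be skipped before feeding lines to this function.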

Re: [slurm-users] Trying to troubleshoot slurmctld start failure

2022-10-12 Thread Kevin Buckley
ould set a LD_LIBRARY_PATH or similar EnvVar that might expose your local PMIx library to Slurm jobs, so maybe the daemons need similar. Maybe try running the daemon startup with a lookup path set? Kevin Buckley -- Supercomputing Systems Administrator Pawsey Supercomputing Centre

Re: [slurm-users] Questions about dynamic nodes

2022-09-27 Thread Kevin Buckley
On 2022/09/27 23:26, Groner, Rob wrote: I have 2 nodes that offer a "gc" feature. Node t-gc-1202 is "normal", and node t-gc-1201 is dynamic. I can successfully remove t-gc-1201 and bring it back dynamically. Once I bring it back, that node appears JUST LIKE the "normal" node in the sinfo outp

Re: [slurm-users] Use cases for "include" in slurm.conf?

2022-09-21 Thread Kevin Buckley
or on an external login node. Rather than maintain two copies of slurm.conf in the SimpleSync hierarchies, we just maintained one, which has an Include /etc/opt/slurm/slurm-ctldhost.conf line in it. Kevin Buckley -- Supercomputing Systems Administrator Pawsey Supercomputing Centre

Re: [slurm-users] rpmbuild with custom sysconfdir not working in 21.08.8

2022-05-24 Thread Kevin Buckley
On 2022/05/25 03:08, Marko Markoc wrote: I have been successfully building slurm RPMs for our environment using rpmbuild and .rpmmacros to set up our custom sysconfdir. This has stopped working for me with versions 21.08.7 and 21.08.8. RPMs are created but sysconfdir is set up to default /etc/sl

[slurm-users] The 8 second default for sbatch's --get-user-env: is it "the only default"

2022-03-15 Thread Kevin Buckley
We have a group of users who occasionally report seeing jobs start without, for example, $HOME being set. Looking at the slurmd logs (info level) on our 20.11.8 node, shows the first instance of an afflicted JobID appearing as [2022-03-11T00:19:35.117] task/affinity: task_p_slurmd_batch_request:

[slurm-users] 21.08: Removing batch scripts from the database

2021-10-04 Thread Kevin Buckley
tch script for this JobID" options but maybe there are plans to have some kind of sbatch/srun --rerun_the_script_from_jobid functionality in the future, in which case pulling the batch script rug out from underneath the users might not be a good thing. Any thoughts/insights welcome, Kevin

Re: [slurm-users] 20.11.8: Altered federation code ? "siblings not synced yet" messages

2021-07-04 Thread Kevin Buckley
On 2021/07/05 11:39, Kevin Buckley wrote: Upgraded our Cray TDS from 20.11.7 to 20.11.8, without making any changes to the configuration, but am now not seeing jobs start to run, whilst seeing messages in the slurmd log akin to these four Submitted federated JobId=67122494 to tdsname(self

[slurm-users] 20.11.8: Altered federation code ? "siblings not synced yet" messages

2021-07-04 Thread Kevin Buckley
Upgraded our Cray TDS from 20.11.7 to 20.11.8, without making any changes to the configuration, but am now not seeing jobs start to run, whilst seeing messages in the slurmd log akin to these four Submitted federated JobId=67122494 to tdsname(self) _slurm_rpc_submit_batch_job: JobId=67122494 InitP

[slurm-users] Cray specific: Removing the craynetwork GRES

2021-05-11 Thread Kevin Buckley
Our Cray XCs have been running with a node definition of Gres=craynetwork:4 in the config for a good while now, even though we have not activated the "use" of the Gres via the JobSubmitPlugins=job_submit/cray_aries setting. On our TDS, however, the JobSubmitPlugin was active, and the upgrade t

[slurm-users] Oddities with heterogeneous jobs

2021-04-18 Thread Kevin Buckley
Slurm 20.02.5 We have a user who is submitting a job script containing three heterogeneous srun invocation #SBATCH --nodes=15 #SBATCH --cpus-per-task=20 --ntasks=1 --ntasks-per-node=1 #SBATCH hetjob #SBATCH --cpus-per-task=1 --ntasks=54 --ntasks-per-node=4 #SBATCH hetjob #SBATCH --cpus-per-task

Re: [slurm-users] Rate Limiting of RPC calls

2021-02-19 Thread Kevin Buckley
On 2021/02/17 14:02, Kota Tsuyuzaki wrote: .. so that I'm realizing that sacct may exhaust the resources more rapidly than squeue on mysql point of view because, as I understand correctly, squeue doesn't affect such a mysql db query performance. Any thoughts? Best, Kota Pulling together so

Re: [slurm-users] Rate Limiting of RPC calls

2021-02-09 Thread Kevin Buckley
On 2021/02/10 09:33, Christopher Samuel wrote: Also getting users to use `sacct` rather than `squeue` to check what state a job is in can help a lot too, it reduces the load on slurmctld. That raises an interesting take on the two utilities, Chris, in that 1) It should be possible to write a
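The suggestion above, checking a job's state with sacct rather than squeue, could be wrapped in a small helper along these lines. This is a sketch under stated assumptions: the function names are hypothetical, accounting storage is assumed to be enabled, and -X is used so that only the allocation line is reported. A job array may still return more than one line.

```python
import subprocess


def sacct_state_command(job_id: str) -> list:
    """Build an sacct command line that prints only the job's State field."""
    return ["sacct", "-j", job_id, "-X", "--noheader", "--parsable2", "--format=State"]


def job_state(job_id: str) -> str:
    """Run sacct and return its trimmed output; requires a working sacct on PATH."""
    result = subprocess.run(
        sacct_state_command(job_id), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Because this goes through the accounting database, it trades load on slurmctld for load on slurmdbd and its database, which is the trade-off this thread goes on to discuss.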

Re: [slurm-users] 20.11.1 on Cray: job_submit.lua: SO loaded on CtlD restart: script skipped when job submitted

2020-12-17 Thread Kevin Buckley
On 2020/12/17 11:34, Chris Samuel wrote: On 16/12/20 6:21 pm, Kevin Buckley wrote: The skip is occurring, in src/lua/slurm_lua.c, because of this trap That looks right to me, that's Doug's code which is checking whether the file has been updated since slurmctld last read it in.

[slurm-users] 20.11.1 on Cray: job_submit.lua: SO loaded on CtlD restart: script skipped when job submitted

2020-12-16 Thread Kevin Buckley
Probably not specific to 20.11.1, nor a Cray, but has anyone out there seen anything like this? As the slurmctld restarts, after upping the debug level, it all looks hunky dory, [2020-12-17T09:23:46.204] debug3: Trying to load plugin /opt/slurm/20.11.1/lib64/slurm/job_submit_cray_aries.so [2020-

Re: [slurm-users] update_node / reason set to: slurm.conf / state set to DRAINED

2020-11-08 Thread Kevin Buckley
On 2020/11/05 17:15, Zacarias Benta wrote: On 05/11/2020 02:00, Kevin Buckley wrote: We have had a couple of nodes enter a DRAINED state where scontrol gives the reason as Reason=slurm.conf Hi Kevin, I have no experience with version 20 of slurm, but probably you have some misconfiguration

[slurm-users] update_node / reason set to: slurm.conf / state set to DRAINED

2020-11-04 Thread Kevin Buckley
We have had a couple of nodes enter a DRAINED state where scontrol gives the reason as Reason=slurm.conf In looking at the SlurmCtlD log we see pairs of lines as follows update_node: node nid00245 reason set to: slurm.conf update_node: node nid00245 state set to DRAINED A search of the int

Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it

2020-10-25 Thread Kevin Buckley
On 2020/10/22 16:32, Christian Goll wrote: The problem comes from the fact that libmunge is part of the SLE-HPC module and not part of SLE itself. As soon as you activate this module with SUSEConnect you should be able to build it. The best way to create rpms for SUSE systems is not to use rpmb

Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it

2020-10-21 Thread Kevin Buckley
ions using the MUNGE authentication service. libmunge.so.2()(64bit) libmunge2 = 0.5.13-4.3.1 libmunge2(x86-64) = 0.5.13-4.3.1 # Kevin Buckley -- Supercomputing Systems Administrator Pawsey Supercomputing Centre

Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it

2020-10-20 Thread Kevin Buckley
that a BZip header or lib is there, it would just be "nice" if the various distros could just acknowledge each other. Kevin Buckley -- Supercomputing Systems Administrator Pawsey Supercomputing Centre

Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it

2020-10-20 Thread Kevin Buckley
On 2020/10/20 11:50, Christopher Samuel wrote: I forgot I do have access to a SLES15 SP1 system, that has: # rpm -q libmunge2 --provides libmunge.so.2()(64bit) libmunge2 = 0.5.14-4.9.1 libmunge2(x86-64) = 0.5.14-4.9.1 munge-libs = 0.5.14 # rpm -q libmunge2 libmunge2-0.5.14-4.9.1.x86_64 So tha

Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it

2020-10-19 Thread Kevin Buckley
On 2020/10/19 13:58, Chris Samuel wrote: I've not had problems building Slurm 20.02.x on SLES15 SP0 (CLE7.0 UP01), so I'm wondering if something big happened with munge in SP1? According to Chris Dunlap, the vanilla Munge tarball no longer contains a SUSE-specific SPEC-file, because he has a h

Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it

2020-10-18 Thread Kevin Buckley
On 2020/10/16 21:38, Alexander Block wrote: Hi Kevin, yes, I have built Slurm packages under SLES15 SP1. The trick that worked for me was to uninstall the SUSE munge packages and build and install munge from source. Interestingly enough, Cray's Munge RPMs (at CLE release 6.0.UP07), are built

Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it

2020-10-18 Thread Kevin Buckley
's Munge RPMs (at CLE release 6.0.UP07), are built against an earlier Munge version than that found in SLES15 SP1, but also provide the libs package, vis: cray-munge-0.5.11-6.0.7.1_1.1__g1b18658.x86_64 cray-munge-libs-0.5.11-6.0.7.1_1.1__g1b18658.x86_64 Kevin Buckley -- Supercomputing Sy

[slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it

2020-10-15 Thread Kevin Buckley
lthough it seems odd that the SPEC file inside the Slurm tarball can't recognise that it's on a SLES 15 OS. Kevin Buckley -- Supercomputing Systems Administrator Pawsey Supercomputing Centre

Re: [slurm-users] How to contact slurm developers

2020-09-30 Thread Kevin Buckley
ed!). Might help build a groundswell for the feature to be included, as well as exposing the design approach, its code, and any effects, to a wider audience. Apologies if I've missed the patch reference though, Kevin Buckley -- Supercomputing Systems Administrator Pawsey Supercomputing Centre

Re: [slurm-users] Meaning of "defunct" in description of Slurm parameters

2020-07-20 Thread Kevin Buckley
On 2020/07/20 20:26, Riebs, Andy wrote: Ummm... unless I'm missing something obvious, though the choice of the term "defunct" might not be my choice (I would have expected "deprecated"), it seems quite clear that the new "SlurmctldHost" parameter has subsumed the 4 that you've listed. I wasn't

[slurm-users] Meaning of "defunct" in description of Slurm parameters

2020-07-19 Thread Kevin Buckley
At https://slurm.schedmd.com/slurm.conf.html we read BackupAddr Defunct option, see SlurmctldHost. BackupController Defunct option, see SlurmctldHost. ... ControlAddr Defunct option, see SlurmctldHost. ControlMachine Defunct option, see SlurmctldHost. but what does "Defunct"
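Whatever "Defunct" turns out to mean in practice, a site can at least check whether its own configuration still uses any of the four options named above. The sketch below is illustrative only: the function name and the sample configuration are hypothetical, only the four parameters quoted from the man page are checked, and parameter names are matched case-insensitively.

```python
# The four options the man page excerpt above marks as defunct.
DEFUNCT = {"backupaddr", "backupcontroller", "controladdr", "controlmachine"}


def find_defunct(conf_text: str) -> list:
    """Return (line number, parameter name) pairs for defunct options found."""
    hits = []
    for lineno, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key.lower() in DEFUNCT:
            hits.append((lineno, key))
    return hits


# Hypothetical configuration text, for illustration only.
sample_conf = """\
ClusterName=demo
ControlMachine=head1   # old style
#BackupController=head2
SlurmctldHost=head1
"""
print(find_defunct(sample_conf))
```

Commented-out settings are ignored, so only the live ControlMachine line is reported for the sample text.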

[slurm-users] Implications of _handle_stray_script messages

2020-07-14 Thread Kevin Buckley
Just wondering what Slurm's _handle_stray_script messages are telling us, for example should we be looking to find a home for the strays, or is Slurm just letting us know that it's got everything under control ? Kevin -- Supercomputing Systems Administrator Pawsey Supercomputing Centre

Re: [slurm-users] Are SLURM_JOB_USER and SLURM_JOB_UID always constant and available

2020-05-21 Thread Kevin Buckley
On 2020/05/21 12:14, Christopher Samuel wrote: On 5/20/20 7:23 pm, Kevin Buckley wrote: Are they set as part of the job payload creation, and so would ignore any node local lookup, or set as the job gets allocated to the various nodes it will run on? Looking at git, it's a bit of both:

[slurm-users] Are SLURM_JOB_USER and SLURM_JOB_UID always constant and available

2020-05-20 Thread Kevin Buckley
cript, and had always assumed the former, as in, a job arrives on a node with the variables already populated. Kevin Buckley -- Supercomputing Systems Administrator Pawsey Supercomputing Centre

[slurm-users] KillOnBadExit or srun's -K: step, job, task, process all get a mention in dispatches

2020-05-19 Thread Kevin Buckley
llJobArray?) config value, along with an associated --task-failure-action=[0|1|2(|3)] command-line option, in it, as that would seem to offer a clearer "this overrides that" mapping? Then again, as this wasn't what I was originally looking for/at, maybe I've missed something. Kevin Buckley -- Supercomputing Systems Administrator Pawsey Supercomputing Centre

[slurm-users] Observation on SchedMD issue 6787: Add EndTime, CompletingTime to output of `scontrol completing`

2019-10-28 Thread Kevin Buckley
Hi there, in SchedMD issue 6787 (https://bugs.schedmd.com/show_bug.cgi?id=6787), there was a patch, supplied by Doug Jacobsen, that altered the output of `scontrol completing` to be akin to the following (have cut-and-pasted Chris Samuel's example from the issue ticket) when run from the command

Re: [slurm-users] [External] Re: Status of BLCR?

2019-10-14 Thread Kevin Buckley
On 2019/10/07 05:24, Eliot Moss wrote: On 10/6/2019 9:23 AM, George Wm Turner wrote: I stumbled across CRIU (Checkpoint/Restore In Userspace) https://criu.org/Main_Page a couple of weeks ago. I have not utilized it yet; it's on my ToDo list. They claim that it’s packaged with most distros;

Re: [slurm-users] Does Slurm store "time in current state" values anywhere ?

2019-10-03 Thread Kevin Buckley
On 2019/10/04 03:26, David Rhey wrote: Whilst we're not looking to provide succour to meta-scheduler writers, we can see a need for some way to present and/or make use of, a "job has been in state S for time T" or "job entered current state at time T" info.

[slurm-users] Does Slurm store "time in current state" values anywhere ?

2019-10-03 Thread Kevin Buckley
Hi there, we're hoping to overcome an issue where some of our users are keen on writing their own meta-schedulers, so as to try and beat the actual scheduler, but can't seemingly do as good a job as a scheduler that's been developed by people who understand scheduling (no real surprises there!),

Re: [slurm-users] Dependencies with singleton and after

2019-08-22 Thread Kevin Buckley
On 2019/08/22 04:51, Jarno van der Kolk wrote: Hi, I am helping a researcher who encountered an unexpected behaviour with dependencies. He uses both "singleton" and "after". The minimal working example is as follows: $ sbatch --hold fakejob.sh Submitted batch job 25909273 $ sbatch --hold fakej

Re: [slurm-users] Slurm version 19.05.2 is now available

2019-08-13 Thread Kevin Buckley
On 2019/08/14 03:11, Tim Wickberg wrote: Slurm version 19.05.2 is now available, and includes a series of minor bug fixes since 19.05.1 was released over a month ago. Downloads are available at https://www.schedmd.com/downloads.php . Release notes follow below. Looking at src/common/read_c

[slurm-users] Slurm 19's "Changed the default fair share algorithm to "fair tree".": implications for slurm.conf PriorityFlags setting

2019-07-15 Thread Kevin Buckley
etting in slurm.conf PriorityFlags=FAIR_TREE is now redundant, because it's the default ? Kevin Buckley -- Supercomputing Systems Administrator Pawsey Supercomputing Centre

Re: [slurm-users] Slurm Jobscript Archiver

2019-06-17 Thread Kevin Buckley
On 2019/05/09 23:37, Christopher Benjamin Coffey wrote: Feel free to try it out and let us know how it works for you! https://github.com/nauhpc/job_archive So Chris, testing it out quickly, and dirtily, using an sbatch with a here document, vis: $ sbatch -p testq <

[slurm-users] Slurm tarball numbering vs RPM numbering for first release tarballs.

2019-06-09 Thread Kevin Buckley
This actually just tripped me up on a Cray, but I believe the observation is still worthy of discussion. If I take the slurm-19.05.0.tar.bz2 tarball from the SchedMD download site, and then do a direct RPM build on it, so rpmbuild -ta slurm-19.05.0.tar.bz2 what I end up generating are the f

[slurm-users] SLURM User Group Meetings: "Back Issues"

2019-03-27 Thread Kevin Buckley
I happened to be reading the NERSC website's news article https://www.nersc.gov/news-publications/nersc-news/nersc-center-news/2017/nersc-co-hosts-2017-slurm-user-group-meeting/ whilst searching for a particular talk. The NERSC news article contains a link to the SchedMD website behind the "xl

[slurm-users] What is the 2^32-1 values in "stepd_connect to .4294967295 failed" telling you

2019-03-08 Thread Kevin Buckley
kind; does SLURM use, internally, negative step IDs that don't usually enter the public consciousness via its logging, or is this telling us something else ? Kevin Buckley -- Supercomputing Systems Administrator Pawsey Supercomputing Centre Eml: kevin.buck...@pawsey.org.au
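The arithmetic behind the question can be checked directly: 4294967295 is 2^32-1, which is the value a signed 32-bit -1 takes when the same bits are read as unsigned. The sketch below shows only that conversion; it makes no claim about which sentinel values Slurm actually uses internally, and the function names are hypothetical.

```python
import struct


def as_uint32(value: int) -> int:
    """Reinterpret an integer as an unsigned 32-bit value."""
    return value & 0xFFFFFFFF


def as_int32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    (signed,) = struct.unpack("<i", struct.pack("<I", value))
    return signed


print(as_uint32(-1))          # 4294967295
print(as_int32(4294967295))   # -1
print(2**32 - 1)              # 4294967295
```

So a step ID logged as 4294967295 is consistent with a -1, or an all-ones "no value" marker, being printed through an unsigned 32-bit field.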

[slurm-users] An observation on SLURM's logging

2018-11-27 Thread Kevin Buckley
at, you can get ALL of the messages on SOME of the channels, SOME of the time sadly it would appear that, you can't get ALL of the messages on ALL of the channels, ALL of the time Am I a fool for thinking that you should be able to ? Kevin Buckley -- Supercomputing Systems Administrator Pawsey Supercomputing Centre

[slurm-users] Default value inconsistency between CompleteWait and KillWait in slurm.conf docs

2018-11-04 Thread Kevin Buckley
It has been pointed out by someone here, who clearly reads the documentation a lot more closely than most, that there appears to be an inconsistency in the way that the SLURM documentation explains the link between CompleteWait and KillWait. The default values for CompleteWait and KillWait are (cut and p

[slurm-users] Renaming a Reservation

2018-09-24 Thread Kevin Buckley
Is there a way to rename a Reservation ? Looking at the scontrol docs, it would seem that the UPDATE sub-command keys off of the reservation name, but what if one wanted to rename the reservation itself ? -- Supercomputing Systems Administrator Pawsey Supercomputing Centre