[slurm-users] Re: Unexpected node got allocation

2025-01-09 Thread Steffen Grunewald via slurm-users
ect bandwidth for multi-node jobs. Your node41 might be one - or the first one of a series - that would leave bigger chunks unused for bigger tasks.) HTH, Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert Einstein Institute) Am Mühlenberg 1

[slurm-users] Re: Permission denied for slurmdbd.conf

2025-01-07 Thread Steffen Grunewald via slurm-users
/usr/local/slurm/etc Best, Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert Einstein Institute) Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany ~~~ Fon: +49-331-567 7274 Mail: steffen.grunewald(at)aei.mpg.de ~~~ -- slurm-users mailing

[slurm-users] Re: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions

2025-01-07 Thread Steffen Grunewald via slurm-users
bs in higher > priority partitions) > Anyone can help to fix this? Not without a little bit of extra information, e.g. "sinfo -p cpu" and maybe "scontrol show job=26" Best, Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational

[slurm-users] Re: formatting node names

2025-01-07 Thread Steffen Grunewald via slurm-users
ut. (I tend to pipe the output through "xargs" most of the time, too.) Best, Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert Einstein Institute) Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany ~~~ Fon: +49-331-567 7274 Mail: st

[slurm-users] Re: slurm nodes showing down*

2024-12-09 Thread Steffen Grunewald via slurm-users
g mode. Stop their services, start them manually one by one (ctld first), then watch whether they talk to each other, and if they don't, learn what stops them from doing so - then iterate editing the config, "scontrol reconfig", lather, rinse, repeat. You're the only one knowing y
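
The advice above amounts to running the daemons in the foreground and watching whether they reach each other. A minimal sketch of that loop, assuming standard systemd units and default binary names:

    # on the controller: stop the service, then run slurmctld in the foreground, verbosely
    systemctl stop slurmctld
    slurmctld -D -vvv

    # on a compute node (second terminal): same for slurmd
    systemctl stop slurmd
    slurmd -D -vvv

    # after each edit of slurm.conf on all hosts, make running daemons re-read it
    scontrol reconfig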

[slurm-users] Re: error: Unable to contact slurm controller (connect failure)

2024-11-18 Thread Steffen Grunewald via slurm-users
adjusted after an upgrade, maybe you're hitting the same?) Best, Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert Einstein Institute) Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany ~~~ Fon: +49-331-567 7274 Mail: steffen.grunewald

[slurm-users] Re: Find out submit host of past job?

2024-08-07 Thread Steffen Grunewald via slurm-users
On Wed, 2024-08-07 at 08:55:21 -0400, Slurm users wrote: > Warning on that one, it can eat up a ton of database space (depending on > size of environment, uniqueness of environment between jobs, and number of > jobs). We had it on and it nearly ran us out of space on our database host. > That said
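
For context: the space warning above concerns storing per-job data such as the batch environment in the accounting database. A hedged slurm.conf sketch, assuming a Slurm release recent enough to support these flags; the submit host then ends up in the stored environment as SLURM_SUBMIT_HOST:

    # slurm.conf (sketch) - store each batch job's environment in the database;
    # this is the feature that can consume a lot of space on busy clusters.
    AccountingStoreFlags=job_env

Whether and how sacct can display the stored environment for a finished job depends on the installed version; check its man page.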

[slurm-users] Find out submit host of past job?

2024-08-07 Thread Steffen Grunewald via slurm-users
"sacct" just lacking a job field, or is this info indeed dropped and not stored in the DB? Thanks, Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert Einstein Institute) Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany ~~~ Fon: +49

[slurm-users] Re: Slurm fails before nvidia-smi command

2024-07-29 Thread Steffen Grunewald via slurm-users
entry (or even run nvidia-smi in this way)... Best, Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert Einstein Institute) Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany ~~~ Fon: +49-331-567 7274 Mail: steffen.grunewald(at)aei.mpg.de ~~~ -

[slurm-users] Re: Background tasks in Slurm scripts?

2024-07-26 Thread Steffen Grunewald via slurm-users
On Fri, 2024-07-26 at 10:42:45 +0300, Slurm users wrote: > Good Morning; > > This is not a slurm issue. This is a default shell script feature. If you > want to wait to finish until all background processes, you should use wait > command after all. Thank you - I already knew this in principle, an
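
The shell behaviour being referred to: a batch script ends when its foreground commands end, so backgrounded work has to be collected with wait. A minimal sbatch sketch with placeholder program names:

    #!/bin/bash
    #SBATCH --job-name=bg-demo
    #SBATCH --ntasks=4

    # launch several independent tasks in the background
    ./worker input1 &
    ./worker input2 &
    ./worker input3 &
    ./worker input4 &

    # without this, the script (and hence the Slurm job) would end
    # while the workers are still running
    wait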

[slurm-users] Background tasks in Slurm scripts?

2024-07-26 Thread Steffen Grunewald via slurm-users
to find that place... Thanks, Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert Einstein Institute) Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany ~~~ Fon: +49-331-567 7274 Mail: steffen.grunewald(at)aei.mpg.de ~~~ -- slurm-users mailing

[slurm-users] Re: How to exclude master from computing? Set to DRAINED?

2024-06-24 Thread Steffen Grunewald via slurm-users
in such a case? Since this only applies to a node that has been defined in the config, and you (correctly) didn't do so, there's no need (and no means) to "drain" it. Best, Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert

[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64

2024-05-06 Thread Steffen Grunewald via slurm-users
ach that of possible exceptions to the implemented rule (there may be more than just ARM - what about RISC-V, PPC64*, ...?) (b) interrupt the build in a reasonable place, find all occurrences of the wrong emulation string, and replace it with its existing counterpart. There should be no doubt which

[slurm-users] Re: single node configuration

2024-04-10 Thread Steffen Grunewald via slurm-users
0:00 1 (Nodes > required for job are DOWN, DRAINED or reserved for jobs in higher priority > partitions) What does "sinfo" tell you? Is there a running slurmd? - S -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert Ein

Re: [slurm-users] slurm-config on NFS-volume

2024-01-24 Thread Steffen Grunewald
On Wed, 2024-01-24 at 14:34:02 +0100, Steffen Grunewald wrote: > > After=network.target munge.service autofs.service Also, probably the more important change, RequiresMountsFor=/home/slurm > because my /home directories are automounted and /etc/slurm is pointing to > /ho
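
The change described above is a systemd dependency tweak so the Slurm daemons only start once the automounted path holding the configuration is available. A sketch of a drop-in override, assuming it is applied to the daemon unit in question (slurmctld and/or slurmd) and the /home/slurm path from this thread:

    # /etc/systemd/system/slurmctld.service.d/override.conf  (similarly for slurmd)
    # e.g. created via "systemctl edit slurmctld"; if edited by hand, run "systemctl daemon-reload"
    [Unit]
    After=network.target munge.service autofs.service
    RequiresMountsFor=/home/slurm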

Re: [slurm-users] slurm-config on NFS-volume

2024-01-24 Thread Steffen Grunewald
is missing here ? Quite probably a dependency between services. Good luck, Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert Einstein Institute) Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany ~~~ Fon: +49-331-567 7274 Mail: steffen.grunewald(at)aei.mpg.de ~~~

Re: [slurm-users] Debian dist-upgrade?

2023-04-26 Thread Steffen Grunewald
Hi all, after several delays, we're done with the move now. On Tue, 2023-02-07 at 14:54:55 +0100, Steffen Grunewald wrote: > Hi Loris, > > On Tue, 2023-01-24 at 16:48:26 +0100, Loris Bennett wrote: > > Hi Steffen, > > > > Could you create/find a deb-package fo

Re: [slurm-users] Debian dist-upgrade?

2023-02-07 Thread Steffen Grunewald
hich could easily be scripted for most of the gory details). Thanks, Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert Einstein Institute) Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany ~~~ Fon: +49-331-567 7274 Mail: steffen.grunewald(at)aei.mpg.de ~~~

Re: [slurm-users] Debian dist-upgrade?

2023-01-24 Thread Steffen Grunewald
Salut Stephane, On Tue, 2023-01-24 at 15:53:03 +0100, Stephane Vaillant wrote: > On 24/01/2023 at 10:09, Steffen Grunewald wrote: > > Hello, > > > > is there anyone using plain Debian with Slurm (as provided by the OS > > repository), > > who might have sugges

[slurm-users] Debian dist-upgrade?

2023-01-24 Thread Steffen Grunewald
's version) while Bullseye comes with 20.11.x, which is beyond the 2-version upgrade range. Is there hope (and how to verify that) that a dist-upgrade would do the right thing? Thanks, Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert Ein

Re: [slurm-users] Detecting non-MPI jobs running on multiple nodes

2022-09-29 Thread Steffen Grunewald
ant to know that - given that there are approaches to multi-node operation beyond MPI (Charm++ comes to mind)? Best, Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert Einstein Institute) Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany ~~~

Re: [slurm-users] Deb packages for Ubuntu

2022-07-21 Thread Steffen Grunewald
Hi Luis, On Wed, 2022-07-20 at 16:03:00 -0700, Luis Huang wrote: > In the past we've been using Centos 7 with slurm.spec file provided by > schedmd to build the rpms. This is working great where we can deploy the > rpms and perform upgrades via puppet. > > As we are moving to Ubuntu. I noticed th

Re: [slurm-users] slurmctld/slurmdbd filesystem/usermap requirements

2022-02-10 Thread Steffen Grunewald
es, of course) - I never saw GIDs mentioned though. Does this help? - Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert Einstein Institute) Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany ~~~ Fon: +49-331-567 7274 Mail: steffen.grunewald(at)aei.mpg.de ~~~

Re: [slurm-users] Node specs for Epyc 7xx3 processors?

2021-12-22 Thread Steffen Grunewald
Hi Brice, old dog still learning new tricks... On Wed, 2021-12-22 at 17:40:17 +0100, Brice Goglin wrote: > > On 22/12/2021 at 17:27, Steffen Grunewald wrote: > > On Wed, 2021-12-22 at 16:02:00 +, Stuart MacLachlan wrote: > > > Hi Steffan, > > > > >

Re: [slurm-users] Node specs for Epyc 7xx3 processors?

2021-12-22 Thread Steffen Grunewald
> > Will I have to wait until the machines have arrived, and do some experiments, > or did someone already retrieve the right numbers, and is willing to share? > > Thanks, > Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert Einstein Institute) Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany ~~~ Fon: +49-331-567 7274 Mail: steffen.grunewald(at)aei.mpg.de ~~~

[slurm-users] Node specs for Epyc 7xx3 processors?

2021-12-22 Thread Steffen Grunewald
13 machines, and perhaps add a single 7713 one - "lscpu" outputs are wildly different, while the total number of cores is correct. Will I have to wait until the machines have arrived, and do some experiments, or did someone already retrieve the right numbers, and is willing to share?
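
The rest of the thread is cut off here; as a generic illustration only, the NodeName parameters can usually be read straight off "lscpu". A hedged example with made-up numbers (hypothetical dual-socket nodes, 64 cores per socket, SMT enabled) rather than the actual machines discussed:

    # lscpu reports: Socket(s): 2, Core(s) per socket: 64, Thread(s) per core: 2
    NodeName=epyc[01-16] Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=512000 State=UNKNOWN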

Re: [slurm-users] Requirement of one GPU job should run in GPU nodes in a cluster

2021-12-16 Thread Steffen Grunewald
On Fri, 2021-12-17 at 13:03:32 +0530, Sudeep Narayan Banerjee wrote: > Hello All: Can we please restrict one GPU job on one GPU node? > > That is, > a) when we submit a GPU job on an empty node (say gpu2) requesting 16 cores > as that gives the best performance in the GPU and it gives best perform
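
The reply is not visible in this snippet. Purely as one generic possibility - not necessarily what was suggested on the list - whole-node GPU jobs can be enforced by making the GPU partition exclusive:

    # slurm.conf sketch (node names are placeholders):
    # every job in this partition gets its allocated node(s) exclusively
    PartitionName=gpu Nodes=gpu[1-4] OverSubscribe=EXCLUSIVE State=UP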

Re: [slurm-users] Auto-select partition?

2020-10-01 Thread Steffen Grunewald
ion if the user > doesn't specify any? A bit ugly, but what may also work is to (ab)use the topology feature to group hosts together - topological subdomains get assigned first. - Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert
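
The "(ab)use the topology feature" idea means grouping hosts under switches in topology.conf so the scheduler prefers to fill one group before touching the next. A minimal sketch with made-up switch and node names:

    # slurm.conf: TopologyPlugin=topology/tree
    # topology.conf:
    SwitchName=leaf1 Nodes=node[001-040]
    SwitchName=leaf2 Nodes=node[041-080]
    SwitchName=spine Switches=leaf[1-2]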

Re: [slurm-users] Slurmctld and log file

2020-09-08 Thread Steffen Grunewald
daemon is "slurmctld" not "slurmdctl", so you have rotated (and created) the wrong file, which will not be written to... Cheers, Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert Einstein Institute) Am Mühlenberg 1 * D-144
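
The actual problem above was only the misspelled file name (slurmdctl vs. slurmctld). For reference, a hedged logrotate sketch for the correct log; the postrotate signal follows the usual advice of making slurmctld reopen its log on SIGUSR2 - verify against the man page of the installed release:

    # /etc/logrotate.d/slurmctld  (sketch; path depends on SlurmctldLogFile)
    /var/log/slurm/slurmctld.log {
        weekly
        rotate 8
        compress
        missingok
        notifempty
        postrotate
            # SIGUSR2 makes slurmctld reopen its log file
            systemctl kill -s SIGUSR2 slurmctld
        endscript
    }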

Re: [slurm-users] Node suspend / Power saving - for *idle* nodes only?

2020-05-14 Thread Steffen Grunewald
On Thu, 2020-05-14 at 13:10:04 +, Florian Zillner wrote: > Hi, > > I'm experimenting with slurm's power saving feature and shutdown of "idle" > nodes works in general, also the power up works when "idle~" nodes are > requested. > So far so good, but slurm is also shutting down nodes that are
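
For context, the behaviour under discussion is driven by a handful of slurm.conf power-saving parameters; a minimal, hedged sketch with placeholder paths and node names (the thread's actual resolution is not visible in this snippet):

    SuspendProgram=/usr/local/sbin/node_poweroff.sh
    ResumeProgram=/usr/local/sbin/node_poweron.sh
    SuspendTime=1800                      # idle seconds before a node is suspended
    SuspendExcNodes=login1,node[01-02]    # never power these down
    SuspendExcParts=interactive           # nor any node in this partition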

Re: [slurm-users] Slurm on Debian Stretch

2020-05-11 Thread Steffen Grunewald
ld need a restart of the slurmctld to accept new users. I suspect that this has to do with SlurmUser, StorageUser and AccountingStorageUser all set to "slurm" (this works on the CentOS-7 HPC cluster next door) - do you have any advice? Thanks, Steffen -- Steffen Grunewald, Cluster

[slurm-users] Slurm on Debian Stretch

2020-03-03 Thread Steffen Grunewald
weird - "scontrol reconfig" tends to kill the slurmctld Upgrading to Buster isn't an option yet, and I doubt the issues would vaporize by upgrading. Any suggestions? Thanks, - S -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Alber

[slurm-users] Solved, Re: Forcibly end "zombie" jobs?

2020-01-10 Thread Steffen Grunewald
ndication of a slurmctld crash corresponding to that day. In any case, the situation apparently has been resolved - I've got to wait for the daily rollup to fix the old accounting data though. Thanks a lot! - Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitat

[slurm-users] Forcibly end "zombie" jobs?

2020-01-08 Thread Steffen Grunewald
's no end date)? Thanks for any suggestion. - Steffen -- Steffen Grunewald, Cluster Administrator Max Planck Institute for Gravitational Physics (Albert Einstein Institute) Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany ~~~ Fon: +49-331-567 7274 Mail: steffen.grunewald(at)aei.mpg.de ~~~

Re: [slurm-users] After reboot nodes are in state = down

2019-09-27 Thread Steffen Grunewald
On Fri, 2019-09-27 at 14:58:40 +0200, Rafał Kędziorski wrote: > Am Fr., 27. Sept. 2019 um 13:50 Uhr schrieb Steffen Grunewald < > steffen.grunew...@aei.mpg.de>: > > On Fri, 2019-09-27 at 11:19:16 +0200, Juergen Salk wrote: > > > > > > you may try setti

Re: [slurm-users] After reboot nodes are in state = down

2019-09-27 Thread Steffen Grunewald
On Fri, 2019-09-27 at 11:19:16 +0200, Juergen Salk wrote: > Hi Rafał, > > you may try setting `ReturnToService=2´ in slurm.conf. > > Best regards > Jürgen Caveat: A spontaneously rebooting machine may create a "black hole" this way. - Steffen -- Steffen Grunewal
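
A sketch of the setting and the caveat being discussed:

    # slurm.conf
    # ReturnToService=2: a DOWN node becomes usable again as soon as it registers
    # with a valid configuration. Caveat from the reply: a spontaneously rebooting
    # node (killing its jobs each time) may keep accepting new work - a "black hole".
    ReturnToService=2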

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-11 Thread Steffen Grunewald
ot glue about what is happening. Maybe the next thing to try is to > use "sdiag" to inspect the server. Another complication is that the problem > is random, so we put "sdiag" in a cronjob? Is there a better way to run > "sdiag" periodically? > >
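
Regarding running "sdiag" periodically: a cron job is indeed the simplest option; a minimal sketch with arbitrary interval and paths:

    # /etc/cron.d/sdiag - snapshot scheduler diagnostics every 5 minutes
    */5 * * * * slurm /usr/bin/sdiag >> /var/log/slurm/sdiag.log 2>&1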

Re: [slurm-users] Is it possible to select the BatchHost for a job through some sort of prolog script?

2018-07-06 Thread Steffen Grunewald
" node? This may require changing some more environment variables, and may harm signalling. Okay, my suggestion reads like a terrible kludge (which it certainly is), but AFAIK there's no way to tell Slurm about "preferred first nodes". - S -- Steffen Grunewald, Cluster Administrat

Re: [slurm-users] Cluster not booting after upgrade to debian jessie

2018-01-09 Thread Steffen Grunewald
On Tue, 2018-01-09 at 13:16:12 +0100, Elisabetta Falivene wrote: > Root file system is on the master. I'm being able to boot the machine > changing kernel. Grub allow to boot from two kernel: > > > kernel 3.2.0-4-amd64 > > kernel 3.16.0-4-amd64 > > > The problem is with kernel 3.16, but boots

Re: [slurm-users] How to short work folder in `squeue`

2018-01-09 Thread Steffen Grunewald
On Tue, 2018-01-09 at 18:25:27 +0800, 刘科 wrote: > Dear slurm developers: > > I alias squeue to squeue -u $(whoami) -o "%.10i %.10P %.20j %.2t %.10M %.9l %.6D %R %Z". Now it shows the full path of working dir, but it seems too long to show all the useful messages in online. > > Is there any wa
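
The answer is truncated in this snippet. As one generic workaround - not necessarily what was suggested - the %Z working-directory column can be shortened by post-processing the output, e.g. collapsing the home-directory prefix:

    # shell alias sketch: replace the leading $HOME in the output with "~"
    alias sq='squeue -u $(whoami) -o "%.10i %.10P %.20j %.2t %.10M %.9l %.6D %R %Z" | sed "s|$HOME|~|g"'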

Re: [slurm-users] slurm conf with single machine with multi cores.

2017-11-29 Thread Steffen Grunewald
Hi David, On Wed, 2017-11-29 at 14:45:06 +, david vilanova wrote: > Hello, > I have installed latest 7.11 release and my node is shown as down. > I have a single physical server with 12 cores so not sure the conf below is > correct ?? can you help ?? > > In slurm.conf the node is configured as
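
The thread is cut off here; the usual fix for a single multi-core box is to make the NodeName line match what slurmd itself detects. A hedged sketch with a placeholder host name and memory size:

    # print the configuration slurmd detects on the machine itself
    slurmd -C

    # slurm.conf sketch for one 12-core node (adjust to the slurmd -C output)
    NodeName=myserver CPUs=12 Sockets=1 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=64000 State=UNKNOWN
    PartitionName=debug Nodes=myserver Default=YES MaxTime=INFINITE State=UP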