ect bandwidth for multi-node jobs. Your node41 might be
one - or the first one of a series - that would leave bigger chunks unused
for bigger tasks.)
HTH,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1
/usr/local/slurm/etc
Best,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
> jobs in higher
> priority partitions)
> Anyone can help to fix this?
Not without a little bit of extra information,
e.g. "sinfo -p cpu" and maybe "scontrol show job=26"
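Something along these lines (partition name and job id taken from your message):

  sinfo -p cpu -N -l       # per-node state of the "cpu" partition
  scontrol show job=26     # full job record, including the Reason= field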
Best,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational
ut.
(I tend to pipe the output through "xargs" most of the time, too.)
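A made-up example of what I mean (the hostlist expression is invented):

  scontrol show hostnames node[01-04] | xargs
  # prints "node01 node02 node03 node04" on a single line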
Best,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: st
g mode.
Stop their services, start them manually one by one (ctld first), then
watch whether they talk to each other, and if they don't, learn what stops
them from doing so - then iterate editing the config, "scontrol reconfig",
lather, rinse, repeat.
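In practice that means something like (a sketch; add more v's for more noise):

  systemctl stop slurmctld      # on the controller
  slurmctld -D -vvv             # run in the foreground, very verbose
  # on a compute node, once the controller answers:
  systemctl stop slurmd
  slurmd -D -vvv
  # after each round of config editing:
  scontrol reconfig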
You're the only one knowing y
adjusted after an upgrade, maybe you're hitting the same?)
Best,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald
On Wed, 2024-08-07 at 08:55:21 -0400, Slurm users wrote:
> Warning on that one, it can eat up a ton of database space (depending on
> size of environment, uniqueness of environment between jobs, and number of
> jobs). We had it on and it nearly ran us out of space on our database host.
> That said
Is "sacct" just lacking a job field, or is this info indeed dropped and
not stored in the DB?
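For context, I assume the setting in question is something like this in slurm.conf:

  AccountingStoreFlags=job_script,job_env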
Thanks,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49
entry (or even run nvidia-smi in this way)...
Best,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
On Fri, 2024-07-26 at 10:42:45 +0300, Slurm users wrote:
> Good Morning;
>
> This is not a Slurm issue, but standard shell script behaviour. If you
> want to wait until all background processes have finished, you should use
> the wait command at the end.
Thank you - I already knew this in principle, an
to find that place...
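For the archives, a minimal sketch of the pattern in a batch script (task names invented):

  #!/bin/bash
  #SBATCH --ntasks=1
  ./task_a &
  ./task_b &
  wait   # don't let the script (and thus the job) exit before the background processes finish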
Thanks,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
in such a case?
Since this only applies to a node that has been defined in the config,
and you (correctly) didn't do so, there's no need (and no means) to
"drain" it.
Best
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert
ach that
of possible exceptions to the implemented rule (there may be more than just
ARM - what about RISC-V, PPC64*, ...?)
(b) interrupt the build in a reasonable place, find all occurrences of the
wrong emulation string, and replace it with its existing counterpart
There should be no doubt which
0:00 1 (Nodes
> required for job are DOWN, DRAINED or reserved for jobs in higher priority
> partitions)
What does "sinfo" tell you? Is there a running slurmd?
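I.e. something like:

  sinfo -R                        # down/drained nodes with their Reason
  scontrol show node <nodename>   # State= and Reason= of the node in question
  systemctl status slurmd         # on the node itself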
- S
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Ein
On Wed, 2024-01-24 at 14:34:02 +0100, Steffen Grunewald wrote:
>
> After=network.target munge.service autofs.service
Also, probably the more important change,
RequiresMountsFor=/home/slurm
> because my /home directories are automounted and /etc/slurm is pointing to
> /ho
is missing here?
Quite probably a dependency between services.
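A drop-in (created with "systemctl edit slurmctld", and analogously for slurmd) along these lines should cover it - paths as in your mail:

  [Unit]
  After=network.target munge.service autofs.service
  RequiresMountsFor=/home/slurm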
Good luck,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
Hi all,
after several delays, we're done with the move now.
On Tue, 2023-02-07 at 14:54:55 +0100, Steffen Grunewald wrote:
> Hi Loris,
>
> On Tue, 2023-01-24 at 16:48:26 +0100, Loris Bennett wrote:
> > Hi Steffen,
> >
> > Could you create/find a deb-package fo
hich could easily be scripted for most of the gory details).
Thanks,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
Salut Stephane,
On Tue, 2023-01-24 at 15:53:03 +0100, Stephane Vaillant wrote:
> On 24/01/2023 at 10:09, Steffen Grunewald wrote:
> > Hello,
> >
> > is there anyone using plain Debian with Slurm (as provided by the OS
> > repository),
> > who might have sugges
's version) while Bullseye comes with
20.11.x, which is beyond the 2-version upgrade range. Is there hope (and how to
verify that) that a dist-upgrade would do the right thing?
Thanks,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Ein
ant to know that - given that there are
approaches to multi-node operation beyond MPI (Charm++ comes to mind)?
Best,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Hi Luis,
On Wed, 2022-07-20 at 16:03:00 -0700, Luis Huang wrote:
> In the past we've been using Centos 7 with slurm.spec file provided by
> schedmd to build the rpms. This is working great where we can deploy the
> rpms and perform upgrades via puppet.
>
> As we are moving to Ubuntu. I noticed th
es, of course) -
I never saw GIDs mentioned though.
Does this help?
- Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
Hi Brice,
old dog still learning new tricks...
On Wed, 2021-12-22 at 17:40:17 +0100, Brice Goglin wrote:
>
> On 22/12/2021 at 17:27, Steffen Grunewald wrote:
> > On Wed, 2021-12-22 at 16:02:00 +, Stuart MacLachlan wrote:
> > > Hi Steffan,
> > >
> > >
>
> Will I have to wait until the machines have arrived, and do some experiments,
> or did someone already retrieve the right numbers, and is willing to share?
>
> Thanks,
> Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
13 machines, and perhaps add a single 7713
one - "lscpu" outputs are wildly different, while the total number of cores
is correct.
Will I have to wait until the machines have arrived, and do some experiments,
or did someone already retrieve the right numbers, and is willing to share?
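What I'd pencil in for a dual-socket 7713 with SMT left on - whether that matches the real machines is exactly the open question (hostname made up):

  NodeName=node7713 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2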
On Fri, 2021-12-17 at 13:03:32 +0530, Sudeep Narayan Banerjee wrote:
> Hello All: Can we please restrict one GPU job on one GPU node?
>
> That is,
> a) when we submit a GPU job on an empty node (say gpu2) requesting 16 cores
> as that gives the best performance in the GPU and it gives best perform
ion if the user
> doesn't specify any?
A bit ugly, but what may also work is to (ab)use the topology feature to
group hosts together - topological subdomains get assigned first.
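A sketch of such a topology.conf (node names invented; needs TopologyPlugin=topology/tree in slurm.conf):

  SwitchName=gpugrp Nodes=gpu[01-04]
  SwitchName=cpugrp Nodes=node[001-064]
  SwitchName=root   Switches=gpugrp,cpugrp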
- Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert
daemon is "slurmctld" not "slurmdctl", so you have
rotated (and created) the wrong file, which will not be written to...
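For the record, the logrotate stanza should look roughly like this (the path is whatever SlurmctldLogFile points to):

  /var/log/slurm/slurmctld.log {
      weekly
      rotate 4
      compress
      missingok
      postrotate
          pkill -x --signal SIGUSR2 slurmctld || true
      endscript
  }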
Cheers,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-144
On Thu, 2020-05-14 at 13:10:04 +, Florian Zillner wrote:
> Hi,
>
> I'm experimenting with slurm's power saving feature and shutdown of "idle"
> nodes works in general, also the power up works when "idle~" nodes are
> requested.
> So far so good, but slurm is also shutting down nodes that are
ld need a restart of the slurmctld to
accept new users. I suspect that this has to do with SlurmUser,
StorageUser and AccountingStorageUser all set to "slurm" (this works
on the CentOS-7 HPC cluster next door) - do you have any advice?
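The workflow in question, with made-up names - on the cluster next door this takes effect without any restart:

  sacctmgr -i add account newgroup
  sacctmgr -i add user newuser account=newgroup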
Thanks,
Steffen
--
Steffen Grunewald, Cluster
weird
- "scontrol reconfig" tends to kill the slurmctld
Upgrading to Buster isn't an option yet, and I doubt the issues would
vaporize by upgrading.
Any suggestions?
Thanks,
- S
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Alber
ndication of
a slurmctld crash corresponding to that day.
In any case, the situation apparently has been resolved - I've got to wait
for the daily rollup to fix the old accounting data though.
Thanks a lot!
- Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitat
's no end date)?
Thanks for any suggestion.
- Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
On Fri, 2019-09-27 at 14:58:40 +0200, Rafał Kędziorski wrote:
> On Fri, 27 Sep 2019 at 13:50, Steffen Grunewald <
> steffen.grunew...@aei.mpg.de>:
> > On Fri, 2019-09-27 at 11:19:16 +0200, Juergen Salk wrote:
> > >
> > > you may try setti
On Fri, 2019-09-27 at 11:19:16 +0200, Juergen Salk wrote:
> Hi Rafał,
>
> you may try setting `ReturnToService=2´ in slurm.conf.
>
> Best regards
> Jürgen
Caveat: A spontaneously rebooting machine may create a "black hole" this way.
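For the archives: the relevant knob, plus what one might pair it with to soften that risk (the health-check program is just an example):

  ReturnToService=2                  # DOWN nodes return to service once they register with a valid config
  HealthCheckProgram=/usr/sbin/nhc   # e.g. LBNL NHC, or any site-local check script
  HealthCheckInterval=300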
- Steffen
--
Steffen Grunewal
ot clue about what is happening. Maybe the next thing to try is to
> use "sdiag" to inspect the server. Another complication is that the problem
> is random, so we put "sdiag" in a cronjob? Is there a better way to run
> "sdiag" periodically?
>
>
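(The cronjob variant would be as simple as an /etc/cron.d entry like this - paths assumed:)

  */10 * * * *  root  /usr/bin/sdiag >> /var/log/slurm/sdiag.log 2>&1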
; node?
This may require changing some more environment variables, and may harm
signalling.
Okay, my suggestion reads like a terrible kludge (which it certainly is), but
AFAIK there's no way to tell Slurm about "preferred first nodes".
- S
--
Steffen Grunewald, Cluster Administrat
On Tue, 2018-01-09 at 13:16:12 +0100, Elisabetta Falivene wrote:
> Root file system is on the master. I'm being able to boot the machine
> changing kernel. Grub allow to boot from two kernel:
>
>
> kernel 3.2.0-4-amd64
>
> kernel 3.16.0-4-amd64
>
>
> The problem is with kernel 3.16, but boots
On Tue, 2018-01-09 at 18:25:27 +0800, 刘科 wrote:
> Dear slurm developers:
>
> I alias squeue to squeue -u $(whoami) -o "%.10i %.10P %.20j %.2t %.10M %.9l
> %.6D %R %Z". Now it shows the full path of the working dir, but it seems too
> long to show all the useful messages on one line.
>
> Is there any wa
Hi David,
On Wed, 2017-11-29 at 14:45:06 +, david vilanova wrote:
> Hello,
> I have installed latest 7.11 release and my node is shown as down.
> I have a single physical server with 12 cores so not sure the conf below is
> correct ?? can you help ??
>
> In slurm.conf the node is configured as