Dear Slurm-user list,
Sadly, my question got no answers. If the question is unclear and you
have ideas on how I can improve it, please let me know. We will soon try
updating Slurm to see whether the unwanted behavior disappears with that.
Best regards,
Xaver Stiensmeier
On 11/18/24 12:03, Xaver
Dear Slurm-user list,
when a job fails because the node startup fails (cloud scheduling), the
job should be re-queued:
ResumeTimeout
Maximum time permitted (in seconds) between when a node resume
request is issued and when the node is actually available for use.
Nodes which fail to respond in this time frame will be marked DOWN and
the jobs scheduled on the node requeued.
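For context, the relevant slurm.conf settings in a setup like ours look roughly like this (paths and values are illustrative placeholders, not our actual configuration):
```
# slurm.conf (excerpt) - illustrative placeholder values
ResumeProgram=/opt/slurm/resume.sh     # creates the cloud instance for a node
SuspendProgram=/opt/slurm/suspend.sh   # terminates it again
ResumeTimeout=600                      # seconds to wait for a resumed node to register
SuspendTime=300                        # idle seconds before a node is powered down
```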
*From:* Xaver Stiensmeier via slurm-users
*Sent:* Friday, 15 November 2024 14:03
*To:* slurm-users@lists.schedmd.com
*Subject:* [slurm-users] Re: How to power up all ~idle nodes and
verify that they have started up without issue programmatically
er, that feels a bit clunky and the output is definitely
not perfect as it needs parsing.
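For anyone searching the archives later: one way to make the output easier to parse is an explicit format string, e.g. (a sketch, not the exact invocation we use):
```
# one line per node, no header: "<nodename> <compact state>"
sinfo --Node --noheader --format="%N %t"
# powered-down idle nodes show as "idle~", nodes still powering up as "idle#",
# failed ones typically as "down" or "drained"
```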
Best regards,
Xaver
On 11/14/24 14:36, Xaver Stiensmeier wrote:
Hi Ole,
thank you for your answer!
I apologize for the unclear wording. We have already implemented the
on-demand scheduling.
However, we
program. Of course this includes checking whether the nodes power up.
Best regards,
Xaver
On 14/11/2024 at 14:57, Ole Holm Nielsen via slurm-users wrote:
Hi Xaver,
On 11/14/24 12:59, Xaver Stiensmeier via slurm-users wrote:
I would like to start up all ~idle (idle and powered down) nodes and
Dear Slurm User list,
I would like to start up all ~idle (idle and powered down) nodes and
check programmatically if all came up as expected. For context: this is
for a program that sets up Slurm clusters with on-demand cloud scheduling.
In the simplest fashion, this could be executing a command
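A rough sketch of what such a command sequence could look like (node names are placeholders and the checks are simplified):
```
# list all nodes that are idle and powered down ("idle~" in sinfo's compact state)
sinfo --Node --noheader --format="%N %t" | awk '$2 == "idle~" {print $1}'

# ask Slurm to power a range of them up (POWER_UP is a regular scontrol node state)
scontrol update NodeName=worker[001-010] State=POWER_UP

# afterwards, poll until no node is left powering up ("idle#") and
# check whether any node ended up down or drained instead
sinfo --Node --noheader --format="%N %t"
```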
Hey Nate,
we actually fixed the underlying issue that caused the NOT_RESPONDING
flag: on failures we terminated the node manually instead of letting
Slurm call the terminate script. That led to Slurm believing the node
should still be there when it had already been terminated.
Therefore,
Thanks Steffen,
that makes a lot of sense. I will simply not start slurmd in the master
Ansible role when the master is not to be used for computing.
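Outside of Ansible, that change boils down to something like the following on the master (assuming slurmd is managed by systemd):
```
# keep slurmctld running, but never start the compute daemon on the master
systemctl disable --now slurmd
```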
Best regards,
Xaver
On 24.06.24 14:23, Steffen Grunewald via slurm-users wrote:
On Mon, 2024-06-24 at 13:54:43 +0200, Slurm users wrote:
Dear Slu
Dear Slurm users,
in our project we exclude the master from computing before starting
slurmctld. We used to do this by simply not mentioning the master in the
configuration, i.e. just not having:
PartitionName=SomePartition Nodes=master
or something similar. Apparently, thi
Dear slurm-user list,
as far as I understood it, slurm.conf needs to be present on the master
and on the workers at the default slurm.conf path (if no other path is
set via SLURM_CONF). However, I noticed that when adding a partition
only in the master's slurm.conf, all workers were able to "correctly"
show t
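One way to verify that the file really is identical everywhere is to compare checksums; a sketch (pdsh and the node range are placeholders for whatever parallel shell and node names you have):
```
# checksum of slurm.conf on the machine running slurmctld ...
sha256sum /etc/slurm/slurm.conf
# ... and on all workers; any differing hash points to an out-of-sync config
pdsh -w worker[001-010] sha256sum /etc/slurm/slurm.conf
```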
any hot-fix/updates
from the base image or changes. By running it from the node, it would
alleviate any CPU spikes on the Slurm head node.
Just a possible path to look at.
Brian Andrus
On 4/8/2024 6:10 AM, Xaver Stiensmeier via slurm-users wrote:
Dear slurm user list,
we make use of elast
Dear slurm user list,
we make use of elastic cloud computing, i.e. node instances are created
on demand and destroyed when they have not been used for a certain
amount of time. Created instances are set up via Ansible. If more than
one instance is requested at the exact same time, Slurm will pass th
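For context, Slurm may hand the ResumeProgram a hostlist expression covering several nodes at once, so the script has to expand it itself. A minimal sketch (the echo stands in for the real create/Ansible call):
```
#!/bin/bash
# ResumeProgram sketch: $1 is a hostlist expression such as "worker[001-003]"
for node in $(scontrol show hostnames "$1"); do
    echo "starting cloud instance for ${node}"   # placeholder for the actual provisioning
done
```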
ify how to handle "NOT_RESPONDING".
I would really like to improve my question if necessary.
Best regards,
Xaver
On 23.02.24 18:55, Xaver Stiensmeier wrote:
Dear slurm-user list,
I have a cloud node that is powered up and down on demand. Rarely it
can happen that slurm's resumeT
Dear slurm-user list,
I have a cloud node that is powered up and down on demand. Rarely it can
happen that Slurm's ResumeTimeout is reached and the node is therefore
powered down. We have set ReturnToService=2 in order to avoid the node
being marked down, because the instance behind that node is
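For readers not familiar with the setting, this is roughly what it looks like in slurm.conf (value as documented, the rest of the configuration omitted):
```
# slurm.conf (excerpt)
ReturnToService=2   # a DOWN node becomes available again once slurmd registers with a valid configuration
```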
Dear slurm-user list,
I had cases where our ResumeProgram failed due to temporary cloud
timeouts. In that case the ResumeProgram returns a non-zero value. Why
does Slurm still wait until ResumeTimeout instead of just accepting the
startup as failed, which should then lead to a rescheduling of the job
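As far as I can tell, Slurm does not act on the ResumeProgram's exit code at all; one workaround sometimes used is to have the script itself flag the node as failed so the job is rescheduled earlier. A rough, untested sketch:
```
#!/bin/bash
# at the end of the ResumeProgram, if provisioning the instance(s) failed:
# mark the node(s) down so Slurm requeues the jobs instead of waiting for ResumeTimeout
scontrol update NodeName="$1" State=DOWN Reason="cloud resume failed"
```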
Thank you for your response.
I have found out why there was no error in the log: I had been looking
at the wrong log. The error didn't occur on the master, but on our
VPN gateway (it is a hybrid cloud setup) - but you can think of it as
just another worker in the same network. The error I get
Dear slurm-user list,
I got this error:
Unable to start service slurmctld: Job for slurmctld.service failed
because the control process exited with error code. See "systemctl
status slurmctld.service" and "journalctl -xeu slurmctld.service" for
details.
but in slurmctld.service I see nothi
are getting filled on the node. You can run 'df
-h' and see some info that would get you started.
Brian Andrus
On 12/8/2023 7:00 AM, Xaver Stiensmeier wrote:
Dear slurm-user list,
during a larger cluster run (the same one I mentioned earlier, 242 nodes),
I got the error "SlurmdSpool
Slurmd is placing in this dir that fills
up the space. Do you have any ideas? Due to the workflow used, we have a
hard time reconstructing the exact scenario that caused this error. I
guess the "fix" is to just pick a somewhat larger disk, but I am unsure
whether Slurm behaves normally here.
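For reference, the directory in question and its usage can be checked like this (the path shown is the usual default and may differ in your configuration):
```
# where does SlurmdSpoolDir point on this node, and how full is it?
scontrol show config | grep SlurmdSpoolDir
du -sh /var/spool/slurmd
df -h /var/spool/slurmd
```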
or not, but it's worth a try.
Best regards
Xaver
On 06.12.23 12:03, Ole Holm Nielsen wrote:
On 12/6/23 11:51, Xaver Stiensmeier wrote:
Good idea. Here's our current version:
```
sinfo -V
slurm 22.05.7
```
Quick googling told me that the latest version is 23.11. Does the
upgrade chang
m may matter for your power saving experience. Do
you run an updated version?
/Ole
On 12/6/23 10:54, Xaver Stiensmeier wrote:
Hi Ole,
I will double check, but I am very sure that giving a reason is possible
as it has been done at least 20 other times without error during that
exact run. It mig
0, Ole Holm Nielsen wrote:
Hi Xaver,
On 12/6/23 09:28, Xaver Stiensmeier wrote:
using https://slurm.schedmd.com/power_save.html we had one case out
of many (>242) node starts that resulted in
|slurm_update error: Invalid node state specified|
when we called:
|scontrol update NodeN
rs.
Maybe someone has a great idea how to tackle this problem.
Best regards
Xaver Stiensmeier
emctl restart slurmd  # master
ran without any issues afterwards.
Thank you for all your help!
Best regards,
Xaver
On 19.07.23 17:05, Xaver Stiensmeier wrote:
Hi Hermann,
Count doesn't make a difference, but I noticed that when I reconfigure
Slurm and do reloads afterwards, the er
I think you are missing the "Count=..." part in gres.conf. It should read
NodeName=NName Name=gpu File=/dev/tty0 Count=1
in your case.
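For the archives: gres.conf and slurm.conf have to agree on the resource. A minimal matching pair could look like this (device path taken from the example above, everything else illustrative):
```
# gres.conf on the node
NodeName=NName Name=gpu File=/dev/tty0 Count=1

# slurm.conf - the node definition must advertise the same GRES
GresTypes=gpu
NodeName=NName Gres=gpu:1 ...
```
A job would then request it with something like `srun --gres=gpu:1 ...`.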
Regards,
Hermann
On 7/19/23 14:19, Xaver Stiensmeier wrote:
Okay,
thanks to S. Zhang I was able to figure out why nothing changed.
While I did resta
further. I am thankful for any ideas in that regard.
Best regards,
Xaver
On 19.07.23 10:23, Xaver Stiensmeier wrote:
Alright,
I tried a few more things, but I still wasn't able to get past:
`srun: error: Unable to allocate resources: Invalid generic resource (gres) specification`.
I should
-----------
*From:* slurm-users on behalf
of Xaver Stiensmeier
*Sent:* Monday, July 17, 2023 9:43 AM
*To:* slurm-users@lists.schedmd.com
*Subject:* Re: [slurm-users] GRES and GPUs
Hi Hermann,
Good idea, but we are already using `SelectType=select/cons_tr
just for testing purposes. Could this be the issue?
Best regards,
Xaver Stiensmeier
On 17.07.23 14:11, Hermann Schwärzler wrote:
Hi Xaver,
what kind of SelectType are you using in your slurm.conf?
Per https://slurm.schedmd.com/gres.html you have to consider:
"As for the --gpu* option, the
(GPU, MPS, MIG) and using one of
those didn't work in my case.
Obviously, I am misunderstanding something, but I am unsure where to look.
Best regards,
Xaver Stiensmeier
:
Allowing all nodes to be powered up, but without automatic suspending
of some nodes except when a power down is triggered manually.
---
I tried using negative values for SuspendTime, but that didn't seem to
work, as no nodes are powered up then.
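For anyone with the same goal: slurm.conf has a dedicated setting for excluding individual nodes from automatic suspension while keeping power saving on globally; a sketch with placeholder names:
```
# slurm.conf (excerpt)
SuspendTime=300                          # automatic power-down after 300 idle seconds ...
SuspendExcNodes=master,worker[001-002]   # ... except for these nodes
# excluded nodes can still be powered down explicitly:
#   scontrol update NodeName=worker001 State=POWER_DOWN
```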
Best regards,
Xaver Stiensmeier
both partitions and allocates
all 8 nodes.
Best regards,
Xaver Stiensmeier
question as my question asks
how to have multiple default partitions which could include having
others that are not default.
Best regards,
Xaver Stiensmeier
On 17.04.23 11:12, Xaver Stiensmeier wrote:
Dear slurm-users list,
is it possible to somehow have two default partitions? In the best cas
Dear slurm-users list,
is it possible to somehow have two default partitions? In the best case,
in a way that Slurm schedules to partition1 by default and only to
partition2 when partition1 can't handle the job right now.
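Not exactly two default partitions, but possibly close enough: a job can be submitted to a comma-separated list of partitions and Slurm uses the one that can start it earliest (partition names here are placeholders):
```
sbatch --partition=partition1,partition2 job.sh
```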
Best regards,
Xaver Stiensmeier
or
were larger instances started than needed? ...
I know that this question is currently very open-ended, but I am still
trying to narrow down where I have to look. The final goal is, of course,
to use this evaluation to pick better timeout values and improve cloud
scheduling.
Best regards,
Xaver Stiensmeier
nodes.
So I am basically looking for custom requirements.
Best regards,
Xaver Stiensmeier
n" in `JobSubmitPlugins`
and this might be the solution. However, I think this is something so
basic that it probably shouldn't need a plugin, so I am unsure.
Can anyone point me towards how setting the default partition is done?
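For completeness, the cluster-wide default partition itself is set in slurm.conf; a minimal sketch with placeholder names:
```
# slurm.conf: jobs that do not request a partition go to "main"
PartitionName=main  Nodes=worker[001-010] Default=YES
PartitionName=extra Nodes=worker[011-020]
```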
Best regards,
Xaver Stiensmeier
am just stating this to be
maximally explicit.
Best regards,
Xaver Stiensmeier
PS: This is the first time I am using the slurm-users list, and I hope I am
not violating any rules with this question. Please let me know if I do.