[slurm-users] Re: Why is my job killed when ResumeTimeout is reached instead of it being requeued?

2024-12-09 Thread Xaver Stiensmeier via slurm-users
Dear Slurm-user list, Sadly, my question got no answers. If the question is unclear and you have ideas on how I can improve it, please let me know. We will soon try to update Slurm to see whether the unwanted behavior disappears with that. Best regards, Xaver Stiensmeier On 11/18/24 12:03, Xaver

[slurm-users] Why is my job killed when ResumeTimeout is reached instead of it being requeued?

2024-11-18 Thread Xaver Stiensmeier via slurm-users
Dear Slurm-user list, when a job fails because the node startup fails (cloud scheduling), the job should be re-queued. The documentation for ResumeTimeout says: "Maximum time permitted (in seconds) between when a node resume request is issued and when the node is actually available for use. Nodes which fail to
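
For reference, the setting being quoted is ResumeTimeout from slurm.conf. A minimal power-saving sketch for context (paths and values are assumptions, not taken from the thread):

```
# slurm.conf (cloud scheduling / power saving, example values)
ResumeProgram=/opt/cluster/resume.sh      # creates the cloud instance(s)
SuspendProgram=/opt/cluster/suspend.sh    # destroys idle instance(s)
SuspendTime=300                           # idle seconds before a node is powered down
ResumeTimeout=600                         # seconds a node may take to come up before it is marked DOWN
SuspendTimeout=120
```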

[slurm-users] Re: How to power up all ~idle nodes and verify that they have started up without issue programmatically

2024-11-15 Thread Xaver Stiensmeier via slurm-users
.de>| www.igd.fraunhofer.de <http://www.igd.fraunhofer.de/> *From:* Xaver Stiensmeier via slurm-users *Sent:* Friday, 15 November 2024 14:03 *To:* slurm-users@lists.schedmd.com *Subject:* [slurm-users] Re: How to power up all ~idle nodes and verify that they have started up without issue programmatical

[slurm-users] Re: How to power up all ~idle nodes and verify that they have started up without issue programmatically

2024-11-15 Thread Xaver Stiensmeier via slurm-users
er, that feels a bit clunky and the output is definitely not perfect as it needs parsing. Best regards, Xaver On 11/14/24 14:36, Xaver Stiensmeier wrote: Hi Ole, thank you for your answer! I apologize for the unclear wording. We have already implemented the on demand scheduling. However, we

[slurm-users] Re: How to power up all ~idle nodes and verify that they have started up without issue programmatically

2024-11-14 Thread Xaver Stiensmeier via slurm-users
program. Of course this includes checking whether the nodes power up. Best regards, Xaver On 14/11/2024 at 14:57, Ole Holm Nielsen via slurm-users wrote: Hi Xaver, On 11/14/24 12:59, Xaver Stiensmeier via slurm-users wrote: I would like to start up all ~idle (idle and powered down) nodes and

[slurm-users] How to power up all ~idle nodes and verify that they have started up without issue programmatically

2024-11-14 Thread Xaver Stiensmeier via slurm-users
Dear Slurm User list, I would like to start up all ~idle (idle and powered down) nodes and check programmatically whether all came up as expected. For context: this is for a program that sets up Slurm clusters with on-demand cloud scheduling. In the easiest fashion this could be executing a comma
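
One possible shape of such a check, as a hedged shell sketch (the state strings come from sinfo's compact %t format; the sleep value is an assumption roughly matching ResumeTimeout):

```
#!/bin/bash
# Power up every node that is idle and powered down ("idle~" in sinfo's %t output),
# then report nodes that ended up in a down state instead of coming back idle.
nodes=$(sinfo -h -N -o "%N %t" | awk '$2 == "idle~" {print $1}' | sort -u)
for n in $nodes; do
    scontrol update NodeName="$n" State=POWER_UP
done
sleep 600   # roughly ResumeTimeout; adjust to the cluster's value
sinfo -h -N -o "%N %t" | awk '$2 ~ /down/ {print $1 " failed to start"}'
```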

[slurm-users] Re: Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING

2024-09-20 Thread Xaver Stiensmeier via slurm-users
Hey Nate, we actually fixed our underlying issue that caused the NOT_RESPONDING flag - on failures we terminated the node manually ourselves instead of letting Slurm call the terminate script. That led to Slurm believing the node should still be there when it had already been terminated. Therefore,

[slurm-users] Re: How to exclude master from computing? Set to DRAINED?

2024-06-24 Thread Xaver Stiensmeier via slurm-users
Thanks Steffen, that makes a lot of sense. I will just not start slurmd in the master Ansible role when the master is not to be used for computing. Best regards, Xaver On 24.06.24 14:23, Steffen Grunewald via slurm-users wrote: On Mon, 2024-06-24 at 13:54:43 +0200, Slurm users wrote: Dear Slu

[slurm-users] How to exclude master from computing? Set to DRAINED?

2024-06-24 Thread Xaver Stiensmeier via slurm-users
Dear Slurm users, in our project we exclude the master from computing before starting slurmctld. We used to exclude the master from computing by simply not mentioning it in the configuration, i.e. just not having: PartitionName=SomePartition Nodes=master or something similar. Apparently, thi
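
For context, one common layout keeps the master defined as a node but outside every partition, and simply never starts slurmd on it. A sketch (hostnames and resources are placeholders, not from the thread):

```
# slurm.conf sketch
NodeName=master CPUs=4 RealMemory=8000            # defined, but listed in no partition
NodeName=worker[001-010] CPUs=16 RealMemory=64000
PartitionName=SomePartition Nodes=worker[001-010] Default=YES MaxTime=INFINITE State=UP
```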

[slurm-users] Slurm.conf and workers

2024-04-15 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, as far as I understood it, the slurm.conf needs to be present on the master and on the workers at the default slurm.conf location (if no other path is set via SLURM_CONF). However, I noticed that when adding a partition only in the master's slurm.conf, all workers were able to "correctly" show t
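
Two ways workers commonly obtain the configuration, shown as a sketch (the controller hostname is a placeholder): a non-default path via SLURM_CONF, or configless mode (Slurm 20.02+), where slurmd fetches slurm.conf from slurmctld so the copies cannot drift apart.

```
# Option 1: point daemons and commands at a shared, non-default path
export SLURM_CONF=/shared/slurm/slurm.conf

# Option 2: configless mode
#   on the controller, in slurm.conf:   SlurmctldParameters=enable_configless
#   on the workers, start slurmd with:  slurmd --conf-server controller-host:6817
```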

[slurm-users] Re: Elastic Computing: Is it possible to incentivize grouping power_up calls?

2024-04-09 Thread Xaver Stiensmeier via slurm-users
any hot-fix/updates from the base image or changes. By running it from the node, it would alleviate any CPU spikes on the Slurm head node. Just a possible path to look at. Brian Andrus On 4/8/2024 6:10 AM, Xaver Stiensmeier via slurm-users wrote: Dear slurm user list, we make use of elast

[slurm-users] Elastic Computing: Is it possible to incentivize grouping power_up calls?

2024-04-08 Thread Xaver Stiensmeier via slurm-users
Dear slurm user list, we make use of elastic cloud computing i.e. node instances are created on demand and are destroyed when they are not used for a certain amount of time. Created instances are set up via Ansible. If more than one instance is requested at the exact same time, Slurm will pass th
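
Related detail: when slurmctld does resume several nodes in one scheduling pass, ResumeProgram receives a single hostlist expression, so the program can provision the whole batch in one Ansible run. A minimal sketch (script and playbook paths are assumptions):

```
#!/bin/bash
# ResumeProgram sketch: $1 is a hostlist expression such as "cloud[001-004]".
hosts=$(scontrol show hostnames "$1")   # expand to one hostname per line
inventory=$(mktemp)
printf '%s\n' $hosts > "$inventory"
# One provisioning run for the whole batch instead of one run per node:
ansible-playbook -i "$inventory" /opt/cluster/provision.yml
rm -f "$inventory"
```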

[slurm-users] Re: Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING

2024-02-29 Thread Xaver Stiensmeier via slurm-users
ify how to handle "NOT_RESPONDING". I would really like to improve my question if necessary. Best regards, Xaver On 23.02.24 18:55, Xaver Stiensmeier wrote: Dear slurm-user list, I have a cloud node that is powered up and down on demand. Rarely it can happen that slurm's resumeT

[slurm-users] Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING

2024-02-23 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, I have a cloud node that is powered up and down on demand. Rarely, it can happen that Slurm's ResumeTimeout is reached and the node is therefore powered down. We have set ReturnToService=2 in order to avoid the node being marked down, because the instance behind that node is
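
For readers searching the archive, the relevant slurm.conf pieces look roughly like this (a sketch; node names and sizes are placeholders):

```
ReturnToService=2                         # a DOWN node returns to service once it registers with a valid configuration
NodeName=cloud[001-020] State=CLOUD CPUs=8 RealMemory=16000
PartitionName=cloud Nodes=cloud[001-020] MaxTime=INFINITE State=UP
```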

[slurm-users] Slurm Power Saving Guide: Why doesn't Slurm mark the startup as failed when ResumeProgram returns != 0

2024-02-19 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, I had cases where our ResumeProgram failed due to temporary cloud timeouts. In that case the ResumeProgram returns a value != 0. Why does Slurm still wait until ResumeTimeout instead of just accepting the startup as failed, which should then lead to a rescheduling of the job
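
One workaround sometimes used for this (a sketch, not an answer given in the thread; the provisioning script is hypothetical) is to let the ResumeProgram mark the affected nodes DOWN itself when provisioning fails, so the job is released for rescheduling immediately rather than after ResumeTimeout:

```
#!/bin/bash
# ResumeProgram wrapper sketch: $1 is the hostlist of nodes to power up.
if ! /opt/cluster/start_instances.sh "$1"; then   # hypothetical provisioning script
    for n in $(scontrol show hostnames "$1"); do
        scontrol update NodeName="$n" State=DOWN Reason="cloud start failed"
    done
    exit 1
fi
```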

[slurm-users] Re: Errors upgrading to 23.11.0 -- jwt-secret.key

2024-02-08 Thread Xaver Stiensmeier via slurm-users
Thank you for your response. I have found out why there was no error in the log: I've been looking at the wrong log. The error didn't occur on the master, but on our vpn-gateway (it is a hybrid cloud setup) - but you can think of it as just another worker in the same network. The error I get

[slurm-users] Errors upgrading to 23.11.0

2024-02-07 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, I got this error: Unable to start service slurmctld: Job for slurmctld.service failed because the control process exited with error code. See "systemctl status slurmctld.service" and "journalctl -xeu slurmctld.service" for details. but in slurmctld.service I see nothi

Re: [slurm-users] SlurmdSpoolDir full

2023-12-10 Thread Xaver Stiensmeier
are getting filled on the node. You can run 'df -h' and see some info that would get you started. Brian Andrus On 12/8/2023 7:00 AM, Xaver Stiensmeier wrote: Dear slurm-user list, during a larger cluster run (the same I mentioned earlier 242 nodes), I got the error "SlurmdSpool

[slurm-users] SlurmdSpoolDir full

2023-12-08 Thread Xaver Stiensmeier
Slurmd is placing in this dir that fills up the space. Do you have any ideas? Due to the workflow used, we have a hard time reconstructing the exact scenario that caused this error. I guess the "fix" is to just pick a somewhat larger disk, but I am unsure whether Slurm behaves normally here.
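
To see what is actually filling the spool directory, a few commands along these lines usually narrow it down (the path shown is only an example; use whatever scontrol reports):

```
scontrol show config | grep -i SlurmdSpoolDir   # where slurmd keeps job scripts, cached files, etc.
df -h /var/spool/slurmd                          # free space on that filesystem
du -sh /var/spool/slurmd/* | sort -h             # which entries are the large ones
```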

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier
or not, but it's worth a try. Best regards Xaver On 06.12.23 12:03, Ole Holm Nielsen wrote: On 12/6/23 11:51, Xaver Stiensmeier wrote: Good idea. Here's our current version: ``` sinfo -V slurm 22.05.7 ``` Quick googling told me that the latest version is 23.11. Does the upgrade chang

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier
m may matter for your power saving experience.  Do you run an updated version? /Ole On 12/6/23 10:54, Xaver Stiensmeier wrote: Hi Ole, I will double check, but I am very sure that giving a reason is possible as it has been done at least 20 other times without error during that exact run. It mig

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier
0, Ole Holm Nielsen wrote: Hi Xavier, On 12/6/23 09:28, Xaver Stiensmeier wrote: using https://slurm.schedmd.com/power_save.html we had one case out of many (>242) node starts that resulted in "slurm_update error: Invalid node state specified" when we called: "scontrol update NodeN

[slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier
rs. Maybe someone has a great idea how to tackle this problem. Best regards Xaver Stiensmeier
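
One defensive pattern (a sketch, not a confirmed fix from the thread) is to log the node's current state before issuing RESUME and to retry once when scontrol rejects the update:

```
#!/bin/bash
node="$1"
state=$(sinfo -h -n "$node" -o "%t")
echo "node $node is currently in state: $state"
if ! scontrol update NodeName="$node" State=RESUME; then
    sleep 5
    scontrol update NodeName="$node" State=RESUME
fi
```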

Re: [slurm-users] GRES and GPUs

2023-07-20 Thread Xaver Stiensmeier
emctl restart slurmd* # master run without any issues afterwards. Thank you for all your help! Best regards, Xaver On 19.07.23 17:05, Xaver Stiensmeier wrote: Hi Hermann, count doesn't make a difference, but I noticed that when I reconfigure slurm and do reloads afterwards, the er

Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Xaver Stiensmeier
I think you are missing the "Count=..." part in gres.conf. It should read NodeName=NName Name=gpu File=/dev/tty0 Count=1 in your case. Regards, Hermann On 7/19/23 14:19, Xaver Stiensmeier wrote: Okay, thanks to S. Zhang I was able to figure out why nothing changed. While I did resta
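
Putting the pieces from this thread together, the matching configuration plus a quick test might look like this (a sketch; /dev/tty0 mirrors the test setup discussed here, a real GPU node would point at /dev/nvidia0 or similar):

```
# gres.conf on the node
NodeName=NName Name=gpu File=/dev/tty0 Count=1

# slurm.conf (shared)
GresTypes=gpu
NodeName=NName Gres=gpu:1 CPUs=4 RealMemory=8000

# quick check after restarting slurmctld and slurmd
srun --gres=gpu:1 -N1 hostname
```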

Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Xaver Stiensmeier
further. I am thankful for any ideas in that regard. Best regards, Xaver On 19.07.23 10:23, Xaver Stiensmeier wrote: Alright, I tried a few more things, but I still wasn't able to get past: srun: error: Unable to allocate resources: Invalid generic resource (gres) specification. I should

Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Xaver Stiensmeier
----------- *From:* slurm-users on behalf of Xaver Stiensmeier *Sent:* Monday, July 17, 2023 9:43 AM *To:* slurm-users@lists.schedmd.com *Subject:* Re: [slurm-users] GRES and GPUs Hi Hermann, Good idea, but we are already using `SelectType=select/cons_tr

Re: [slurm-users] GRES and GPUs

2023-07-17 Thread Xaver Stiensmeier
just for testing purposes. Could this be the issue? Best regards, Xaver Stiensmeier On 17.07.23 14:11, Hermann Schwärzler wrote: Hi Xaver, what kind of SelectType are you using in your slurm.conf? Per https://slurm.schedmd.com/gres.html you have to consider: "As for the --gpu* option, the

[slurm-users] GRES and GPUs

2023-07-17 Thread Xaver Stiensmeier
(GPU, MPS, MIG) and using one of those didn't work in my case. Obviously, I am misunderstanding something, but I am unsure where to look. Best regards, Xaver Stiensmeier

[slurm-users] Prevent CLOUD node from being shutdown after startup

2023-05-12 Thread Xaver Stiensmeier
: Allowing all nodes to be powered up, but without automatic suspending for some nodes except when triggering power down manually. --- I tried using negative times for SuspendTime, but that didn't seem to work as no nodes are powered up then. Best regards, Xaver Stiensmeier
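
For the record, the slurm.conf knob that usually covers this is SuspendExcNodes (node names below are placeholders):

```
# Nodes listed here are never suspended automatically; they can still be
# powered down manually with: scontrol update NodeName=<node> State=POWER_DOWN
SuspendExcNodes=cloud-keepalive[001-002]
# SuspendExcParts=keepalive    # alternatively, exempt a whole partition
```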

[slurm-users] Submit sbatch to multiple partitions

2023-04-17 Thread Xaver Stiensmeier
both partitions and allocates all 8 nodes. Best regards, Xaver Stiensmeier
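
For anyone finding this thread later: sbatch accepts a comma-separated partition list, and the job starts in whichever listed partition can run it first (the script name is a placeholder):

```
sbatch --partition=partition1,partition2 job.sh
# or inside the batch script:
#SBATCH --partition=partition1,partition2
```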

Re: [slurm-users] Multiple default partitions

2023-04-17 Thread Xaver Stiensmeier
question as my question asks how to have multiple default partitions which could include having others that are not default. Best regards, Xaver Stiensmeier On 17.04.23 11:12, Xaver Stiensmeier wrote: Dear slurm-users list, is it possible to somehow have two default partitions? In the best cas

[slurm-users] Multiple default partitions

2023-04-17 Thread Xaver Stiensmeier
Dear slurm-users list, is it possible to somehow have two default partitions? In the best case in a way that slurm schedules to partition1 on default and only to partition2 when partition1 can't handle the job right now. Best regards, Xaver Stiensmeier

[slurm-users] Evaluation: How to collect data regarding Slurm's cloud scheduling performance?

2023-02-28 Thread Xaver Stiensmeier
or were larger instances started than needed? ... I know that this question is currently very open, but I am still trying to narrow down where I have to look. The final goal is of course to use this evaluation to pick better timeout values and improve cloud scheduling. Best regards, Xaver Stiensmeier

[slurm-users] Request nodes with a custom resource?

2023-02-05 Thread Xaver Stiensmeier
nodes. So I am basically looking for custom requirements. Best regards, Xaver Stiensmeier
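
One standard way to express such custom requirements is node Features combined with --constraint (a sketch; feature names are placeholders):

```
# slurm.conf: tag nodes with arbitrary feature strings
NodeName=worker[001-004] Features=largeimage,ssd CPUs=8 RealMemory=16000

# job submission: only run on nodes carrying a given feature
sbatch --constraint=largeimage job.sh
```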

[slurm-users] How to set default partition in slurm configuration

2023-01-25 Thread Xaver Stiensmeier
n" in `JobSubmitPlugins` and this might be the solution. However, I think this is something so basic that it probably shouldn't need a plugin so I am unsure. Can anyone point me towards how setting the default partition is done? Best regards, Xaver Stiensmeier

[slurm-users] Slurm: Handling nodes that fail to POWER_UP in a cloud scheduling system

2022-11-23 Thread Xaver Stiensmeier
am just stating this to be maximally explicit. Best regards, Xaver Stiensmeier PS: This is the first time I am using the slurm-user list and I hope I am not violating any rules with this question. Please let me know if I do.
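
Slurm's power-saving framework also provides ResumeFailProgram, which is invoked with the hostlist of nodes that did not come up within ResumeTimeout and gives a hook to clean up half-created instances (a sketch; the script path is a placeholder):

```
# slurm.conf
ResumeFailProgram=/opt/cluster/resume_fail.sh
```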