[slurm-users] Re: node3 not working - down

2024-12-09 Thread Chris Samuel via slurm-users
On 9/12/24 5:44 pm, Steven Jones via slurm-users wrote:
[2024-12-09T23:38:56.645] error: Munge decode failed: Rewound credential
[2024-12-09T23:38:56.645] auth/munge: _print_cred: ENCODED: Tue Dec 10 23:38:30 2024
[2024-12-09T23:38:56.645] auth/munge: _print_cred: DECODED: Mon Dec 09 23:38:56
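
"Rewound credential" means the credential's ENCODED timestamp (set by the sending host) is later than the clock on the decoding host, which munge rejects; the log above shows the encoding clock a day ahead of the decoding one. A minimal sketch of how the clocks might be compared, assuming chrony is the time daemon and passwordless ssh to node3 (hostname taken from the thread):

date -u; ssh node3 date -u          # compare wall clocks side by side
chronyc tracking                    # NTP sync status on this host
ssh node3 chronyc tracking          # and on the suspect node (use ntpq -p if ntpd is in use instead)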

[slurm-users] node3 not working - down

2024-12-09 Thread Steven Jones via slurm-users
Hi, As suggested,
8><---
Stop their services, start them manually one by one (ctld first), then watch whether they talk to each other, and if they don't, learn what stops them from doing so - then iterate editing the config, "scontrol reconfig", lather, rinse, repeat.
8><---
Error logs, On node
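
A minimal sketch of that manual-start loop, assuming systemd units and the usual daemon names (slurmctld on the controller, slurmd on each compute node):

systemctl stop slurmctld            # on the controller
slurmctld -D -vvv                   # run it in the foreground with verbose logging

systemctl stop slurmd               # on one compute node, in a second terminal
slurmd -D -vvv                      # watch whether it registers with the controller

scontrol reconfigure                # after each slurm.conf edit, push the change to the running daemons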

[slurm-users] Re: slurm nodes showing down*

2024-12-09 Thread Steven Jones via slurm-users
Is the slurm version critical?
[root@node3 /]# sinfo -V
slurm 20.11.9
[root@node3 /]# uname -a
Linux node3.ods.vuw.ac.nz 4.18.0-553.30.1.el8_10.x86_64 #1 SMP Tue Nov 26 18:56:25 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
[root@node3 /]#
[root@vuwunicoslurmd1 log]# sinfo -V
slurm 22.05.9
[root@vuwun
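
The output above shows slurmd 20.11.9 on node3 against 22.05.9 on the controller. Slurm daemons only interoperate across a limited version window (slurmctld must be at least as new as every slurmd, and at most two major releases newer), so a mixed 20.11/22.05 setup is worth auditing. A quick version survey, assuming passwordless ssh and the node names used in this thread:

slurmctld -V                        # on the controller
for h in node1 node2 node3 node4 node5 node6 node7; do
    echo -n "$h: "; ssh "$h" slurmd -V
done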

[slurm-users] Re: slurm nodes showing down*

2024-12-09 Thread Steven Jones via slurm-users
I cannot get node3 to work. After some minutes 4~6 stop, but that appears to be munge sulking. Node7 never works; it seems the hwclock is faulty and I can't set it, so I'll ignore it. My problem is node3: I can't fathom why, when 1 & 2 run, 3 won't work with slurm; it doesn't appear to be munge. [root@
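
A sketch of how one might ask the controller why node3 is down and check the node's clocks, assuming root access and the hostnames above:

sinfo -R                                        # list down/drained nodes with the recorded reason
scontrol show node node3 | grep -i -e state -e reason
ssh node3 'hwclock --show; date; timedatectl'   # compare hardware clock, system clock and sync status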

[slurm-users] AllocNode:Sid in scontrol but not sacct?

2024-12-09 Thread Chris Taylor via slurm-users
Does the accounting database keep this? Maybe I'm missing something but I don't see a way to query for it in sacct. Chris -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
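
One way to check which fields slurmdbd actually exposes is to dump sacct's field list; the sketch below only shows how to search it and is not a claim that AllocNode:Sid appears there (the job id is illustrative):

sacct --helpformat                                  # all reportable fields
sacct --helpformat | tr ' ' '\n' | grep -i -e alloc -e node
sacct -j 12345 -o JobID,User,NodeList,AllocNodes,Submit,Start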

[slurm-users] Re: slurm nodes showing down*

2024-12-09 Thread Steven Jones via slurm-users
Hi, I have fixed a time skew. Nodes are still down, so it wasn't time skew. I have run tests as per the munge docs and it all looks OK.
[root@node1 ~]# munge -n | unmunge | grep STATUS
STATUS: Success (0)
[root@node1 ~]#
[root@node1 ~]# munge -n | unmunge
STATUS: Success (0)
ENCODE_H
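
The tests above only decode a credential on the same host that encoded it. The munge documentation's cross-host test is what catches a mismatched key or clock skew between machines; a sketch, assuming ssh between node1 and node3 and root access to read the key:

munge -n | ssh node3 unmunge            # encode locally, decode on node3
ssh node3 munge -n | unmunge            # and the reverse
md5sum /etc/munge/munge.key; ssh node3 md5sum /etc/munge/munge.key   # the key must be identical on every host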

[slurm-users] Re: error and output files

2024-12-09 Thread Davide DelVento via slurm-users
Mmmm, from https://slurm.schedmd.com/sbatch.html
> By default both standard output and standard error are directed to a file of the name "slurm-%j.out", where the "%j" is replaced with the job allocation number.
Perhaps at your site there's a configuration which uses separate error files? See the
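
For comparison, this is what explicitly separated output and error files look like in a batch script (a minimal sketch; the file name patterns are illustrative, %j is the job id and %x the job name):

#!/bin/bash
#SBATCH --output=%x-%j.out      # stdout; with no directives at all, everything goes to slurm-%j.out
#SBATCH --error=%x-%j.err       # stderr; omit this line and stderr is merged into the output file
#SBATCH --time=00:05:00

hostname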

[slurm-users] Re: Why is my job killed when ResumeTimeout is reached instead of it being requeued?

2024-12-09 Thread Xaver Stiensmeier via slurm-users
Dear Slurm-user list, Sadly, my question got no answers. If the question is unclear and you have ideas how I can improve it, please let me know. We will soon try to update Slurm to see if the unwanted behavior disappears with that. Best regards, Xaver Stiensmeier On 11/18/24 12:03, Xaver Stiens
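
For context, these are the slurm.conf knobs usually involved when power-saved or cloud nodes fail to resume; the values and script paths below are illustrative only, not a statement about why the job is killed instead of requeued:

SuspendTime=600                                   # idle seconds before a node is powered down
SuspendProgram=/usr/local/sbin/suspend.sh         # site-specific script (hypothetical path)
ResumeProgram=/usr/local/sbin/resume.sh           # site-specific script (hypothetical path)
ResumeTimeout=300                                 # seconds a node may take to come back before it is marked DOWN
ResumeFailProgram=/usr/local/sbin/resume_fail.sh  # optional hook run when ResumeTimeout is exceeded (hypothetical path)
JobRequeue=1                                      # allow jobs to be requeued by default

A job can also request requeueing explicitly with "sbatch --requeue job.sh".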

[slurm-users] Re: slurm nodes showing down*

2024-12-09 Thread Steffen Grunewald via slurm-users
Hi, On Sun, 2024-12-08 at 21:57:11 +, Slurm users wrote:
> I have just rebuilt all my nodes and I see
Did they ever work before with Slurm? (Which version?)
> Only 1 & 2 seem available?
> While 3~6 are not
Either you didn't wait long enough (5 minutes should be sufficient), or the "down*"
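
The trailing "*" in "down*" means the node is not responding to slurmctld. Once the underlying cause is fixed (slurmd not running, version or munge mismatch, clock skew), the node typically has to be returned to service by hand; a minimal sketch, using node3 from this thread:

systemctl status slurmd                       # on the node: is slurmd running at all?
tail -n 50 /var/log/slurm/slurmd.log          # recent registration errors; path is site-dependent (SlurmdLogFile)
scontrol update NodeName=node3 State=RESUME   # clear the DOWN state once the cause is fixed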