[slurm-users] StateSaveLocation and Slurm HA

2024-05-07 Thread Pierre Abele via slurm-users
Hi all, I am looking for a clean way to set up Slurm's native high availability feature. I am managing a Slurm cluster with one control node (hosting both slurmctld and slurmdbd), one login node, and a few dozen compute nodes. I have a virtual machine that I want to set up as a backup control node
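
For context, Slurm's built-in HA boils down to listing a second SlurmctldHost in slurm.conf and making StateSaveLocation reachable from both controllers. A minimal sketch, with hypothetical hostnames and path (not taken from the post):

    # slurm.conf -- settings relevant to controller failover (illustrative values)
    SlurmctldHost=ctl-primary                      # primary controller
    SlurmctldHost=ctl-backup                       # backup controller (e.g. the VM); takes over if the primary stops responding
    StateSaveLocation=/var/spool/slurm/statesave   # must be readable and writable by both controllers
    SlurmctldTimeout=120                           # seconds the backup waits before assuming control

The rest of the thread is essentially about how to share that StateSaveLocation directory between the two hosts.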

[slurm-users] "token expired" errors with auth/slurm

2024-05-07 Thread Fabio Ranalli via slurm-users
Hi there, We've updated to 23.11.6 and replaced MUNGE with SACK. Performance and stability have both been pretty good, but we're occasionally seeing this in the slurmctld.log:
[2024-05-07T03:50:16.638] error: decode_jwt: token expired at 1715053769
[2024-05-07T03:50:16.638] error: cred_p_unpa

[slurm-users] Re: StateSaveLocation and Slurm HA

2024-05-07 Thread Fabio Ranalli via slurm-users
You can try DRBD (https://linbit.com/drbd/) or a shared-disk (clustered) FS like GFS2, OCFS2, etc.:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html-single/configuring_gfs2_file_systems/index
https://docs.oracle.com/en/operating-systems/oracle-linux/9/shareadmin/shareadm

[slurm-users] Re: StateSaveLocation and Slurm HA

2024-05-07 Thread Davide DelVento via slurm-users
Are you seeking something simple rather than sophisticated? If so, you can use the controller's local disk for StateSaveLocation and place a cron job (on the same node or somewhere else) to take that data out via e.g. rsync and put it where you need it (NFS?) for the backup control node to use if/when needed.
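
A minimal sketch of that approach, assuming the backup controller is reachable as ctl-backup and StateSaveLocation is /var/spool/slurm/statesave (both placeholders):

    # root crontab on the primary controller: mirror the state directory every minute
    * * * * * rsync -a --delete /var/spool/slurm/statesave/ ctl-backup:/var/spool/slurm/statesave/

The trade-off is that any state written after the last sync is lost on failover, so the copy interval has to be short relative to how much job state you can afford to lose.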

[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-07 Thread Henderson, Brent via slurm-users
Over the past few days I grabbed some time on the nodes and ran for a few hours. Looks like I *can* still hit the issue with cgroups disabled. Incident rate was 8 out of >11k jobs so dropped an order of magnitude or so. Guessing that exonerates cgroups as the cause, but possibly just a good w

[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-07 Thread Ole Holm Nielsen via slurm-users
On 5/7/24 15:32, Henderson, Brent via slurm-users wrote: Over the past few days I grabbed some time on the nodes and ran for a few hours.  Looks like I **can** still hit the issue with cgroups disabled. Incident rate was 8 out of >11k jobs so dropped an order of magnitude or so.  Guessing that

[slurm-users] scrontab question

2024-05-07 Thread Sandor via slurm-users
I am working out the details of scrontab. My initial testing is giving me an unsolvable question. Within the scrontab editor I have the following example from the Slurm documentation:
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /directory/subdirectory/crontest.sh
When I save it, scrontab marks the line

[slurm-users] Re: [ext] scrontab question

2024-05-07 Thread Hagdorn, Magnus Karl Moritz via slurm-users
Hm, strange. I don't see a problem with the time specs, although I would use */5 * * * * to run something every 5 minutes. In my scrontab I also specify a partition, etc., but I don't think that is necessary.
regards magnus
On Tue, 2024-05-07 at 12:06 -0500, Sandor via slurm-users wrote:
> I am work
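
For illustration, a scrontab entry along the lines Magnus suggests might look like the following; the partition name is made up and the #SCRON option line is optional:

    # run the script every 5 minutes
    #SCRON --partition=batch
    */5 * * * * /directory/subdirectory/crontest.sh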

[slurm-users] Re: scrontab question

2024-05-07 Thread Bjørn-Helge Mevik via slurm-users
Sandor via slurm-users writes:
> I am working out the details of scrontab. My initial testing is giving me
> an unsolvable question
If you have an unsolvable problem, you don't have a problem, you have a fact of life. :)
> Within scrontab editor I have the following example from the slurm
> d