Hi all,
I am looking for a clean way to set up Slurm's native high availability
feature. I am managing a Slurm cluster with one control node (hosting
both slurmctld and slurmdbd), one login node and a few dozen compute
nodes. I have a virtual machine that I want to set up as a backup
control n
Hi there,
We've updated to 23.11.6 and replaced MUNGE with SACK.
Performance and stability have both been pretty good, but we're
occasionally seeing this in the slurmctld.log
[2024-05-07T03:50:16.638] error: decode_jwt: token expired at 1715053769
[2024-05-07T03:50:16.638] error: cred_p_unpa
You can try DRBD
https://linbit.com/drbd/
or a shared-disk (clustered) FS like GFS2, OCFS2, etc.
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html-single/configuring_gfs2_file_systems/index
https://docs.oracle.com/en/operating-systems/oracle-linux/9/shareadmin/shareadm
Are you seeking something simple rather than sophisticated? If so, you can
use the controller local disk for StateSaveLocation and place a cron job
(on the same node or somewhere else) to take that data out via e.g. rsync
and put it where you need it (NFS?) for the backup control node to use
if/whe
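A minimal sketch of the cron-plus-rsync idea described above, as a crontab entry on the controller. Both paths are assumptions for illustration only: /var/spool/slurmctld stands in for whatever StateSaveLocation is set to in slurm.conf, and /mnt/nfs/slurm-state for the location the backup control node would read from. The five-minute interval is likewise arbitrary.

```crontab
# Hypothetical paths; substitute the real StateSaveLocation and target.
# Every 5 minutes, mirror the controller's local state directory to shared storage.
# -a preserves permissions, ownership and timestamps; --delete drops files
# that no longer exist in the source directory.
*/5 * * * * rsync -a --delete /var/spool/slurmctld/ /mnt/nfs/slurm-state/
```

With this approach the copy is only as fresh as the last run, and a run can catch state files while slurmctld is writing them, so it is a simple fallback rather than a replacement for shared storage.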
Over the past few days I grabbed some time on the nodes and ran for a few
hours. Looks like I *can* still hit the issue with cgroups disabled. Incident
rate was 8 out of >11k jobs so dropped an order of magnitude or so. Guessing
that exonerates cgroups as the cause, but possibly just a good w
On 5/7/24 15:32, Henderson, Brent via slurm-users wrote:
> Over the past few days I grabbed some time on the nodes and ran for a few
> hours. Looks like I *can* still hit the issue with cgroups disabled.
> Incident rate was 8 out of >11k jobs so dropped an order of magnitude or
> so. Guessing that
I am working out the details of scrontab. My initial testing is giving me
an unsolvable question
Within scrontab editor I have the following example from the slurm
documentation:
0,5,10,15,20,25,30,35,40,45,50,55 * * * *
/directory/subdirectory/crontest.sh
When I save it, scrontab marks the line
Hm, strange. I don't see a problem with the time specs, although I
would use
*/5 * * * *
to run something every 5 minutes. In my scrontab I also specify a
partition, etc. But I don't think that is necessary.
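For illustration, a scrontab entry along the lines described above might look like the sketch below. The partition name and time limit are assumptions, not values from this thread; the script path is the one from the original question. Lines starting with #SCRON take sbatch-style options and apply to the entry that follows them.

```crontab
# Hypothetical partition and time limit; adjust to the local cluster.
#SCRON --partition=debug
#SCRON --time=00:05:00
# Run the script every 5 minutes.
*/5 * * * * /directory/subdirectory/crontest.sh
```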
regards
magnus
On Tue, 2024-05-07 at 12:06 -0500, Sandor via slurm-users wrote:
> I am work
Sandor via slurm-users writes:
> I am working out the details of scrontab. My initial testing is giving me
> an unsolvable question
If you have an unsolvable problem, you don't have a problem, you have a
fact of life. :)
> Within scrontab editor I have the following example from the slurm
> d