[slurm-users] Slurm management of dual-node server trays?

2024-02-23 Thread Ole Holm Nielsen via slurm-users
We're in the process of installing some racks with Lenovo SD665 V3 [1] water-cooled servers. A Lenovo DW612S chassis contains six 1U trays, with two SD665 V3 servers mounted side-by-side in each tray. Lenovo delivers SD665 V3 servers including water-cooled NVIDIA InfiniBand "SharedIO" adapters [2]
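Worth noting for Slurm operators: with "SharedIO", the two servers in a tray share a single InfiniBand adapter, so the pair is not fully independent at the fabric level. A minimal sketch of how one such chassis might be declared in slurm.conf; the hostnames, core counts, and memory figures are illustrative assumptions, not values from the thread:

    # Hypothetical node definitions for one DW612S chassis (6 trays x 2 servers);
    # the names and hardware figures below are assumptions for illustration.
    NodeName=sd665-[01-12] Sockets=2 CoresPerSocket=48 RealMemory=768000 State=UNKNOWN
    # Since each tray's pair shares one SharedIO InfiniBand adapter, draining or
    # powering off one node of a pair may need to be coordinated with its partner.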

[slurm-users] Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING

2024-02-23 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, I have a cloud node that is powered up and down on demand. Rarely, it can happen that Slurm's ResumeTimeout is reached and the node is therefore powered down. We have set ReturnToService=2 in order to avoid the node being marked DOWN, because the instance behind that node is
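For context, a hedged sketch of the slurm.conf power-saving settings involved here; the program paths, timeout value, and node sizes are assumptions for illustration:

    # Cloud-node power saving: nodes are created on demand and torn down when idle.
    SuspendProgram=/usr/local/sbin/slurm_suspend.sh   # assumed path
    ResumeProgram=/usr/local/sbin/slurm_resume.sh     # assumed path
    ResumeTimeout=600     # seconds before a resuming node is considered failed
    ReturnToService=2     # let a DOWN node become usable again once it registers
    NodeName=cloud[001-010] State=CLOUD CPUs=8 RealMemory=32000   # assumed sizes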

[slurm-users] "Optimal" slurm configuration

2024-02-23 Thread Max Grönke via slurm-users
Hello! In our current cluster the workflows are quite diverse (a bunch of large, long (24-72h) jobs; medium-size <4h jobs; and many small 1-node jobs). The current priority is fairshare only (averaged over a timescale of a few months). For the new setup we would like to (1) discourage the 1-node jobs [espe
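One common way to express such goals is Slurm's multifactor priority plugin, which can keep fairshare dominant while adding a job-size component. A sketch follows; all weights and the decay window are illustrative assumptions, not a recommendation from the thread:

    # Hedged sketch: fairshare-dominated multifactor priority with a job-size term.
    PriorityType=priority/multifactor
    PriorityDecayHalfLife=14-0       # fairshare averaging window (14 days, assumed)
    PriorityWeightFairshare=100000   # fairshare remains the dominant factor
    PriorityWeightJobSize=10000      # gives multi-node jobs an edge over 1-node jobs
    PriorityFavorSmall=NO            # larger jobs receive the larger job-size bonus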

[slurm-users] Re: Slurm management of dual-node server trays?

2024-02-23 Thread Sid Young via slurm-users
That's a very interesting design. Looking at the SD665 V3 documentation, am I correct that each node has dual 25Gb/s SFP28 interfaces? If so, despite the dual nodes in a 1U configuration, you actually have two separate servers? Sid On Fri, 23 Feb 2024, 22:40 Ole Holm Nielsen via slurm-users, < slurm-u

[slurm-users] slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of JobId=

2024-02-23 Thread Robert Kudyba via slurm-users
We switched over from using systemctl for tmp.mount and changed to zram, e.g.: modprobe zram; echo 20GB > /sys/block/zram0/disksize; mkfs.xfs /dev/zram0; mount -o discard /dev/zram0 /tmp. srun with --x11 was working before this change. We're on RHEL 9. slurmctld logs show this whenever --x11 is used
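The quoted setup, restated one command per line; the final chmod is an assumption about the likely culprit, since a freshly created filesystem mounts as root-owned 0755, which breaks anything expecting the standard sticky, world-writable /tmp:

    # zram-backed /tmp as described in the message, one command per line:
    modprobe zram
    echo 20GB > /sys/block/zram0/disksize   # size the compressed RAM disk
    mkfs.xfs /dev/zram0
    mount -o discard /dev/zram0 /tmp
    # Assumed fix, not confirmed by the thread: restore the sticky,
    # world-writable mode that /tmp requires (a fresh XFS mounts as 0755 root).
    chmod 1777 /tmp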

[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-23 Thread Christopher Samuel via slurm-users
Hi Robert, On 2/23/24 17:38, Robert Kudyba via slurm-users wrote: We switched over from using systemctl for tmp.mount and changed to zram, e.g.: modprobe zram; echo 20GB > /sys/block/zram0/disksize; mkfs.xfs /dev/zram0; mount -o discard /dev/zram0 /tmp [...] > [2024-02-23T20:26:15.881] [530.exter