We switched /tmp from systemd's tmp.mount (managed with systemctl) to a zram device:
modprobe zram
echo 20GB > /sys/block/zram0/disksize
mkfs.xfs /dev/zram0
mount -o discard /dev/zram0 /tmp

srun with --x11 was working before changing this. We're on RHEL 9.
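
One thing that may be worth ruling out, offered as a sketch and not a confirmed diagnosis: the root directory of a freshly created filesystem is normally owned by root with mode 755, whereas systemd's tmp.mount mounts /tmp with mode 1777. If the new /tmp is not world-writable, an unprivileged user could not create files there. The helper name check_tmp_mode below is made up for illustration; it only reads the directory mode with stat.

```shell
#!/bin/sh
# check_tmp_mode is a hypothetical helper, not part of Slurm or RHEL.
# It reports whether a directory carries the sticky, world-writable
# mode (1777) that /tmp normally has. Defaults to /tmp if no argument.
check_tmp_mode() {
    dir="${1:-/tmp}"
    # GNU stat: %a prints the octal mode, including the sticky bit.
    mode=$(stat -c '%a' "$dir") || return 2
    if [ "$mode" = "1777" ]; then
        echo "$dir: mode $mode (ok)"
        return 0
    fi
    echo "$dir: mode $mode (expected 1777)"
    return 1
}
```

Running `check_tmp_mode /tmp` on a compute node after the mount shows the current mode. If it is not 1777, running `chmod 1777 /tmp` as root after the mount step would restore the usual permissions; whether that resolves the --x11 failure is an assumption to test.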

slurmd logs on the node show this whenever --x11 is used with srun:
[2024-02-23T20:22:43.442] [529.extern] error: setup_x11_forward: failed to create temporary XAUTHORITY file: Permission denied
[2024-02-23T20:22:43.442] [529.extern] error: x11 port forwarding setup
[2024-02-23T20:22:43.442] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2024-02-23T20:22:43.443] Could not launch job 529 and not able to requeue it, cancelling job
[2024-02-23T20:26:15.881] [530.extern] error: setup_x11_forward: failed to create temporary XAUTHORITY file: Permission denied
[2024-02-23T20:26:15.881] [530.extern] error: x11 port forwarding setup
[2024-02-23T20:26:15.882] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2024-02-23T20:26:15.883] Could not launch job 530 and not able to requeue it, cancelling job

slurmctld log entries for the same job:
[2024-02-23T20:26:15.859] sched: _slurm_rpc_allocate_resources JobId=530 NodeList=2402-node005 usec=1800
[2024-02-23T20:26:15.882] _slurm_rpc_requeue: Requeue of JobId=530 returned an error: Only batch jobs are accepted or processed
[2024-02-23T20:26:15.883] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=530 uid 0
[2024-02-23T20:26:15.962] _slurm_rpc_complete_job_allocation: JobId=530 error Job/step already completing or completed

srun -v --pty -t 0-4:00 --x11 --mem=10g
srun: defined options
srun: -------------------- --------------------
srun: account             : me
srun: mem                 : 10G
srun: nodelist            : our-node
srun: pty                 :
srun: time                : 04:00:00
srun: verbose             : 1
srun: x11                 : all
srun: -------------------- --------------------
srun: end of defined options
srun: Waiting for resource configuration
srun: error: Nodes our-node are still not ready
srun: error: Something is wrong with the boot of the nodes.

slurm.conf has PrologFlags=x11 set. /usr/bin/xauth is installed on each
compute node.
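
For anyone wanting to confirm the same two prerequisites on their own nodes, a minimal sketch follows. The function name has_x11_prolog_flag and the path /etc/slurm/slurm.conf are assumptions for illustration; the check is a plain grep and does not follow Include directives, so a setting kept in an included file would be missed.

```shell
#!/bin/sh
# has_x11_prolog_flag is a hypothetical helper, not a Slurm command.
# It succeeds if the given config file has an uncommented PrologFlags
# line whose value mentions x11 (matched case-insensitively).
has_x11_prolog_flag() {
    grep -Eiq '^[[:space:]]*PrologFlags[[:space:]]*=.*x11' "$1"
}

# Example use (the config path is an assumption; adjust to your site):
#   has_x11_prolog_flag /etc/slurm/slurm.conf && echo "PrologFlags has x11"
#   [ -x /usr/bin/xauth ] && echo "xauth is installed"
```

Both checks only confirm that the settings are present; they say nothing about whether the temporary XAUTHORITY file can actually be created.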

Is this a known issue with zram or is that just a red herring and there's
something else wrong?