[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Robert Kudyba via slurm-users
<< wrote:
> Hi Robert,
>
> On 2/23/24 17:38, Robert Kudyba via slurm-users wrote:
>
> > We switched over from using systemctl for tmp.mount and change to zram,
> > e.g.,
> > modprobe zram
> > echo 20GB > /sys/block/zram0/disksize
> > mkfs.xfs /dev/zram0
> > mount -o discard /dev/zram0 /tmp
> [...]

[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Chris Samuel via slurm-users
On 24/2/24 06:14, Robert Kudyba via slurm-users wrote:
> For now I just set it to chmod 777 on /tmp and that fixed the errors. Is there a better option?
Traditionally /tmp and /var/tmp have been 1777 (that "1" being the sticky bit, originally invented to indicate that the OS should attempt to
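
The 1777 mode mentioned in the reply above can be sketched as follows. This is a minimal illustration, not a command taken from the thread: it works on a scratch directory created with `mktemp` (the name `demo_dir` is an assumption for the example) so it can be run without root, and it assumes GNU `stat`.

```shell
# Scratch directory standing in for /tmp (hypothetical; avoids touching the real mount).
demo_dir="$(mktemp -d)"

# 1777 = world-writable (777) plus the sticky bit (leading 1).
chmod 1777 "$demo_dir"

# GNU stat prints the octal mode, including the sticky bit.
mode="$(stat -c '%a' "$demo_dir")"
echo "mode: $mode"

# ls shows a trailing "t" in the permission string when the sticky bit is set.
ls -ld "$demo_dir"
```

On the real mount point the equivalent, run as root after mounting, would presumably be `chmod 1777 /tmp`.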

[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Robert Kudyba via slurm-users
<< wrote:
> On 24/2/24 06:14, Robert Kudyba via slurm-users wrote:
>
> > For now I just set it to chmod 777 on /tmp and that fixed the errors. Is
> > there a better option?
>
> Traditionally /tmp and /var/tmp have been 1777 (that "1" being the
> sticky bit, originally invented to indicate that the

[slurm-users] FAQ describing how to hold a job ignores scontrol subcommands specifically for that purpose

2024-02-24 Thread urbanjost via slurm-users
There are scontrol subcommands uhold/hold/release/requeuehold that are ignored when describing how to place a job on hold in FAQ 21; and it is never explained why the method described therein is the best method, it just states it is. Does anyone know why the FAQ method is better than using the s
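
For readers unfamiliar with the subcommands named in the message above, a rough sketch of how they are typically invoked follows. The job ID and the `run` dry-run wrapper are assumptions made for illustration. By default the script only prints each command; setting `RUN=1` on a host with Slurm installed would execute them. It does not reproduce or evaluate the FAQ's own method.

```shell
# Hypothetical job ID, for illustration only.
jobid=12345

# Dry-run helper (an assumption of this sketch): print the command unless RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

run scontrol hold "$jobid"         # hold a pending job
run scontrol uhold "$jobid"        # hold it in a way the job's owner can release
run scontrol release "$jobid"      # release a previously held job
run scontrol requeuehold "$jobid"  # requeue the job and leave it held
```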

[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Robert Kudyba via slurm-users
Now what would be causing this? The srun just hangs and these are the only logs from slurmctld:
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node007
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node006
[2024-02-24T23:23:26.003] error: Orph