Hello, we've run into another issue when using salloc interactively on a cluster where Slurm power saving is enabled. The problem seems to be caused by the job_container plugin and occurs when the job starts on a node that boots from a powered-down state. If I resubmit a job to the same node immediately after the failure, it always works. I can't find any other way to reproduce the issue than booting a reserved node from a powered-down state.
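In case someone wants to reproduce it, the sequence looks roughly like this (using scontrol to force the power down instead of waiting for SuspendTime is my own shortcut; sinfo shows a powered-down node as "idle~"):

> scontrol update NodeName=isu-n001 State=POWER_DOWN
> sinfo -n isu-n001               # wait until the node shows as "idle~" (powered down)
> salloc --nodelist=isu-n001      # job boots the node and hits the container_p_join error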
Is this a known issue? srun and sbatch don't have the problem. We use Slurm 22.05.3.

> salloc --nodelist=isu-n001
salloc: Granted job allocation 791670
salloc: Waiting for resource configuration
salloc: Nodes isu-n001 are ready for job
slurmstepd: error: container_p_join: open failed for /scratch/job_containers/791670/.ns: No such file or directory
slurmstepd: error: container_g_join failed: 791670
slurmstepd: error: write to unblock task 0 failed: Broken pipe
srun: error: isu-n001: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=791670.interactive
salloc: Relinquishing job allocation 791670

# Slurm controller configs #
> cat /etc/slurm/slurm.conf
..
JobContainerType=job_container/tmpfs
..
LaunchParameters=use_interactive_step
InteractiveStepOptions="--interactive --preserve-env --pty $SHELL -l"

# Job_container #
> cat /etc/slurm/job_container.conf
AutoBasePath=true
BasePath=/scratch/job_containers

Thank you & kind regards
Gizo
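P.S. For debugging, this is what I'd check on the node right after it boots (the job ID and paths are from the failing run above; ls and /proc/mounts are just generic diagnostics, nothing Slurm-specific):

> ls -ld /scratch/job_containers          # does AutoBasePath recreate BasePath after boot?
> ls -la /scratch/job_containers/791670   # per-job directory that should contain the .ns file
> grep job_containers /proc/mounts        # bind mounts set up by job_container/tmpfs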