Hello, 

we have run into another issue when using salloc interactively on a cluster
with Slurm power saving enabled. The problem seems to be caused by the
job_container plugin and occurs when the job starts on a node that is booting
from a powered-down state.
If I immediately resubmit a job to the same node after the failure, it always
works. The only way I can reproduce the issue is to boot a reserved node from
a powered-down state.
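
For completeness, a rough sketch of how I trigger it (the explicit power-down
via scontrol is just a shortcut instead of waiting for SuspendTime to expire;
the node name is of course site-specific):

```shell
# Put the idle node into the powered-down state so that the next
# allocation has to boot it via ResumeProgram.
scontrol update NodeName=isu-n001 State=POWER_DOWN

# Wait until sinfo shows the node as powered down ("idle~"),
# then request it interactively; the failure appears on the boot.
salloc --nodelist=isu-n001
```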

Is this a known issue?

srun and sbatch are not affected.
We run Slurm 22.05.3.

> salloc --nodelist=isu-n001
salloc: Granted job allocation 791670
salloc: Waiting for resource configuration
salloc: Nodes isu-n001 are ready for job
slurmstepd: error: container_p_join: open failed for /scratch/job_containers/791670/.ns: No such file or directory
slurmstepd: error: container_g_join failed: 791670
slurmstepd: error: write to unblock task 0 failed: Broken pipe
srun: error: isu-n001: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=791670.interactive
salloc: Relinquishing job allocation 791670

# Slurm controller configs
#
> cat /etc/slurm/slurm.conf
..
JobContainerType=job_container/tmpfs
..
LaunchParameters=use_interactive_step
InteractiveStepOptions="--interactive --preserve-env --pty $SHELL -l"
  
# Job_container
#    
> cat /etc/slurm/job_container.conf
AutoBasePath=true
BasePath=/scratch/job_containers

Thank you & kind regards
Gizo
