Please disregard my question: setting SchedulerParameters=salloc_wait_nodes
in slurm.conf solves the issue.
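
For anyone who finds this thread later, here is a minimal sketch of the
relevant slurm.conf lines. It is illustrative only and not a copy of a real
configuration; it assumes SchedulerParameters has no other values yet. If it
already does on your cluster, append salloc_wait_nodes to the existing
comma-separated list instead of adding a second line.

```
# slurm.conf (excerpt, illustrative only)

# Make salloc wait until all allocated nodes are booted and ready for use
# before it returns and the interactive step is launched.
SchedulerParameters=salloc_wait_nodes

# Setting mentioned in the original mail below: salloc starts an
# interactive step via srun.
LaunchParameters=use_interactive_step
```

After editing slurm.conf, the controller has to re-read it, for example with
"scontrol reconfigure". As far as I know, "salloc --wait-all-nodes=1"
requests the same behavior for a single job without changing the
cluster-wide default.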

Kind regards,
Gizo


> Hello, 
> 
> it seems that in a cluster configured for power saving, salloc does not wait
> until the nodes assigned to the job recover from the power-down state and
> return to normal operation.
> 
> Although the job is in state CONFIGURING and the nodes are still in
> IDLE+NOT_RESPONDING+POWERING_UP, the nodes are declared ready for the job
> and srun is invoked, which of course fails because the nodes are still
> booting. (On our cluster, salloc is configured for interactive use: we have
> LaunchParameters=use_interactive_step in slurm.conf.)
> 
> Is this the expected behavior of salloc?
> 
> srun and sbatch work as expected.
> 
> We use Slurm 22.05.3.
> 
> > salloc --nodelist=taurus-n008
> ......
> salloc: Waiting for resource configuration
> salloc: Nodes taurus-n008 are ready for job
> srun: error: Task launch for StepId=766789.interactive failed on node 
> taurus-n008: Communication connection failure
> srun: error: Application launch failed: Communication connection failure
> srun: Job step aborted
> salloc: Relinquishing job allocation 766789
> 
> > scontrol show nodes taurus-n008
> ......
> State=IDLE+NOT_RESPONDING+POWERING_UP
> ....
> 
> > scontrol show job 766789
> .....
> JobState=CONFIGURING Reason=None Dependency=(null)
> NodeList=taurus-n008
> 
> Thank you & kind regards
> Gizo
>