Please ignore the question - the option SchedulerParameters=salloc_wait_nodes solves the issue.
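
For anyone who finds this thread later, a minimal sketch of the setting in slurm.conf. Only the option name comes from this thread; the rest is an assumption about a typical setup. If the site already defines SchedulerParameters, the option has to be appended to that existing comma-separated list rather than added as a second line:

```
# slurm.conf (sketch, not a complete configuration)
# Makes salloc wait until all allocated nodes are ready for use (booted)
# before it returns, instead of returning as soon as the allocation is made.
SchedulerParameters=salloc_wait_nodes
```

The change would typically be picked up with "scontrol reconfigure", though whether that is sufficient may depend on the Slurm version. The salloc man page also documents a per-job option, --wait-all-nodes=1, which should give the same behavior for a single allocation without changing the cluster-wide default.
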
kind regards
Gizo

> Hello,
>
> it seems that in a cluster configured for power saving, salloc does not wait
> until the nodes assigned to the job recover from the power down state and go
> back to normal operation.
>
> Although the job is in the state CONFIGURING and the nodes are still in
> IDLE+NOT_RESPONDING+POWERING_UP, the nodes are declared ready for the job and
> srun is invoked (on our cluster, salloc is configured for interactive use.
> We have LaunchParameters=use_interactive_step in slurm.conf), which of course
> fails as the nodes are still booting.
>
> Is this the expected behavior of salloc?
>
> Srun and sbatch work as expected.
>
> We use Slurm 22.05.3
>
> > salloc --nodelist=taurus-n008
> ......
> salloc: Waiting for resource configuration
> salloc: Nodes taurus-n008 are ready for job
> srun: error: Task launch for StepId=766789.interactive failed on node taurus-n008: Communication connection failure
> srun: error: Application launch failed: Communication connection failure
> srun: Job step aborted
> salloc: Relinquishing job allocation 766789
>
> > scontrol show nodes taurus-n008
> ......
> State=IDLE+NOT_RESPONDING+POWERING_UP
> ....
>
> > scontrol show job 766789
> .....
> JobState=CONFIGURING Reason=None Dependency=(null)
> NodeList=taurus-n008
>
> Thank you & kind regards
> Gizo