Sorry for this very late response. The directory where job containers are created is of course already there - it is on the local filesystem. We also start slurmd as the very last process, once a node is ready to accept jobs. The problem seems to be either a feature of salloc or a bug in Slurm, presumably caused by some race condition - in very rare cases, salloc works without this issue. I see that the documentation on Slurm power saving mentions salloc, but not the case where it is used interactively.
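In case it is useful to others, here is roughly how we enforce that ordering - a minimal sketch only, assuming slurmd is managed by systemd; "node-ready.service" is just a placeholder for whatever unit marks the end of node provisioning on your site:

    # /etc/systemd/system/slurmd.service.d/ordering.conf
    [Unit]
    # Do not start slurmd until node preparation has completed successfully.
    Requires=node-ready.service
    After=node-ready.service remote-fs.target
    # If the job_container base path lived on a separate mount (as suspected
    # below for /scratch), this would also delay slurmd until it is mounted.
    RequiresMountsFor=/scratch/job_containers

In our case the path is local, so the last directive is not strictly needed, but it does no harm.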
Thank you & best regards Gizo > On 27/10/22 4:18 am, Gizo Nanava wrote: > > > we run into another issue when using salloc interactively on a cluster > > where Slurm > > power saving is enabled. The problem seems to be caused by the > > job_container plugin > > and occurs when the job starts on a node which boots from a power down > > state. > > If I resubmit a job immediately after the failure to the same node, it > > always works. > > I can't find any other way to reproduce the issue other than booting a > > reserved node from a power down state. > > Looking at this: > > > slurmstepd: error: container_p_join: open failed for > > /scratch/job_containers/791670/.ns: No such file or directory > > I'm wondering is a separate filesystem and, if so, could /scratch be > only getting mounted _after_ slurmd has started on the node? > > If that's the case then it would explain the error and why it works > immediately after. > > On our systems we always try and ensure that slurmd is the very last > thing to start on a node, and it only starts if everything has succeeded > up to that point. > > All the best, > Chris > -- > Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA > >