Hi Chris, all,

We've been having similar issues, seemingly since upgrading to Slurm 22.05.x, where job steps in batch jobs submitted from interactive sessions fail sporadically:
1. User SSHs to a login node.
2. User runs 'srun --pty /bin/bash' to get an interactive session on a worker node.
3. From that interactive session the user submits a batch job containing >=1 explicit job step.
4. The job step then _might_ fail with something like:

srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x2.
srun: error: Task launch for StepId=372.0 failed on node px01: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted

This seems to be due to SLURM_CPU_BIND_* env vars being set in the interactive job, which then (undesirably) propagate to the batch job and cause problems if the job's CPU allocation conflicts with the inherited SLURM_CPU_BIND_* values.

Unsetting those env vars at the top of the job submission script (sketched below) seems to prevent the issue from occurring, but isn't something we want to recommend to users. We're also concerned that propagation of other env vars from the interactive job to the batch job might cause further problems. We thought that SLURM_EXPORT_ENV / SBATCH_EXPORT could help here, but the docs for those features say: "Note that SLURM_* variables are always propagated."

Has anything changed in 22.05 that could explain this? The only potentially relevant entries I can spot in the changelog are:

-- Fail srun when using invalid `--cpu-bind` options (e.g. `--cpu-bind=map_cpu:99` when only 10 cpus are allocated).
-- `srun --overlap` now allows the step to share all resources (CPUs, memory, and GRES), where previously `--overlap` only allowed the step to share CPUs with other steps.

NB this has also been discussed on the Slurm Bugzilla (https://bugs.schedmd.com/show_bug.cgi?id=14298).
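For reference, the workaround we've been trialling is simply to clear the inherited binding variables before the first job step runs. A rough sketch of such a submission script, assuming bash (the application name here is made up, and the exact set of SLURM_CPU_BIND* variables present will depend on how the interactive session was launched):

#!/bin/bash -l
#SBATCH -n 4

# Clear any CPU-binding settings inherited from the interactive session so
# they cannot conflict with this job's own allocation. In bash,
# "${!SLURM_CPU_BIND@}" expands to the names of all variables whose names
# begin with SLURM_CPU_BIND (e.g. SLURM_CPU_BIND, SLURM_CPU_BIND_TYPE);
# unset is a no-op if none of them are set.
unset "${!SLURM_CPU_BIND@}"

# Job step that otherwise intermittently fails with
# "Unable to satisfy cpu bind request"
srun ./my_app

That seems to work in our testing, but it puts the burden on users, which is why we'd like to understand what changed in 22.05.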
Regards,

Will

On Fri, 10 Jun 2022 at 14:55, Rutledge, Chris <crutle...@renci.org> wrote:
> Hello Everyone,
>
> Having an odd issue with the latest version of slurm (22.05.0) when
> submitting jobs to the queue while on a compute resource. Some jobs won't
> reproduce this issue every time, but I've got a few that will. Here's one
> case that consistently errors when trying to launch. I've not been able to
> reproduce the issue when submitting jobs from the login node.
>
> Anyone seen anything like this?
>
> ##############################
> # start interactive session
> ##############################
> [crutledge@ht1 ~]$ /usr/bin/srun --pty /bin/bash -i -l
> [crutledge@largemem-5-1 ~]$ cd hpcc/bin/gpu-6/
>
> ##############################
> # job details
> ##############################
> [crutledge@largemem-5-1 gpu-6]$ cat job
> #!/bin/bash -l
> #
> #SBATCH --job-name=HPCC
> #SBATCH -n 48
> #SBATCH -p gpu
> #SBATCH --mem-per-cpu=3975
>
> module load icc/2022.0.2 env_icc/any mvapich2/2.3.7-intel
>
> srun ./hpcc
>
> mv hpccoutf.txt hpccoutf.txt.${SLURM_JOB_ID}
>
> ##############################
> # submit the job
> ##############################
> [crutledge@largemem-5-1 gpu-6]$ sbatch job
> Submitted batch job 8533
>
> ##############################
> # resulting error
> ##############################
> [crutledge@largemem-5-1 gpu-6]$ cat slurm-8533.out
> Loading icc version 2022.0.2
> Loading compiler-rt version 2022.0.2
> srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000000000001000000000001.
> srun: error: Task launch for StepId=8533.0 failed on node gpu-5-2: Unable to satisfy cpu bind request
> srun: error: Application launch failed: Unable to satisfy cpu bind request
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> slurmstepd: error: *** STEP 8533.0 ON gpu-5-1 CANCELLED AT 2022-06-10T09:38:19 ***
> srun: error: gpu-5-1: tasks 0-46: Killed
> mv: cannot stat ‘hpccoutf.txt’: No such file or directory
> [crutledge@largemem-5-1 gpu-6]$

--
Dr Will Furnass | Research Platforms Engineer
IT Services | University of Sheffield
+44 (0)114 22 29693