Hi Chris, all,

We've been having similar issues, seemingly since upgrading to Slurm 22.05.x, where job steps in batch jobs submitted from interactive sessions fail sporadically:
1. User SSHs to a login node.
2. User runs 'srun --pty /bin/bash' to get an interactive session on a worker node.
3. From that interactive session the user submits a batch job containing >=1 explicit job step.
4. The job step then _might_ fail with something like:

srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x2.
srun: error: Task launch for StepId=372.0 failed on node px01: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted

This seems to be due to SLURM_CPU_BIND_* env vars being set in the interactive job, which then (undesirably) propagate to the batch job and cause problems if the job's CPU allocation conflicts with the inherited SLURM_CPU_BIND_* values.

Unsetting those env vars at the top of the job submission script (sketched below) seems to prevent the issue from occurring, but isn't something we want to recommend to users. We're also concerned that propagation of other env vars from the interactive job to the batch job might cause further problems. We thought that SLURM_EXPORT_ENV / SBATCH_EXPORT could help here, but the docs for those features say: "Note that SLURM_* variables are always propagated."

Has anything changed in 22.05 that could explain this? The only potentially relevant entries I can spot in the changelog are:

-- Fail srun when using invalid `--cpu-bind` options (e.g. `--cpu-bind=map_cpu:99` when only 10 cpus are allocated).
-- `srun --overlap` now allows the step to share all resources (CPUs, memory, and GRES), where previously `--overlap` only allowed the step to share CPUs with other steps.

NB this has also been discussed on the Slurm Bugzilla (https://bugs.schedmd.com/show_bug.cgi?id=14298).
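For reference, the workaround we've been trialling is simply to clear the inherited binding variables before the first job step runs. A rough sketch of such a submission script, assuming bash (the application name here is made up, and the exact set of SLURM_CPU_BIND* variables present will depend on how the interactive session was launched):

#!/bin/bash -l
#SBATCH -n 4

# Clear any CPU-binding settings inherited from the interactive session so
# they cannot conflict with this job's own allocation. In bash,
# "${!SLURM_CPU_BIND@}" expands to the names of all variables whose names
# begin with SLURM_CPU_BIND (e.g. SLURM_CPU_BIND, SLURM_CPU_BIND_TYPE);
# unset is a no-op if none of them are set.
unset "${!SLURM_CPU_BIND@}"

# Job step that otherwise intermittently fails with
# "Unable to satisfy cpu bind request"
srun ./my_app

That seems to work in our testing, but it puts the burden on users, which is why we'd like to understand what changed in 22.05.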
Regards,

Will

On Fri, 10 Jun 2022 at 14:55, Rutledge, Chris <crutle...@renci.org> wrote:
> Hello Everyone,
>
> Having an odd issue with the latest version of slurm (22.05.0) when
> submitting jobs to the queue while on a compute resource. Some jobs won't
> reproduce this issue every time, but I've got a few that will. Here's one
> case that consistently errors when trying to launch. I've not been able to
> reproduce the issue when submitting jobs from the login node.
>
> Anyone seen anything like this?
>
> ##############################
> # start interactive session
> ##############################
> [crutledge@ht1 ~]$ /usr/bin/srun --pty /bin/bash -i -l
> [crutledge@largemem-5-1 ~]$ cd hpcc/bin/gpu-6/
>
> ##############################
> # job details
> ##############################
> [crutledge@largemem-5-1 gpu-6]$ cat job
> #!/bin/bash -l
> #
> #SBATCH --job-name=HPCC
> #SBATCH -n 48
> #SBATCH -p gpu
> #SBATCH --mem-per-cpu=3975
>
> module load icc/2022.0.2 env_icc/any mvapich2/2.3.7-intel
>
> srun ./hpcc
>
> mv hpccoutf.txt hpccoutf.txt.${SLURM_JOB_ID}
>
> ##############################
> # submit the job
> ##############################
> [crutledge@largemem-5-1 gpu-6]$ sbatch job
> Submitted batch job 8533
>
> ##############################
> # resulting error
> ##############################
> [crutledge@largemem-5-1 gpu-6]$ cat slurm-8533.out
> Loading icc version 2022.0.2
> Loading compiler-rt version 2022.0.2
> srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000000000001000000000001.
> srun: error: Task launch for StepId=8533.0 failed on node gpu-5-2: Unable to satisfy cpu bind request
> srun: error: Application launch failed: Unable to satisfy cpu bind request
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> slurmstepd: error: *** STEP 8533.0 ON gpu-5-1 CANCELLED AT 2022-06-10T09:38:19 ***
> srun: error: gpu-5-1: tasks 0-46: Killed
> mv: cannot stat ‘hpccoutf.txt’: No such file or directory
> [crutledge@largemem-5-1 gpu-6]$

--
Dr Will Furnass | Research Platforms Engineer
IT Services | University of Sheffield
+44 (0)114 22 29693