Re: [slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5

2022-02-10 Thread Paul Edmon
We also noticed the same thing with 21.08.5. In the 21.08 series, SchedMD changed the way they handle cgroups to set the stage for cgroups v2 (see: https://slurm.schedmd.com/SLUG21/Roadmap.pdf). Version 21.08.5 introduced a bug fix which then caused mpirun to not pin properly (particularly for olde

Re: [slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5

2022-02-10 Thread Ward Poelmans
Hi Paul, On 10/02/2022 14:33, Paul Brunk wrote: Now we see a problem in which the OOM killer is in some cases predictably killing job steps who don't seem to deserve it.  In some cases these are job scripts and input files which ran fine before our Slurm upgrade.  More details follow, but th

Re: [slurm-users] slurmctld/slurmdbd filesystem/usermap requirements

2022-02-10 Thread Diego Zuccato
Thanks a lot to both Steffen and Paul! That clarifies everything! On 10/02/2022 14:11, Paul Brunk wrote: Hi: slurmctld runs as an unprivileged user ('slurm' by default) who probably doesn't have read access to the user's job scripts.  'sbatch' submits the scripts via network to slurmctld, w

[slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5

2022-02-10 Thread Paul Brunk
Hello all: We upgraded from 20.11.8 to 21.08.5 (CentOS 7.9, Slurm built without pmix support) recently. After that, we found that in many cases, 'mpirun' was causing multi-node MPI jobs to have all MPI ranks within a node run on the same core. We've moved on to 'srun'. Now we see a problem in w
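A minimal sketch for checking the pinning symptom described above, not something taken from the thread. It assumes Linux (for `os.sched_getaffinity`) and that the launcher exports either `SLURM_PROCID` (srun) or `OMPI_COMM_WORLD_RANK` (Open MPI's mpirun). The helper name `overlapping_ranks` is hypothetical and only illustrates how per-rank CPU sets could be compared once collected.

```python
import os
import socket
from collections import defaultdict


def overlapping_ranks(affinities):
    """Return {cpu: [ranks]} for every CPU that two or more ranks may run on.

    `affinities` maps a rank number to the set of CPU ids it is allowed to use.
    """
    ranks_by_cpu = defaultdict(list)
    for rank, cpus in sorted(affinities.items()):
        for cpu in sorted(cpus):
            ranks_by_cpu[cpu].append(rank)
    return {cpu: ranks for cpu, ranks in ranks_by_cpu.items() if len(ranks) > 1}


def describe_self():
    """Report this process's host, rank (if the launcher exported one), and allowed CPUs."""
    rank = (
        os.environ.get("SLURM_PROCID")
        or os.environ.get("OMPI_COMM_WORLD_RANK")
        or "?"
    )
    cpus = sorted(os.sched_getaffinity(0))  # Linux only
    return f"{socket.gethostname()} rank={rank} cpus={cpus}"


if __name__ == "__main__":
    # Launch one copy per rank with the same launcher as the MPI job,
    # then compare the printed CPU lists for ranks on the same host.
    print(describe_self())
```

If every rank on a node prints the same single CPU, the ranks are stacked on one core. Feeding the collected sets for one host into `overlapping_ranks` lists which CPUs are shared and by which ranks.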

Re: [slurm-users] slurmctld/slurmdbd filesystem/usermap requirements

2022-02-10 Thread Paul Brunk
Hi: slurmctld runs as an unprivileged user ('slurm' by default) who probably doesn't have read access to the user's job scripts. 'sbatch' submits the scripts via network to slurmctld, who stores them in the slurm.conf 'StateSaveLocation', and sends them to slurmds at dispatch time, who store t

Re: [slurm-users] slurmctld/slurmdbd filesystem/usermap requirements

2022-02-10 Thread Steffen Grunewald
On Thu, 2022-02-10 at 11:59:58 +0100, Diego Zuccato wrote: > Hello all. > > Does slurmctld (or slurmdbd) need to access the same filesystems used on > submit nodes? Or do they just receive the needed information in the request? > > Does slurmctld need read access to /home/userA/myjob.sh or does it r

[slurm-users] slurmctld/slurmdbd filesystem/usermap requirements

2022-02-10 Thread Diego Zuccato
Hello all. Does slurmctld (or slurmdbd) need to access the same filesystems used on submit nodes? Or do they just receive the needed information in the request? Say the submit node and the worker nodes mount /home via NFS. Then userA submits a job with sbatch /home/userA/myjob.sh Does slurmctl

[slurm-users] Re: What is the 'Root/Cluster association' level in Resource Limits document mean?

2022-02-10 Thread taleintervenor
Well, ‘sacctmgr modify cluster name=***’ is exactly what we want, and inspired by this command, we found that ‘sacctmgr show cluster’ can clearly list all the cluster associations. But during testing we found another problem. When a limit is defined at both the cluster level and the user level, the sma