Hi Chris, thanks for following up on this thread.
First of all, you will want to use cgroups to ensure that processes that do not request GPUs cannot access them.
We had a feeling that cgroups might be the better approach. Could you point us to documentation that says cgroups are a requirement?
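For context, here is roughly what we understand the cgroup-based confinement to involve; this is a sketch from our reading of the slurm.conf and cgroup.conf man pages, not a configuration we have validated end-to-end:

    # slurm.conf (excerpt)
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup

    # cgroup.conf
    ConstrainDevices=yes   # jobs can only open the GRES devices they were allocated

As we understand it, with ConstrainDevices=yes the task/cgroup plugin uses the devices controller, so /dev/nvidia* files outside a job's GRES allocation become inaccessible to its tasks.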
Secondly, do your CPUs have hyperthreading enabled by any chance? If so, your gres.conf is likely wrong, as you'll want to list only the first HT on each core that you want to restrict access to.
No HT involved here at any point, neither on our cluster nor within the dockerized Slurm installation I was playing with.
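For reference, our gres.conf follows the usual one-line-per-device pattern; the snippet below is illustrative only, with node names, device files, and core ranges as placeholders rather than our actual layout:

    # gres.conf (placeholders, not our real topology)
    NodeName=gpunode[01-02] Name=gpu File=/dev/nvidia0 Cores=0-7
    NodeName=gpunode[01-02] Name=gpu File=/dev/nvidia1 Cores=8-15

The Cores= values use Slurm's abstract core numbering, which is exactly what the manpage passage you quote below is about.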
From the manual page for gres.conf:
    NOTE: If your cores contain multiple threads only list the first thread
    of each core. The logic is such that it uses core instead of thread
    scheduling per GRES. Also note that since Slurm must be able to perform
    resource management on heterogeneous clusters having various core ID
    numbering schemes, an abstract index will be used instead of the
    physical core index. That abstract id may not correspond to your
    physical core number. Basically Slurm starts numbering from 0 to n,
    being 0 the id of the first processing unit (core or thread if HT is
    enabled) on the first socket, first core and maybe first thread, and
    then continuing sequentially to the next thread, core, and socket. The
    numbering generally coincides with the processing unit logical number
    (PU L#) seen in lstopo output.
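As a side note for the archives: the PU L# numbering mentioned at the end of that passage is what hwloc's lstopo prints. A compact way to list just the processing units, assuming hwloc is installed:

    $ lstopo-no-graphics --only pu
    PU L#0 (P#0)
    PU L#1 (P#1)
    ...

(The output above is illustrative, not from our nodes.)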
We are aware of this section of the manpage, thanks.

Best,
Peter
