On 12/08/2017 09:54 AM, Mike Cammilleri wrote:
Hi,
We have allowed some courses to use our Slurm cluster for teaching purposes,
which of course leads to all kinds of exciting experiments - not always the
most clever programming, but it certainly teaches me where we need to tighten
up our configuration.
The default method of thinking for many students just starting out is to grab
as much CPU as possible - not fully understanding cluster computing and batch
scheduling. One example I see often is students using the R parallel package
and calling detectCores(), which of course returns all the cores Linux reports
on the node. They also did not specify --ntasks, so Slurm assigns 1 of course -
but there is no check on the ballooning of R processes created with the
detectCores() count and whatever they're doing with that number. Now we have
overloaded nodes.
I see that availableCores() is suggested as a friendlier method for shared
resources like this, since it returns the number of cores that were actually
assigned by Slurm. A student using the parallel package would then need to
explicitly specify the number of cores in their submit file. This would be
nice IF students voluntarily used availableCores() instead of detectCores(),
but we know that's not really enforceable.
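For what it's worth, the difference is easy to demonstrate from inside a job.
A minimal sketch of a submit script (availableCores() comes from the parallelly
package, which would have to be installed on the nodes):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
# On a 28-core node, detectCores() still sees the whole machine...
Rscript -e 'parallel::detectCores()'        # prints 28
# ...while availableCores() honors SLURM_CPUS_PER_TASK.
Rscript -e 'parallelly::availableCores()'   # prints 4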
I thought cgroups (which we are using) would prevent some of this behavior on
the nodes (we are constraining CPU and RAM) - I'd like there to be no I/O wait
times if possible. I would like it if either Linux or Slurm could constrain a
job from grabbing more cores than it was assigned at submit time. Is there
something else I should be configuring to safeguard against this behavior? If
Slurm assigns 1 CPU to the task, then no matter what craziness is in the code,
1 is all they're getting. Possible?
Thanks for any insight!
--mike
Sounds like you are looking for CPU affinity. We use the feature to limit
jobs to the cores they request. The job can still see the other cores (e.g.
by looking at /proc/cpuinfo), but if it tries to use them, its processes will
not get scheduled on any cores other than the ones assigned to the job.
# cat /etc/slurm/slurm.conf
...
TaskPlugin=task/cgroup,task/affinity
TaskPluginParam=sched,autobind=cores
# cat /etc/slurm/cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes
$ taskset -c -p $$ # within a job requesting 4 cores on a 28-core node
pid 29247's current affinity list: 15,17,19,21
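A quick way to see the same thing from inside the job - nproc honors the
affinity mask, while /proc/cpuinfo does not:

$ nproc                              # 4: only the cores bound to this job
$ grep -c ^processor /proc/cpuinfo   # 28: the whole node is still visible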
--
Jeff White
HPC Systems Engineer - ITS
Questions about or help with Kamiak? Please submit a Service Request
<https://hpc.wsu.edu/support/service-requests/>.