On 12/08/2017 09:54 AM, Mike Cammilleri wrote:
Hi,
We have allowed some courses to use our Slurm cluster for teaching purposes,
which of course leads to all kinds of exciting experiments - not always the
most clever programming, but it certainly teaches me where we need to tighten
up our configuration.
The default method of thinking for many students just starting out is to grab
as much CPU as possible - not fully understanding cluster computing and batch
scheduling. One example I see often is students using the R parallel package
and calling detectCores(), which of course returns all the cores Linux reports
on the node. They also did not specify --ntasks, so Slurm assigns 1 of course -
but there is no check on the ballooning of R processes created with the
detectCores() count and whatever they're doing with that number. Now we have
overloaded nodes.
I see that availableCores() is suggested as a friendlier method for shared
resources like this, since it returns the number of cores that were actually
assigned by Slurm. A student using the parallel package would then need to
explicitly specify the number of cores in their submit file. This would be
nice IF students voluntarily used availableCores() instead of detectCores(),
but we know that's not really enforceable.
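For what it's worth, the difference is easy to demonstrate from inside a job.
A minimal sketch of a submit script (availableCores() comes from the parallelly
package, which would have to be installed on the nodes):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
# On a 28-core node, detectCores() still sees the whole machine...
Rscript -e 'parallel::detectCores()'        # prints 28
# ...while availableCores() honors SLURM_CPUS_PER_TASK.
Rscript -e 'parallelly::availableCores()'   # prints 4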
I thought cgroups (which we are using) would prevent some of this behavior on
the nodes (we are constraining CPU and RAM) - I'd like there to be no I/O wait
times if possible. I would like it if either Linux or Slurm could constrain a
job from grabbing more cores than it was assigned at submit time. Is there
something else I should be configuring to safeguard against this behavior? If
Slurm assigns 1 CPU to the task, then no matter what craziness is in the code,
1 is all they're getting. Possible?
Thanks for any insight!
--mike
Sounds like you are looking for CPU affinity. We use the feature to limit
jobs to the cores they request. The job can still see the other cores (e.g.
by looking at /proc/cpuinfo), but if it tries to use them, its processes will
not get scheduled on any cores other than the ones assigned to the job.
# cat /etc/slurm/slurm.conf
...
TaskPlugin=task/cgroup,task/affinity
TaskPluginParam=sched,autobind=cores
# cat /etc/slurm/cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes
$ taskset -c -p $$ # within a job requesting 4 cores on a 28-core node
pid 29247's current affinity list: 15,17,19,21
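A quick way to see the same thing from inside the job - nproc honors the
affinity mask, while /proc/cpuinfo does not:

$ nproc                              # 4: only the cores bound to this job
$ grep -c ^processor /proc/cpuinfo   # 28: the whole node is still visible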
--
Jeff White
HPC Systems Engineer - ITS
Questions about or help with Kamiak? Please submit a Service Request
<https://hpc.wsu.edu/support/service-requests/>.