On Mon, Aug 15, 2016 at 03:25:06PM -0400, s...@deej.net wrote:
> Hi all,
>       Trying to get cgroups working with SoGE 8.1.8 and Centos 7.  I have the
> basic cgroup functionality working in the OS, cgred and cgconfig
> services enabled,
> 
> Modified the "setup-cgroups-etc" script to use cgroups instead of cpuset:
> 
>         #mount -t cpuset none $cpuset_mnt >/dev/null 2>&1
>         mount -t cgroup -ocpuset cpuset $cpuset_mnt
> 
> The "setup-cgroups-etc" is being called by the sgeexecd at startup:
> 
>       $bin_dir/sge_execd
>       /usr/local/sge/util/resources/scripts/setup-cgroups-etc start
> 
> After rebooting the test node /proc/self/cgroup exists, and the proper
> directories are being created under /dev/cpuset/sge:
> 
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cgroup.clone_children
> --w--w--w- 1 sgeadmin root 0 Aug 10 16:54 cgroup.event_control
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cgroup.procs
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.cpu_exclusive
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.cpus
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.mem_exclusive
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.mem_hardwall
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_migrate
> -r--r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_pressure
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_spread_page
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_spread_slab
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.mems
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.sched_load_balance
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.sched_relax_domain_level
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 notify_on_release
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 tasks
> 
> 
> qconf -sconf shows:
> 
> execd_params                 ENABLE_ADDGRP_KILL=TRUE ENABLE_BINDING=true \
>                              USE_CGROUPS
> 
> 
> The problem is that when we submit a job, the queue on that node goes
> into an error state, and the sge messages for that node show:
> 
> 
> 08/15/2016 14:50:55|  main|moose11|E|shepherd of job 222673.1 died
> through signal = 6
> 08/15/2016 14:50:55|  main|moose11|E|abnormal termination of shepherd
> for job 222673.1: no "exit_status" file
> 08/15/2016 14:50:55|  main|moose11|E|can't open file
> active_jobs/222673.1/error: No such file or directory
> 08/15/2016 14:50:55|  main|moose11|E|can't open pid file
> "active_jobs/222673.1/pid" for job 222673.1
> 
> 
> Thoughts?
You may need to upgrade to 8.1.9; IIRC there were some cgroup/cpuset fixes
there.

I'm not sure whether ENABLE_ADDGRP_KILL=TRUE is compatible with USE_CGROUPS,
as they both provide ways to find the processes that belong to a job and
kill them.  Try using just USE_CGROUPS.
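
To make that concrete, here is a rough sketch of previewing execd_params
without the ENABLE_ADDGRP_KILL entry.  The helper name strip_addgrp_kill is
made up for illustration; it only filters text and changes nothing in the
cluster.  The actual change would still be made by hand, e.g. via
"qconf -mconf".

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): drop the ENABLE_ADDGRP_KILL=...
# token from configuration text read on stdin, leaving everything else
# (ENABLE_BINDING, USE_CGROUPS, continuation backslashes) untouched.
strip_addgrp_kill() {
    sed -E 's/ENABLE_ADDGRP_KILL=[^[:space:]]+[[:space:]]*//'
}

# Possible use, as a preview only:
#   qconf -sconf | strip_addgrp_kill | grep -A1 execd_params
```

With the settings quoted above, the execd_params entry would then contain
only ENABLE_BINDING=true and USE_CGROUPS.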

Also, is this job a serial one or a parallel one?  IIRC there are bugs in
the SGE cgroup support WRT some parallel libraries.
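
Separately from the serial/parallel question, it may be worth confirming
that the modified mount line in setup-cgroups-etc really produced a cgroup
filesystem with the cpuset controller at /dev/cpuset.  A minimal sketch,
assuming the usual /proc/mounts layout (device, mount point, type, options);
the function name cpuset_cgroup_mounts is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print every mount point whose filesystem type is
# "cgroup" and whose option list includes the cpuset controller.
# Reads /proc/mounts by default; another file in the same format can be
# passed as the first argument.
cpuset_cgroup_mounts() {
    awk '$3 == "cgroup" && $4 ~ /(^|,)cpuset(,|$)/ { print $2 }' \
        "${1:-/proc/mounts}"
}

# Possible use on the exec host:
#   cpuset_cgroup_mounts
```

If /dev/cpuset is missing from the output, the old "-t cpuset" mount (or no
mount at all) is probably still in effect on that node.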


William

