Hi all,

I'm trying to get cgroups working with SoGE 8.1.8 and CentOS 7. I have the basic cgroup functionality working in the OS, with the cgred and cgconfig services enabled.

I modified the "setup-cgroups-etc" script to use cgroups instead of cpuset:

    #mount -t cpuset none $cpuset_mnt >/dev/null 2>&1
    mount -t cgroup -ocpuset cpuset $cpuset_mnt

The "setup-cgroups-etc" script is being called by the sgeexecd at startup:

    $bin_dir/sge_execd
    /usr/local/sge/util/resources/scripts/setup-cgroups-etc start

After rebooting the test node, /proc/self/cgroup exists, and the proper directories are being created under /dev/cpuset/sge:

    -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cgroup.clone_children
    --w--w--w- 1 sgeadmin root 0 Aug 10 16:54 cgroup.event_control
    -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cgroup.procs
    -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.cpu_exclusive
    -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.cpus
    -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.mem_exclusive
    -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.mem_hardwall
    -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_migrate
    -r--r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_pressure
    -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_spread_page
    -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_spread_slab
    -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.mems
    -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.sched_load_balance
    -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.sched_relax_domain_level
    -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 notify_on_release
    -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 tasks

qconf -sconf shows:

    execd_params                 ENABLE_ADDGRP_KILL=TRUE ENABLE_BINDING=true \
                                 USE_CGROUPS

The problem is that when we submit a job, the queue on that node goes into an error state, and the SGE messages file for that node shows:

    08/15/2016 14:50:55|  main|moose11|E|shepherd of job 222673.1 died through signal = 6
    08/15/2016 14:50:55|  main|moose11|E|abnormal termination of shepherd for job 222673.1: no "exit_status" file
    08/15/2016 14:50:55|  main|moose11|E|can't open file active_jobs/222673.1/error: No such file or directory
    08/15/2016 14:50:55|  main|moose11|E|can't open pid file "active_jobs/222673.1/pid" for job 222673.1

Thoughts?

-Dj

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
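
A possible first check for the setup described above is to see where the cpuset controller is actually mounted on the exec host, since CentOS 7 with systemd normally mounts the cgroup v1 controllers under /sys/fs/cgroup in addition to whatever the modified script mounts at $cpuset_mnt. The sketch below is only a diagnostic aid and does not by itself explain the shepherd abort; the function name cpuset_mounts is hypothetical and not part of SGE, and the only assumption is the standard /proc/mounts field layout (device, mount point, filesystem type, options).

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): print every mount point where the
# cpuset controller is mounted, reading a table in /proc/mounts format:
#   device mountpoint fstype options dump pass
cpuset_mounts() {
    mounts_file=${1:-/proc/mounts}
    # Print nothing if the table cannot be read (e.g. non-Linux host).
    [ -r "$mounts_file" ] || return 0
    awk '
        # Legacy "cpuset" filesystem type (the old "mount -t cpuset" form).
        $3 == "cpuset" { print $2; next }
        # cgroup v1 hierarchy: look for "cpuset" among the mount options.
        $3 == "cgroup" {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "cpuset") { print $2; break }
        }
    ' "$mounts_file"
}
```

Calling cpuset_mounts with no argument reads /proc/mounts on the node; passing a file name lets the same check run against a saved copy of the mount table from another host.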