On Mon, Aug 15, 2016 at 03:25:06PM -0400, s...@deej.net wrote:
> Hi all,
> Trying to get cgroups working with SoGE 8.1.8 and Centos 7. I have the
> basic cgroup functionality working in the OS, cgred and cgconfig
> services enabled,
>
> Modified the "setup-cgroups-etc" script to use cgroups instead of cpuset:
>
> #mount -t cpuset none $cpuset_mnt >/dev/null 2>&1
> mount -t cgroup -ocpuset cpuset $cpuset_mnt
>
> The "setup-cgroups-etc" is being called by the sgeexecd at startup:
>
> $bin_dir/sge_execd
> /usr/local/sge/util/resources/scripts/setup-cgroups-etc start
>
> After rebooting the test node /proc/self/cgroup exists, and the proper
> directories are being created under /dev/cpuset/sge:
>
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cgroup.clone_children
> --w--w--w- 1 sgeadmin root 0 Aug 10 16:54 cgroup.event_control
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cgroup.procs
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.cpu_exclusive
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.cpus
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.mem_exclusive
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.mem_hardwall
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_migrate
> -r--r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_pressure
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_spread_page
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_spread_slab
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.mems
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.sched_load_balance
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.sched_relax_domain_level
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 notify_on_release
> -rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 tasks
>
>
> qconf -sconf shows:
>
> execd_params ENABLE_ADDGRP_KILL=TRUE ENABLE_BINDING=true \
> USE_CGROUPS
>
>
> The problem is that when we submit a job, the queue on that node goes
> into an error state, and the sge messages for that node show:
>
>
> 08/15/2016 14:50:55| main|moose11|E|shepherd of job 222673.1 died
> through signal = 6
> 08/15/2016 14:50:55| main|moose11|E|abnormal termination of shepherd
> for job 222673.1: no "exit_status" file
> 08/15/2016 14:50:55| main|moose11|E|can't open file
> active_jobs/222673.1/error: No such file or directory
> 08/15/2016 14:50:55| main|moose11|E|can't open pid file
> "active_jobs/222673.1/pid" for job 222673.1
>
>
> Thoughts?

You may need to upgrade to 8.1.9; IIRC there were some cgroup/cpuset fixes there.
I'm not sure ENABLE_ADDGRP_KILL=TRUE is compatible with USE_CGROUPS, as they both provide ways to find the processes that belong to a job and kill them. Try using just USE_CGROUPS.

Also, is this job serial or parallel? IIRC there are bugs in the SGE cgroup support with some parallel libraries.

William
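
For anyone trying the "just USE_CGROUPS" suggestion, here is a rough sketch of one way to test it. It assumes execd_params is set in the global configuration, as the quoted qconf -sconf output suggests. The helper strip_addgrp_kill is hypothetical, not part of SGE, and only previews the edited line; the real change still has to be made through qconf. The '*@moose11' queue pattern is also an assumption, based on the host name in the log.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): remove ENABLE_ADDGRP_KILL=<value>
# from an execd_params line read on stdin, leaving other parameters alone.
strip_addgrp_kill() {
    sed -e 's/[[:space:]]*ENABLE_ADDGRP_KILL=[^[:space:]]*//'
}

# Preview the edited value, using the setting quoted in the original post
# (with the backslash line continuation joined into one line).
current='execd_params ENABLE_ADDGRP_KILL=TRUE ENABLE_BINDING=true USE_CGROUPS'
printf '%s\n' "$current" | strip_addgrp_kill
# prints: execd_params ENABLE_BINDING=true USE_CGROUPS

# To apply it for real (commented out; needs a live cluster and admin rights):
#   qconf -sconf          # show the current global configuration
#   qconf -mconf          # edit execd_params in $EDITOR
#   qmod -c '*@moose11'   # clear the error state on that node's queues (pattern assumed)
```

After the change, resubmitting a simple serial job first should show whether the shepherd still dies, before any parallel libraries get involved.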
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users