Hi all,

I'm trying to get cgroups working with SoGE 8.1.8 on CentOS 7. I have
basic cgroup functionality working in the OS, with the cgred and cgconfig
services enabled.

I modified the "setup-cgroups-etc" script to mount the cgroup cpuset
controller instead of the legacy cpuset filesystem:

        #mount -t cpuset none $cpuset_mnt >/dev/null 2>&1
        mount -t cgroup -ocpuset cpuset $cpuset_mnt
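
To see where the cpuset controller ends up mounted after that change, a
minimal sketch along these lines could be used. The function name
cpuset_mounts is hypothetical, and the sketch only assumes the standard
/proc/mounts layout (device, mount point, filesystem type, options). As far
as I understand, systemd on CentOS 7 normally mounts the controller under
/sys/fs/cgroup/cpuset as well, so more than one line of output would not be
surprising:

```shell
#!/bin/sh
# Hypothetical helper: list every mount point where the cgroup v1 cpuset
# controller (or the legacy cpuset filesystem) is mounted.
# Reads /proc/mounts by default; another file can be passed as $1.
cpuset_mounts() {
    mounts_file=${1:-/proc/mounts}
    awk '
        # Legacy "mount -t cpuset" style mount
        $3 == "cpuset" { print $2; next }
        # cgroup v1 mount: look for "cpuset" among the mount options
        $3 == "cgroup" {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "cpuset") { print $2; break }
        }
    ' "$mounts_file"
}
```

Running cpuset_mounts with no argument on the exec host prints one mount
point per line.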

The "setup-cgroups-etc" script is called by sgeexecd at startup, right
after sge_execd is launched:

      $bin_dir/sge_execd
      /usr/local/sge/util/resources/scripts/setup-cgroups-etc start

After rebooting the test node, /proc/self/cgroup exists, and the proper
directories are created under /dev/cpuset/sge:

-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cgroup.clone_children
--w--w--w- 1 sgeadmin root 0 Aug 10 16:54 cgroup.event_control
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cgroup.procs
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.cpu_exclusive
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.cpus
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.mem_exclusive
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.mem_hardwall
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_migrate
-r--r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_pressure
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_spread_page
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_spread_slab
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.mems
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.sched_load_balance
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.sched_relax_domain_level
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 notify_on_release
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 tasks
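
The listing only shows that the control files exist, not what they contain.
With cgroup v1 cpusets, a task cannot be attached to a cpuset whose
cpuset.cpus or cpuset.mems is empty, so a quick check of those two files
might be worthwhile. This is a sketch, not something taken from the SGE
scripts; the function name check_cpuset_dir is hypothetical and the default
path is the one from the listing above:

```shell
#!/bin/sh
# Hypothetical helper: report whether cpuset.cpus and cpuset.mems are
# populated in a cpuset directory. Returns non-zero if either file is
# missing, unreadable, or empty.
check_cpuset_dir() {
    dir=${1:-/dev/cpuset/sge}
    status=0
    for f in cpuset.cpus cpuset.mems; do
        if [ ! -r "$dir/$f" ]; then
            echo "$f: missing or unreadable"
            status=1
        else
            value=$(cat "$dir/$f")
            if [ -z "$value" ]; then
                echo "$f: EMPTY"
                status=1
            else
                echo "$f: $value"
            fi
        fi
    done
    return $status
}
```

Calling check_cpuset_dir with no argument inspects /dev/cpuset/sge; a job's
own subdirectory can be passed instead.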


qconf -sconf shows:

execd_params                 ENABLE_ADDGRP_KILL=TRUE ENABLE_BINDING=true \
                             USE_CGROUPS
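
Because qconf wraps long values with a trailing backslash, it is easy to
misread which parameters are actually set. A small sketch for flattening the
execd_params entry into one line is below; the function name execd_params is
hypothetical, and it only assumes the "name value \" continuation layout
shown above:

```shell
#!/bin/sh
# Hypothetical helper: read "qconf -sconf" output on stdin and print the
# execd_params value on a single line, joining backslash continuations.
execd_params() {
    awk '
        /^execd_params/ { collecting = 1; sub(/^execd_params[ \t]+/, "") }
        collecting {
            # sub() returns 1 if a trailing backslash was found and removed
            cont = sub(/[ \t]*\\[ \t]*$/, "")
            gsub(/^[ \t]+/, "")
            line = line (line ? " " : "") $0
            if (!cont) { print line; exit }
        }
    '
}
```

Usage would be something like: qconf -sconf | execd_params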


The problem is that when we submit a job, the queue on that node goes
into an error state, and the sge messages for that node show:


08/15/2016 14:50:55|  main|moose11|E|shepherd of job 222673.1 died through signal = 6
08/15/2016 14:50:55|  main|moose11|E|abnormal termination of shepherd for job 222673.1: no "exit_status" file
08/15/2016 14:50:55|  main|moose11|E|can't open file active_jobs/222673.1/error: No such file or directory
08/15/2016 14:50:55|  main|moose11|E|can't open pid file "active_jobs/222673.1/pid" for job 222673.1
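
Signal 6 is SIGABRT on Linux, so the shepherd aborted rather than exiting
normally. To pull every error-level entry for one job out of a messages
file, a sketch like the following could be used. The function name
job_errors is hypothetical, the location of the messages file depends on the
execd spool directory (not shown here), and the sketch assumes the
pipe-separated layout seen above with one entry per line:

```shell
#!/bin/sh
# Hypothetical helper: print error-level ("E") lines mentioning a job.
#   $1 = job identifier, e.g. 222673.1
#   $2 = path to the execd messages file
# Note: this is a plain substring match, so 222673.1 also matches 222673.10.
job_errors() {
    job=$1
    messages_file=$2
    awk -F'|' -v job="$job" '$4 == "E" && index($0, job) > 0' "$messages_file"
}
```

For example: job_errors 222673.1 /path/to/execd/spool/messages, where the
path is a placeholder for the node's actual spool location.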


Thoughts?

-Dj

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
