On 12/02/2015 03:02 PM, bseg...@google.com wrote:
Waiman Long<waiman.l...@hpe.com>  writes:

If a system with a large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

   10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
    9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
    8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
    8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
    8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
    6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
    5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq & se which are accessed rather
frequently during clock tick processing.

This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline aligned kmemcache for task_group to make sure
that all the allocated task_group's are cacheline aligned.

By doing so, the perf profile became:

    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares

The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

   Before patch - Max-jOPs: 907533    Critical-jOps: 134877
   After patch  - Max-jOPs: 916011    Critical-jOps: 142366

Signed-off-by: Waiman Long<waiman.l...@hpe.com>
---
  kernel/sched/core.c  |   36 ++++++++++++++++++++++++++++++++++--
  kernel/sched/sched.h |    7 ++++++-
  2 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d568ac..e39204f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7331,6 +7331,11 @@ int in_sched_functions(unsigned long addr)
   */
  struct task_group root_task_group;
  LIST_HEAD(task_groups);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/* Cacheline aligned slab cache for task_group */
+static struct kmem_cache *task_group_cache __read_mostly;
+#endif
  #endif

  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -7356,6 +7361,7 @@ void __init sched_init(void)
                root_task_group.cfs_rq = (struct cfs_rq **)ptr;
                ptr += nr_cpu_ids * sizeof(void **);

+               task_group_cache = KMEM_CACHE(task_group, SLAB_HWCACHE_ALIGN);
The KMEM_CACHE macro suggests adding ____cacheline_aligned_in_smp to
the struct definition instead.

The main goal is to have load_avg placed in a new cacheline separated from the read-only fields above it. That is why I placed ____cacheline_aligned after load_avg. I omitted the in_smp part because the field is already inside the SMP block. Putting ____cacheline_aligned_in_smp on the struct definition won't guarantee alignment of any field within the structure.

I have done some tests, and having ____cacheline_aligned on a field inside the structure also has the effect of forcing the whole structure onto a cacheline-aligned boundary.
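
The effect described above can be checked outside the kernel. The following is a minimal userspace C11 sketch, not the kernel code: struct tg_demo, DEMO_CACHELINE and demo_load_avg_is_line_aligned() are hypothetical names, the 64-byte line size is an assumption, and alignas() stands in for ____cacheline_aligned.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_CACHELINE 64	/* assumption: 64-byte cachelines */

/* Hypothetical, simplified stand-in for the hot part of struct task_group. */
struct tg_demo {
	void *se;
	void *cfs_rq;
	unsigned long shares;
	/* userspace analogue of "atomic_long_t load_avg ____cacheline_aligned" */
	alignas(DEMO_CACHELINE) long load_avg;
};

/* Aligning one member raises the alignment of the whole structure ... */
static_assert(alignof(struct tg_demo) == DEMO_CACHELINE,
	      "struct inherits the member's alignment");
/* ... and the compiler pads so that the member starts a new cacheline. */
static_assert(offsetof(struct tg_demo, load_avg) % DEMO_CACHELINE == 0,
	      "load_avg starts on a cacheline boundary");

/*
 * The layout only helps if the allocator honors the structure's alignment.
 * Returns 1 if load_avg of a heap object sits on a cacheline boundary,
 * 0 if it does not, -1 if the allocation failed.
 */
static int demo_load_avg_is_line_aligned(void)
{
	struct tg_demo *tg = aligned_alloc(alignof(struct tg_demo), sizeof(*tg));
	int ok;

	if (!tg)
		return -1;
	ok = ((uintptr_t)&tg->load_avg % DEMO_CACHELINE) == 0;
	free(tg);
	return ok;
}
```

Here aligned_alloc() plays the role that the dedicated slab cache plays in the patch: it hands out objects at the alignment the structure asks for, which a general-purpose allocation is not guaranteed to do.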

  #endif /* CONFIG_FAIR_GROUP_SCHED */
  #ifdef CONFIG_RT_GROUP_SCHED
                root_task_group.rt_se = (struct sched_rt_entity **)ptr;
@@ -7668,12 +7674,38 @@ void set_curr_task(int cpu, struct task_struct *p)
  /* task_group_lock serializes the addition/removal of task groups */
  static DEFINE_SPINLOCK(task_group_lock);

+/*
+ * Make sure that the task_group structure is cacheline aligned when
+ * fair group scheduling is enabled.
+ */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct task_group *alloc_task_group(void)
+{
+       return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
+}
+
+static inline void free_task_group(struct task_group *tg)
+{
+       kmem_cache_free(task_group_cache, tg);
+}
+#else /* CONFIG_FAIR_GROUP_SCHED */
+static inline struct task_group *alloc_task_group(void)
+{
+       return kzalloc(sizeof(struct task_group), GFP_KERNEL);
+}
+
+static inline void free_task_group(struct task_group *tg)
+{
+       kfree(tg);
+}
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
  static void free_sched_group(struct task_group *tg)
  {
        free_fair_sched_group(tg);
        free_rt_sched_group(tg);
        autogroup_free(tg);
-       kfree(tg);
+       free_task_group(tg);
  }

  /* allocate runqueue etc for a new task group */
@@ -7681,7 +7713,7 @@ struct task_group *sched_create_group(struct task_group *parent)
  {
        struct task_group *tg;

-       tg = kzalloc(sizeof(*tg), GFP_KERNEL);
+       tg = alloc_task_group();
        if (!tg)
                return ERR_PTR(-ENOMEM);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efd3bfc..e679895 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -248,7 +248,12 @@ struct task_group {
        unsigned long shares;

  #ifdef        CONFIG_SMP
-       atomic_long_t load_avg;
+       /*
+        * load_avg can be heavily contended at clock tick time, so put
+        * it in its own cacheline separated from the fields above which
+        * will also be accessed at each tick.
+        */
+       atomic_long_t load_avg ____cacheline_aligned;
  #endif
  #endif
I suppose the question is if it would be better to just move this to
wind up on a separate cacheline without the extra empty space, though it
would likely be more fragile and unclear.

I have been thinking about that too. The problem is that anything in the same cacheline as load_avg that has to be accessed at clock tick time will cause the same contention problem. In the current layout, the fields after load_avg are the rt stuff as well as some list head structures and pointers. The rt stuff should be more or less mutually exclusive with the CFS load_avg in terms of usage. The list head structures and pointers don't seem to be accessed that frequently. So it is the right place to start a new cacheline boundary.
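
To illustrate the layout argument, here is a rough userspace C11 sketch that reports which cacheline each field lands on. The structure is a hypothetical, heavily simplified imitation of task_group (field set, order and the 64-byte line size are assumptions for illustration only), and the DEMO_* names and demo_shares_line_with_load_avg() are not kernel identifiers.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define DEMO_LINE 64	/* assumption: 64-byte cachelines */

struct demo_list_head {
	struct demo_list_head *next, *prev;
};

/* Hypothetical, simplified layout; the real struct has more fields. */
struct tg_layout_demo {
	/* read at every tick */
	void *se;
	void *cfs_rq;
	unsigned long shares;
	/* written at tick time, so it starts a new cacheline */
	alignas(DEMO_LINE) long load_avg;
	/* rarely touched at tick time: rt pointers, list head, parent */
	void *rt_se;
	void *rt_rq;
	struct demo_list_head list;
	struct tg_layout_demo *parent;
};

/* Index of the cacheline (relative to the struct start) holding a field. */
#define DEMO_LINE_OF(field) \
	(offsetof(struct tg_layout_demo, field) / DEMO_LINE)

/* The tick-time read-only fields must not share a line with load_avg. */
static_assert(DEMO_LINE_OF(shares) != DEMO_LINE_OF(load_avg),
	      "shares and load_avg are on different cachelines");

/* Returns 1 if the field at field_offset shares a cacheline with load_avg. */
static int demo_shares_line_with_load_avg(size_t field_offset)
{
	return field_offset / DEMO_LINE == DEMO_LINE_OF(load_avg);
}
```

In this sketch the rt pointers and the list head end up on the same line as load_avg, which is acceptable only under the assumption stated above, namely that they are not accessed at every tick.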

Cheers,
Longman