JFYI: I've found some more issues while hammering on this further. Please
ignore this and the follow-up patch for now. If others are OK with the
cleanups preceding this patch, I will repost them with changes based on
the feedback so far and let them merge into the mm tree before I settle
this much trickier part.

On Wed 08-07-15 14:27:51, Michal Hocko wrote:
> From: Michal Hocko <mho...@suse.cz>
> 
> mm_struct::owner keeps track of the task which is in charge of the
> specific mm. This is usually the thread group leader of the process but
> there are exotic cases where this doesn't hold.
> 
> The most prominent one is when separate tasks (not in the same thread
> group) share the address space (by using clone with CLONE_VM without
> CLONE_THREAD). The first task will be the owner until it exits.
> mm_update_next_owner will then try to find a new owner - a task which
> points to the same mm_struct. There is no guarantee a new owner will
> be a thread group leader though because the leader for that thread
> group might have exited. Even though such a thread will still be around
> waiting for the remaining threads from its group, its mm will be NULL
> so it cannot be chosen.
> 
> The cgroup migration code, however, assumes only group leaders when
> migrating via cgroup.procs (which will be the only mode in the unified
> hierarchy API) while mem_cgroup_can_attach considers only those tasks
> which are owners of the mm. So we might end up with tasks which cannot
> be migrated.
> mm_update_next_owner could be tweaked to try harder and use a group
> leader whenever possible but this will never be 100% because all the
> leaders might be dead. Getting rid of mm->owner seems like a better
> and less hacky option.
> 
> The whole concept of the mm owner is a bit artificial and too tricky to
> get right. All the memcg code needs is to find struct mem_cgroup from
> a given mm_struct and there are only two events when the association
> is either built or changed:
>       - a new mm is created - dup_mm resp. exec_mmap - when the memcg
>         is inherited from the oldmm
>       - task associated with the mm is moved to another memcg
> So it is much easier to bind mm_struct with the mem_cgroup directly
> rather than indirectly via a task. This is exactly what this patch does.
> 
> mm_inherit_memcg and mm_drop_memcg are exported for the core kernel
> to bind an old memcg during dup_mm (fork) resp. exec_mmap (exec) and
> to release that memcg in mmput after the last reference is dropped and no
> task sees the mm anymore. We have to be careful and take a reference to
> the memcg->css so that it doesn't vanish from under our feet.
> 
> The only remaining part is to catch task migration and change the
> association. This is done in mem_cgroup_move_task before charges get
> moved because mem_cgroup_can_attach is too early and other controllers
> might fail and we would have to handle the rollback.
> 
> mm->memcg conforms to standard mem_cgroup locking rules. It has to be
> used inside rcu_read_{un}lock() and a reference has to be taken before the
> unlock if the memcg is supposed to be used outside.
> 
> Finally mem_cgroup_can_attach will allow task migration only for the
> thread group leaders to conform with cgroup core requirements.
> 
> Please note that this patch introduces a USER VISIBLE CHANGE OF BEHAVIOR.
> Without mm->owner _all_ tasks (group leaders to be precise) associated
> with the mm_struct would initiate memcg migration while previously
> only owner of the mm_struct could do that. The original behavior was
> awkward though because the user task didn't have any means to find out
> the current owner (esp. after mm_update_next_owner) so the migration
> behavior was not well defined in general.
> The new cgroup API (unified hierarchy) will discontinue the tasks cgroup
> file, which means that migrating threads will no longer be possible. In
> such a case having CLONE_VM without CLONE_THREAD could emulate the thread
> behavior but this patch prevents isolating memcg controllers from
> others. Nevertheless I am not convinced such a use case would really
> deserve complications on the memcg code side.
> 
> Suggested-by: Oleg Nesterov <o...@redhat.com>
> Signed-off-by: Michal Hocko <mho...@suse.cz>
> ---
>  fs/exec.c                  |   2 +-
>  include/linux/memcontrol.h |  58 ++++++++++++++++++++++++--
>  include/linux/mm_types.h   |  12 +-----
>  kernel/exit.c              |  89 ---------------------------------------
>  kernel/fork.c              |  10 +----
>  mm/debug.c                 |   4 +-
>  mm/memcontrol.c            | 101 ++++++++++++++++++++++++++++-----------------
>  7 files changed, 123 insertions(+), 153 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 1977c2a553ac..3ed9c0abc9f5 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -870,7 +870,7 @@ static int exec_mmap(struct mm_struct *mm)
>               up_read(&old_mm->mmap_sem);
>               BUG_ON(active_mm != old_mm);
>               setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
> -             mm_update_next_owner(old_mm);
> +             mm_inherit_memcg(mm, old_mm);
>               mmput(old_mm);
>               return 0;
>       }
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 78e9d4ac57a1..8e6b2444ebfe 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -274,6 +274,52 @@ struct mem_cgroup {
>  extern struct cgroup_subsys_state *mem_cgroup_root_css;
>  
>  /**
> + * __mm_set_memcg - Set mm_struct::memcg to a given memcg.
> + * @mm: mm struct
> + * @memcg: mem_cgroup to be used
> + *
> + * Note that this function doesn't clean up the previous mm->memcg.
> + * This should be done by caller when necessary (e.g. when moving
> + * mm from one memcg to another).
> + */
> +static inline
> +void __mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
> +{
> +     if (memcg)
> +             css_get(&memcg->css);
> +     rcu_assign_pointer(mm->memcg, memcg);
> +}
> +
> +/**
> + * mm_inherit_memcg - Initialize mm_struct::memcg from an existing mm_struct
> + * @newmm: new mm struct
> + * @oldmm: old mm struct to inherit from
> + *
> + * Should be called for each new mm_struct.
> + */
> +static inline
> +void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
> +{
> +     struct mem_cgroup *memcg = oldmm->memcg;
> +
> +     __mm_set_memcg(newmm, memcg);
> +}
> +
> +/**
> + * mm_drop_memcg - drop mm_struct::memcg association
> + * @mm: mm struct
> + *
> + * Should be called after the mm has been removed from all tasks
> + * and before it is freed (e.g. from mmput)
> + */
> +static inline void mm_drop_memcg(struct mm_struct *mm)
> +{
> +     if (mm->memcg)
> +             css_put(&mm->memcg->css);
> +     mm->memcg = NULL;
> +}
> +
> +/**
>   * mem_cgroup_events - count memory events against a cgroup
>   * @memcg: the memory cgroup
>   * @idx: the event index
> @@ -305,7 +351,6 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
>  bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
>  
>  struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
> -struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>  
>  struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
>  static inline
> @@ -335,7 +380,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
>       bool match = false;
>  
>       rcu_read_lock();
> -     task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> +     task_memcg = rcu_dereference(mm->memcg);
>       if (task_memcg)
>               match = mem_cgroup_is_descendant(task_memcg, memcg);
>       rcu_read_unlock();
> @@ -474,7 +519,7 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
>               return;
>  
>       rcu_read_lock();
> -     memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> +     memcg = rcu_dereference(mm->memcg);
>       if (unlikely(!memcg))
>               goto out;
>  
> @@ -498,6 +543,13 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>  #else /* CONFIG_MEMCG */
>  struct mem_cgroup;
>  
> +static inline void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
> +{
> +}
> +static inline void mm_drop_memcg(struct mm_struct *mm)
> +{
> +}
> +
>  static inline void mem_cgroup_events(struct mem_cgroup *memcg,
>                                    enum mem_cgroup_events_index idx,
>                                    unsigned int nr)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index f6266742ce1f..93dc8cb9c636 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -426,17 +426,7 @@ struct mm_struct {
>       struct kioctx_table __rcu       *ioctx_table;
>  #endif
>  #ifdef CONFIG_MEMCG
> -     /*
> -      * "owner" points to a task that is regarded as the canonical
> -      * user/owner of this mm. All of the following must be true in
> -      * order for it to be changed:
> -      *
> -      * current == mm->owner
> -      * current->mm != mm
> -      * new_owner->mm == mm
> -      * new_owner->alloc_lock is held
> -      */
> -     struct task_struct __rcu *owner;
> +     struct mem_cgroup __rcu *memcg;
>  #endif
>  
>       /* store ref to file /proc/<pid>/exe symlink points to */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 185752a729f6..339554612677 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -292,94 +292,6 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
>       }
>  }
>  
> -#ifdef CONFIG_MEMCG
> -/*
> - * A task is exiting.   If it owned this mm, find a new owner for the mm.
> - */
> -void mm_update_next_owner(struct mm_struct *mm)
> -{
> -     struct task_struct *c, *g, *p = current;
> -
> -retry:
> -     /*
> -      * If the exiting or execing task is not the owner, it's
> -      * someone else's problem.
> -      */
> -     if (mm->owner != p)
> -             return;
> -     /*
> -      * The current owner is exiting/execing and there are no other
> -      * candidates.  Do not leave the mm pointing to a possibly
> -      * freed task structure.
> -      */
> -     if (atomic_read(&mm->mm_users) <= 1) {
> -             mm->owner = NULL;
> -             return;
> -     }
> -
> -     read_lock(&tasklist_lock);
> -     /*
> -      * Search in the children
> -      */
> -     list_for_each_entry(c, &p->children, sibling) {
> -             if (c->mm == mm)
> -                     goto assign_new_owner;
> -     }
> -
> -     /*
> -      * Search in the siblings
> -      */
> -     list_for_each_entry(c, &p->real_parent->children, sibling) {
> -             if (c->mm == mm)
> -                     goto assign_new_owner;
> -     }
> -
> -     /*
> -      * Search through everything else, we should not get here often.
> -      */
> -     for_each_process(g) {
> -             if (g->flags & PF_KTHREAD)
> -                     continue;
> -             for_each_thread(g, c) {
> -                     if (c->mm == mm)
> -                             goto assign_new_owner;
> -                     if (c->mm)
> -                             break;
> -             }
> -     }
> -     read_unlock(&tasklist_lock);
> -     /*
> -      * We found no owner yet mm_users > 1: this implies that we are
> -      * most likely racing with swapoff (try_to_unuse()) or /proc or
> -      * ptrace or page migration (get_task_mm()).  Mark owner as NULL.
> -      */
> -     mm->owner = NULL;
> -     return;
> -
> -assign_new_owner:
> -     BUG_ON(c == p);
> -     get_task_struct(c);
> -     /*
> -      * The task_lock protects c->mm from changing.
> -      * We always want mm->owner->mm == mm
> -      */
> -     task_lock(c);
> -     /*
> -      * Delay read_unlock() till we have the task_lock()
> -      * to ensure that c does not slip away underneath us
> -      */
> -     read_unlock(&tasklist_lock);
> -     if (c->mm != mm) {
> -             task_unlock(c);
> -             put_task_struct(c);
> -             goto retry;
> -     }
> -     mm->owner = c;
> -     task_unlock(c);
> -     put_task_struct(c);
> -}
> -#endif /* CONFIG_MEMCG */
> -
>  /*
>   * Turn us into a lazy TLB process if we
>   * aren't already..
> @@ -433,7 +345,6 @@ static void exit_mm(struct task_struct *tsk)
>       up_read(&mm->mmap_sem);
>       enter_lazy_tlb(mm, current);
>       task_unlock(tsk);
> -     mm_update_next_owner(mm);
>       mmput(mm);
>       if (test_thread_flag(TIF_MEMDIE))
>               exit_oom_victim();
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 16e0f872f084..d073b6249d98 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -570,13 +570,6 @@ static void mm_init_aio(struct mm_struct *mm)
>  #endif
>  }
>  
> -static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
> -{
> -#ifdef CONFIG_MEMCG
> -     mm->owner = p;
> -#endif
> -}
> -
>  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
>  {
>       mm->mmap = NULL;
> @@ -596,7 +589,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
>       spin_lock_init(&mm->page_table_lock);
>       mm_init_cpumask(mm);
>       mm_init_aio(mm);
> -     mm_init_owner(mm, p);
>       mmu_notifier_mm_init(mm);
>       clear_tlb_flush_pending(mm);
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
> @@ -702,6 +694,7 @@ void mmput(struct mm_struct *mm)
>               }
>               if (mm->binfmt)
>                       module_put(mm->binfmt->module);
> +             mm_drop_memcg(mm);
>               mmdrop(mm);
>       }
>  }
> @@ -926,6 +919,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
>       if (mm->binfmt && !try_module_get(mm->binfmt->module))
>               goto free_pt;
>  
> +     mm_inherit_memcg(mm, oldmm);
>       return mm;
>  
>  free_pt:
> diff --git a/mm/debug.c b/mm/debug.c
> index 3eb3ac2fcee7..d0347a168651 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -184,7 +184,7 @@ void dump_mm(const struct mm_struct *mm)
>               "ioctx_table %p\n"
>  #endif
>  #ifdef CONFIG_MEMCG
> -             "owner %p "
> +             "memcg %p "
>  #endif
>               "exe_file %p\n"
>  #ifdef CONFIG_MMU_NOTIFIER
> @@ -218,7 +218,7 @@ void dump_mm(const struct mm_struct *mm)
>               mm->ioctx_table,
>  #endif
>  #ifdef CONFIG_MEMCG
> -             mm->owner,
> +             mm->memcg,
>  #endif
>               mm->exe_file,
>  #ifdef CONFIG_MMU_NOTIFIER
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 19ffae804076..4069ec8f52be 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -294,6 +294,18 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
>       return mem_cgroup_from_css(css);
>  }
>  
> +static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> +{
> +     if (p->mm)
> +             return rcu_dereference(p->mm->memcg);
> +
> +     /*
> +      * If the process doesn't have mm struct anymore we have to fall back
> +      * to the task_css.
> +      */
> +     return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
> +}
> +
>  /* Writing them here to avoid exposing memcg's inner layout */
>  #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>  
> @@ -783,19 +795,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
>       }
>  }
>  
> -struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> -{
> -     /*
> -      * mm_update_next_owner() may clear mm->owner to NULL
> -      * if it races with swapoff, page migration, etc.
> -      * So this can be called with p == NULL.
> -      */
> -     if (unlikely(!p))
> -             return NULL;
> -
> -     return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
> -}
> -
>  static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
>  {
>       struct mem_cgroup *memcg = NULL;
> @@ -810,7 +809,7 @@ static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
>               if (unlikely(!mm))
>                       memcg = root_mem_cgroup;
>               else {
> -                     memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> +                     memcg = rcu_dereference(mm->memcg);
>                       if (unlikely(!memcg))
>                               memcg = root_mem_cgroup;
>               }
> @@ -2286,7 +2285,7 @@ void __memcg_kmem_put_cache(struct kmem_cache *cachep)
>  }
>  
>  /*
> - * We need to verify if the allocation against current->mm->owner's memcg is
> + * We need to verify if the allocation against current->mm->memcg is
>   * possible for the given order. But the page is not allocated yet, so we'll
>   * need a further commit step to do the final arrangements.
>   *
> @@ -4737,7 +4736,7 @@ static void mem_cgroup_clear_mc(void)
>  static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
>                                struct cgroup_taskset *tset)
>  {
> -     struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +     struct mem_cgroup *to = mem_cgroup_from_css(css);
>       struct mem_cgroup *from;
>       struct task_struct *p;
>       struct mm_struct *mm;
> @@ -4749,37 +4748,49 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
>        * tunable will only affect upcoming migrations, not the current one.
>        * So we need to save it, and keep it going.
>        */
> -     move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
> +     move_flags = READ_ONCE(to->move_charge_at_immigrate);
>       if (!move_flags)
>               return 0;
>  
>       p = cgroup_taskset_first(tset);
> -     from = mem_cgroup_from_task(p);
> -
> -     VM_BUG_ON(from == memcg);
> +     if (!thread_group_leader(p))
> +             return 0;
>  
>       mm = get_task_mm(p);
>       if (!mm)
>               return 0;
> -     /* We move charges only when we move a owner of the mm */
> -     if (mm->owner == p) {
> -             VM_BUG_ON(mc.from);
> -             VM_BUG_ON(mc.to);
> -             VM_BUG_ON(mc.precharge);
> -             VM_BUG_ON(mc.moved_charge);
> -             VM_BUG_ON(mc.moved_swap);
> -
> -             spin_lock(&mc.lock);
> -             mc.from = from;
> -             mc.to = memcg;
> -             mc.flags = move_flags;
> -             spin_unlock(&mc.lock);
> -             /* We set mc.moving_task later */
> -
> -             ret = mem_cgroup_precharge_mc(mm);
> -             if (ret)
> -                     mem_cgroup_clear_mc();
> -     }
> +
> +     /*
> +      * tasks' cgroup might be different from the one p->mm is associated
> +      * with because CLONE_VM is allowed without CLONE_THREAD. The task is
> +      * moving so we have to migrate from the memcg associated with its
> +      * address space.
> +      * No need to take a reference here because the memcg is pinned by the
> +      * mm_struct.
> +      */
> +     from = READ_ONCE(mm->memcg);
> +     if (!from)
> +             from = root_mem_cgroup;
> +     if (from == to)
> +             goto out;
> +
> +     VM_BUG_ON(mc.from);
> +     VM_BUG_ON(mc.to);
> +     VM_BUG_ON(mc.precharge);
> +     VM_BUG_ON(mc.moved_charge);
> +     VM_BUG_ON(mc.moved_swap);
> +
> +     spin_lock(&mc.lock);
> +     mc.from = from;
> +     mc.to = to;
> +     mc.flags = move_flags;
> +     spin_unlock(&mc.lock);
> +     /* We set mc.moving_task later */
> +
> +     ret = mem_cgroup_precharge_mc(mm);
> +     if (ret)
> +             mem_cgroup_clear_mc();
> +out:
>       mmput(mm);
>       return ret;
>  }
> @@ -4932,14 +4943,26 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
>  {
>       struct task_struct *p = cgroup_taskset_first(tset);
>       struct mm_struct *mm = get_task_mm(p);
> +     struct mem_cgroup *old_memcg = NULL;
>  
>       if (mm) {
> +             old_memcg = READ_ONCE(mm->memcg);
> +             __mm_set_memcg(mm, mem_cgroup_from_css(css));
> +
>               if (mc.to)
>                       mem_cgroup_move_charge(mm);
>               mmput(mm);
>       }
>       if (mc.to)
>               mem_cgroup_clear_mc();
> +
> +     /*
> +      * Be careful and drop the reference only after we are done because
> +      * p's task_css memcg might be different from p->mm->memcg and nothing
> +      * else might be pinning the old memcg.
> +      */
> +     if (old_memcg)
> +             css_put(&old_memcg->css);
>  }
>  #else        /* !CONFIG_MMU */
>  static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
> -- 
> 2.1.4
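
For anybody trying to follow the reference counting described in the
changelog, here is a rough user-space sketch of the intended lifecycle:
inherit on fork/exec, swap on migration, drop on the final mmput. This is
only a model and not kernel code. All the model_* names and both structs
are hypothetical stand-ins invented for the illustration; there is no RCU
and no locking here, and a plain integer counter replaces the css refcount.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct mem_cgroup and struct mm_struct. */
struct memcg_model { int refs; };
struct mm_model { struct memcg_model *memcg; };

/* Plain counters standing in for css_get()/css_put(). */
static inline void model_css_get(struct memcg_model *m) { m->refs++; }
static inline void model_css_put(struct memcg_model *m)
{
	assert(m->refs > 0);
	m->refs--;
}

/* Like __mm_set_memcg: pin the new memcg, leave the old one to the caller. */
static inline void model_mm_set_memcg(struct mm_model *mm,
				      struct memcg_model *memcg)
{
	if (memcg)
		model_css_get(memcg);
	mm->memcg = memcg;
}

/* Like mm_inherit_memcg: a new mm starts in the memcg of the old mm. */
static inline void model_mm_inherit_memcg(struct mm_model *newmm,
					  const struct mm_model *oldmm)
{
	model_mm_set_memcg(newmm, oldmm->memcg);
}

/* Like mm_drop_memcg: release the pin once no task sees the mm anymore. */
static inline void model_mm_drop_memcg(struct mm_model *mm)
{
	if (mm->memcg)
		model_css_put(mm->memcg);
	mm->memcg = NULL;
}

/*
 * Ordering used in mem_cgroup_move_task: remember the old memcg, bind the
 * new one, and drop the old reference only after the work is done.
 */
static inline void model_move_task(struct mm_model *mm,
				   struct memcg_model *to)
{
	struct memcg_model *old = mm->memcg;

	model_mm_set_memcg(mm, to);
	/* charges would be moved from old to 'to' at this point */
	if (old)
		model_css_put(old);
}
```

The point of the ordering in model_move_task is that the old memcg stays
pinned for the whole move, so each mm holds exactly one reference at any
time and nothing is released while it might still be needed.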

-- 
Michal Hocko
SUSE Labs