from:"Michal Hocko"

Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-26 Thread Michal Hocko

On Mon 26-11-12 12:46:22, Johannes Weiner wrote:
> On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote:
> > [CCing also Johannes - the thread started here:
> > https://lkml.org/lkml/2012/11/21/497]
> > 
> > On Mon 26-11-12 01:38:55, azurIt wrote:
> > > >This is hackish but it should help you in this case. Kamezawa, what do
> > > >you think about that? Should we generalize this and prepare something
> > > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> > > >automatically and use the function whenever we are in a locked context?
> > > >To be honest I do not like this very much but nothing more sensible
> > > >(without touching non-memcg paths) comes to my mind.
> > > 
> > > 
> > > I installed kernel with this patch, will report back if problem occurs
> > > again OR in few weeks if everything will be ok. Thank you!
> > 
> > Now that I am looking at the patch closer it will not work because it
> > depends on other patch which is not merged yet and even that one would
> > help on its own because __GFP_NORETRY doesn't break the charge loop.
> > Sorry I have missed that...
> > 
> > The patch bellow should help though. (it is based on top of the current
> > -mm tree but I will send a backport to 3.2 in the reply as well)
> > ---
> > >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko 
> > Date: Mon, 26 Nov 2012 11:47:57 +0100
> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> > 
> > memcg oom killer might deadlock if the process which falls down to
> > mem_cgroup_handle_oom holds a lock which prevents other task to
> > terminate because it is blocked on the very same lock.
> > This can happen when a write system call needs to allocate a page but
> > the allocation hits the memcg hard limit and there is nothing to reclaim
> > (e.g. there is no swap or swap limit is hit as well and all cache pages
> > have been reclaimed already) and the process selected by memcg OOM
> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
> > 
> > Process A
> > [] do_truncate+0x58/0xa0  # takes i_mutex
> > [] do_last+0x250/0xa30
> > [] path_openat+0xd7/0x440
> > [] do_filp_open+0x49/0xa0
> > [] do_sys_open+0x106/0x240
> > [] sys_open+0x20/0x30
> > [] system_call_fastpath+0x18/0x1d
> > [] 0x
> > 
> > Process B
> > [] mem_cgroup_handle_oom+0x241/0x3b0
> > [] T.1146+0x5ab/0x5c0
> > [] mem_cgroup_cache_charge+0xbe/0xe0
> > [] add_to_page_cache_locked+0x4c/0x140
> > [] add_to_page_cache_lru+0x22/0x50
> > [] grab_cache_page_write_begin+0x8b/0xe0
> > [] ext3_write_begin+0x88/0x270
> > [] generic_file_buffered_write+0x116/0x290
> > [] __generic_file_aio_write+0x27c/0x480
> > [] generic_file_aio_write+0x76/0xf0   # takes 
> > ->i_mutex
> > [] do_sync_write+0xea/0x130
> > [] vfs_write+0xf3/0x1f0
> > [] sys_write+0x51/0x90
> > [] system_call_fastpath+0x18/0x1d
> > [] 0x
> 
> So process B manages to lock the hierarchy, calls
> mem_cgroup_out_of_memory() and retries the charge infinitely, waiting
> for task A to die.  All while it holds the i_mutex, preventing task A
> from dying, right?

Right.

> I think global oom already handles this in a much better way: invoke
> the OOM killer, sleep for a second, then return to userspace to
> relinquish all kernel resources and locks.  The only reason why we
> can't simply change from an endless retry loop is because we don't
> want to return VM_FAULT_OOM and invoke the global OOM killer.

Exactly.

> But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and
> just restart the pagefault.  Return -ENOMEM to the buffered IO syscall
> respectively.  This way, the memcg OOM killer is invoked as it should
> but nobody gets stuck anywhere livelocking with the exiting task.

Hmm, we would still have a problem with oom disabled (aka user space OOM
killer), right? All processes but those in mem_cgroup_handle_oom are
risky to be killed.
Other POV might be, why we should trigger an OOM killer from those paths
in the first place. Write or read (or even readahead) are all calls that
should rather fail than cause an OOM killer in my opinion.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

rework mem_cgroup iterator

2012-11-26 Thread Michal Hocko

Hi all,
this is a second version of the patchset previously posted here:
https://lkml.org/lkml/2012/11/13/335

The patch set tries to make mem_cgroup_iter saner in the way how it
walks hierarchies. css->id based traversal is far from being ideal as it
is not deterministic because it depends on the creation ordering.

Diffstat looks promising but it is fair the say that the biggest cleanup is
just css_get_next removal. The memcg code has grown a bit but I think it is
worth the resulting outcome (the sanity ;)).

The first patch fixes a potential misbehaving which I haven't seen but the
fix is needed for the later patches anyway. We could take it alone as well
but I do not have any bug report to base the fix on. The second one is also
preparatory and it is new to the series.

The third patch replaces css_get_next by cgroup iterators which are
scheduled for 3.8 in Tejun's tree and I depend on the following two patches:
fe1e904c cgroup: implement generic child / descendant walk macros
7e187c6c cgroup: use rculist ops for cgroup->children

The third and fourth patches are an attempt for simplification of the
mem_cgroup_iter. css juggling is removed and the iteration logic is
moved to a helper so that the reference counting and iteration are
separated.

The last patch just removes css_get_next as there is no user for it any
longer.

I am also thinking that leaf-to-root iteration makes more sense but this
patch is not included in the series yet because I have to think some
more about the justification.

I have dropped "[RFC 4/5] memcg: clean up mem_cgroup_iter"
(https://lkml.org/lkml/2012/11/13/333) because testing quickly shown
that my thinking was flawed and VM_BUG_ON(!prev && !memcg) triggered
very quickly. This also suggest that this version has seen some testing,
unlike the previous one ;)
The test was simple but I guess it exercised this code path quite heavily.
A (limit = 280M, use_hierarchy=true)
  / | \
 B  C  D (all have 100M limit)

and independent kernel build (with the full distribution config) in
all children groups. This triggers both children only and hierarchical
reclaims.

The shortlog says:
Michal Hocko (6):
  memcg: synchronize per-zone iterator access by a spinlock
  memcg: keep prev's css alive for the whole mem_cgroup_iter
  memcg: rework mem_cgroup_iter to use cgroup iterators
  memcg: simplify mem_cgroup_iter
  memcg: further simplify mem_cgroup_iter
  cgroup: remove css_get_next

And diffstat:
 include/linux/cgroup.h |7 ---
 kernel/cgroup.c|   49 -
 mm/memcontrol.c|  110 +---
 3 files changed, 86 insertions(+), 80 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch v2 2/6] memcg: keep prev's css alive for the whole mem_cgroup_iter

2012-11-26 Thread Michal Hocko

css reference counting keeps the cgroup alive even though it has been
already removed. mem_cgroup_iter relies on this fact and takes a
reference to the returned group. The reference is then released on the
next iteration or mem_cgroup_iter_break.
mem_cgroup_iter currently releases the reference right after it gets the
last css_id.
This is correct because neither prev's memcg nor cgroup are accessed
after then. This will change in the next patch so we need to hold the
group alive a bit longer so let's move the css_put at the end of the
function.

Signed-off-by: Michal Hocko 
---
 mm/memcontrol.c |   13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0e52ec9..1f5528d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1077,12 +1077,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
*root,
if (prev && !reclaim)
id = css_id(&prev->css);
 
-   if (prev && prev != root)
-   css_put(&prev->css);
-
if (!root->use_hierarchy && root != root_mem_cgroup) {
if (prev)
-   return NULL;
+   goto out_css_put;
return root;
}
 
@@ -1100,7 +1097,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
*root,
spin_lock(&iter->iter_lock);
if (prev && reclaim->generation != iter->generation) {
spin_unlock(&iter->iter_lock);
-   return NULL;
+   goto out_css_put;
}
id = iter->position;
}
@@ -1124,8 +1121,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
*root,
}
 
if (prev && !css)
-   return NULL;
+   goto out_css_put;
}
+out_css_put:
+   if (prev && prev != root)
+   css_put(&prev->css);
+
return memcg;
 }
 
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch v2 1/6] memcg: synchronize per-zone iterator access by a spinlock

2012-11-26 Thread Michal Hocko

per-zone per-priority iterator is aimed at coordinating concurrent
reclaimers on the same hierarchy (or the global reclaim when all
groups are reclaimed) so that all groups get reclaimed evenly as
much as possible. iter->position holds the last css->id visited
and iter->generation signals the completed tree walk (when it is
incremented).
Concurrent reclaimers are supposed to provide a reclaim cookie which
holds the reclaim priority and the last generation they saw. If cookie's
generation doesn't match the iterator's view then other concurrent
reclaimer already did the job and the tree walk is done for that
priority.

This scheme works nicely in most cases but it is not raceless. Two
racing reclaimers can see the same iter->position and so bang on the
same group. iter->generation increment is not serialized as well so a
reclaimer can see an updated iter->position with and old generation so
the iteration might be restarted from the root of the hierarchy.

The simplest way to fix this issue is to synchronise access to the
iterator by a lock. This implementation uses per-zone per-priority
spinlock which linearizes only directly racing reclaimers which use
reclaim cookies so the effect of the new locking should be really
minimal.

I have to note that I haven't seen this as a real issue so far. The
primary motivation for the change is different. The following patch
will change the way how the iterator is implemented and css->id
iteration will be replaced cgroup generic iteration which requires
storing mem_cgroup pointer into iterator and that requires reference
counting and so concurrent access will be a problem.

Signed-off-by: Michal Hocko 
Acked-by: KAMEZAWA Hiroyuki 
---
 mm/memcontrol.c |   12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ee2f7..0e52ec9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -148,6 +148,8 @@ struct mem_cgroup_reclaim_iter {
int position;
/* scan generation, increased every round-trip */
unsigned int generation;
+   /* lock to protect the position and generation */
+   spinlock_t iter_lock;
 };
 
 /*
@@ -1095,8 +1097,11 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
*root,
 
mz = mem_cgroup_zoneinfo(root, nid, zid);
iter = &mz->reclaim_iter[reclaim->priority];
-   if (prev && reclaim->generation != iter->generation)
+   spin_lock(&iter->iter_lock);
+   if (prev && reclaim->generation != iter->generation) {
+   spin_unlock(&iter->iter_lock);
return NULL;
+   }
id = iter->position;
}
 
@@ -1115,6 +1120,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
*root,
iter->generation++;
else if (!prev && memcg)
reclaim->generation = iter->generation;
+   spin_unlock(&iter->iter_lock);
}
 
if (prev && !css)
@@ -5880,8 +5886,12 @@ static int alloc_mem_cgroup_per_zone_info(struct 
mem_cgroup *memcg, int node)
return 1;
 
for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+   int prio;
+
mz = &pn->zoneinfo[zone];
lruvec_init(&mz->lruvec);
+   for (prio = 0; prio < DEF_PRIORITY + 1; prio++)
+   spin_lock_init(&mz->reclaim_iter[prio].iter_lock);
mz->usage_in_excess = 0;
mz->on_tree = false;
mz->memcg = memcg;
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch v2 6/6] cgroup: remove css_get_next

2012-11-26 Thread Michal Hocko

Now that we have generic and well ordered cgroup tree walkers there is
no need to keep css_get_next in the place.

Signed-off-by: Michal Hocko 
---
 include/linux/cgroup.h |7 ---
 kernel/cgroup.c|   49 
 2 files changed, 56 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 329eb46..ba46041 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -676,13 +676,6 @@ void free_css_id(struct cgroup_subsys *ss, struct 
cgroup_subsys_state *css);
 
 struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);
 
-/*
- * Get a cgroup whose id is greater than or equal to id under tree of root.
- * Returning a cgroup_subsys_state or NULL.
- */
-struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id,
-   struct cgroup_subsys_state *root, int *foundid);
-
 /* Returns true if root is ancestor of cg */
 bool css_is_ancestor(struct cgroup_subsys_state *cg,
 const struct cgroup_subsys_state *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d51958a..4d874b2 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5230,55 +5230,6 @@ struct cgroup_subsys_state *css_lookup(struct 
cgroup_subsys *ss, int id)
 }
 EXPORT_SYMBOL_GPL(css_lookup);
 
-/**
- * css_get_next - lookup next cgroup under specified hierarchy.
- * @ss: pointer to subsystem
- * @id: current position of iteration.
- * @root: pointer to css. search tree under this.
- * @foundid: position of found object.
- *
- * Search next css under the specified hierarchy of rootid. Calling under
- * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
- */
-struct cgroup_subsys_state *
-css_get_next(struct cgroup_subsys *ss, int id,
-struct cgroup_subsys_state *root, int *foundid)
-{
-   struct cgroup_subsys_state *ret = NULL;
-   struct css_id *tmp;
-   int tmpid;
-   int rootid = css_id(root);
-   int depth = css_depth(root);
-
-   if (!rootid)
-   return NULL;
-
-   BUG_ON(!ss->use_id);
-   WARN_ON_ONCE(!rcu_read_lock_held());
-
-   /* fill start point for scan */
-   tmpid = id;
-   while (1) {
-   /*
-* scan next entry from bitmap(tree), tmpid is updated after
-* idr_get_next().
-*/
-   tmp = idr_get_next(&ss->idr, &tmpid);
-   if (!tmp)
-   break;
-   if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
-   ret = rcu_dereference(tmp->css);
-   if (ret) {
-   *foundid = tmpid;
-   break;
-   }
-   }
-   /* continue to scan from next id */
-   tmpid = tmpid + 1;
-   }
-   return ret;
-}
-
 /*
  * get corresponding css from file open on cgroupfs directory
  */
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch v2 5/6] memcg: further simplify mem_cgroup_iter

2012-11-26 Thread Michal Hocko

mem_cgroup_iter basically does two things currently. It takes care of
the house keeping (reference counting, raclaim cookie) and it iterates
through a hierarchy tree (by using cgroup generic tree walk).
The code would be much more easier to follow if we move the iteration
outside of the function (to __mem_cgrou_iter_next) so the distinction
is more clear.
This patch doesn't introduce any functional changes.

Signed-off-by: Michal Hocko 
---
 mm/memcontrol.c |   79 ---
 1 file changed, 46 insertions(+), 33 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1bc0e8..a5018bc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1044,6 +1044,51 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct 
mm_struct *mm)
return memcg;
 }
 
+/*
+ * Returns a next (in a pre-order walk) alive memcg (with elevated css
+ * ref. count) or NULL if the whole root's subtree has been visited.
+ *
+ * helper function to be used by mem_cgroup_iter
+ */
+static struct mem_cgroup *__mem_cgrou_iter_next(struct mem_cgroup *root,
+   struct mem_cgroup *last_visited)
+{
+   struct cgroup *prev_cgroup, *next_cgroup;
+
+   /*
+* Root is not visited by cgroup iterators so it needs an
+* explicit visit.
+*/
+   if (!last_visited)
+   return root;
+
+   prev_cgroup = (last_visited == root) ? NULL
+   : last_visited->css.cgroup;
+skip_node:
+   next_cgroup = cgroup_next_descendant_pre(
+   prev_cgroup, root->css.cgroup);
+
+   /*
+* Even if we found a group we have to make sure it is
+* alive. css && !memcg means that the groups should be
+* skipped and we should continue the tree walk.
+* last_visited css is safe to use because it is
+* protected by css_get and the tree walk is rcu safe.
+*/
+   if (next_cgroup) {
+   struct mem_cgroup *mem = mem_cgroup_from_cont(
+   next_cgroup);
+   if (css_tryget(&mem->css))
+   return mem;
+   else {
+   prev_cgroup = next_cgroup;
+   goto skip_node;
+   }
+   }
+
+   return NULL;
+}
+
 /**
  * mem_cgroup_iter - iterate over memory cgroup hierarchy
  * @root: hierarchy root
@@ -1106,39 +1151,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
*root,
}
}
 
-   /*
-* Root is not visited by cgroup iterators so it needs an
-* explicit visit.
-*/
-   if (!last_visited) {
-   memcg = root;
-   } else {
-   struct cgroup *prev_cgroup, *next_cgroup;
-
-   prev_cgroup = (last_visited == root) ? NULL
-   : last_visited->css.cgroup;
-skip_node:
-   next_cgroup = cgroup_next_descendant_pre(
-   prev_cgroup, root->css.cgroup);
-
-   /*
-* Even if we found a group we have to make sure it is
-* alive. css && !memcg means that the groups should be
-* skipped and we should continue the tree walk.
-* last_visited css is safe to use because it is
-* protected by css_get and the tree walk is rcu safe.
-*/
-   if (next_cgroup) {
-   struct mem_cgroup *mem = mem_cgroup_from_cont(
-   next_cgroup);
-   if (css_tryget(&mem->css))
-   memcg = mem;
-   else {
-   prev_cgroup = next_cgroup;
-   goto skip_node;
-   }
-   }
-   }
+   memcg = __mem_cgrou_iter_next(root, last_visited);
 
if (reclaim) {
if (last_visited)
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch v2 3/6] memcg: rework mem_cgroup_iter to use cgroup iterators

2012-11-26 Thread Michal Hocko

mem_cgroup_iter curently relies on css->id when walking down a group
hierarchy tree. This is really awkward because the tree walk depends on
the groups creation ordering. The only guarantee is that a parent node
is visited before its children.
Example
 1) mkdir -p a a/d a/b/c
 2) mkdir -a a/b/c a/d
Will create the same trees but the tree walks will be different:
 1) a, d, b, c
 2) a, b, c, d

574bd9f7 (cgroup: implement generic child / descendant walk macros) has
introduced generic cgroup tree walkers which provide either pre-order
or post-order tree walk. This patch converts css->id based iteration
to pre-order tree walk to keep the semantic with the original iterator
where parent is always visited before its subtree.

cgroup_for_each_descendant_pre suggests using post_create and
pre_destroy for proper synchronization with groups addidition resp.
removal. This implementation doesn't use those because a new memory
cgroup is fully initialized in mem_cgroup_create and css reference
counting enforces that the group is alive for both the last seen cgroup
and the found one resp. it signals that the group is dead and it should
be skipped.

If the reclaim cookie is used we need to store the last visited group
into the iterator so we have to be careful that it doesn't disappear in
the mean time. Elevated reference count on the css keeps it alive even
though the group have been removed (parked waiting for the last dput so
that it can be freed).

V2
- use css_{get,put} for iter->last_visited rather than
  mem_cgroup_{get,put} because it is stronger wrt. cgroup life cycle
- cgroup_next_descendant_pre expects NULL pos for the first iterartion
  otherwise it might loop endlessly for intermediate node without any
  children.

Signed-off-by: Michal Hocko 
---
 mm/memcontrol.c |   74 ++-
 1 file changed, 57 insertions(+), 17 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1f5528d..6bcc97b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -144,8 +144,8 @@ struct mem_cgroup_stat_cpu {
 };
 
 struct mem_cgroup_reclaim_iter {
-   /* css_id of the last scanned hierarchy member */
-   int position;
+   /* last scanned hierarchy member with elevated css ref count */
+   struct mem_cgroup *last_visited;
/* scan generation, increased every round-trip */
unsigned int generation;
/* lock to protect the position and generation */
@@ -1066,7 +1066,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
*root,
   struct mem_cgroup_reclaim_cookie *reclaim)
 {
struct mem_cgroup *memcg = NULL;
-   int id = 0;
+   struct mem_cgroup *last_visited = NULL;
 
if (mem_cgroup_disabled())
return NULL;
@@ -1075,7 +1075,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
*root,
root = root_mem_cgroup;
 
if (prev && !reclaim)
-   id = css_id(&prev->css);
+   last_visited = prev;
 
if (!root->use_hierarchy && root != root_mem_cgroup) {
if (prev)
@@ -1083,9 +1083,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
*root,
return root;
}
 
+   rcu_read_lock();
while (!memcg) {
struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
-   struct cgroup_subsys_state *css;
+   struct cgroup_subsys_state *css = NULL;
 
if (reclaim) {
int nid = zone_to_nid(reclaim->zone);
@@ -1095,34 +1096,73 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
*root,
mz = mem_cgroup_zoneinfo(root, nid, zid);
iter = &mz->reclaim_iter[reclaim->priority];
spin_lock(&iter->iter_lock);
+   last_visited = iter->last_visited;
if (prev && reclaim->generation != iter->generation) {
+   if (last_visited) {
+   css_put(&last_visited->css);
+   iter->last_visited = NULL;
+   }
spin_unlock(&iter->iter_lock);
-   goto out_css_put;
+   goto out_unlock;
}
-   id = iter->position;
}
 
-   rcu_read_lock();
-   css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
-   if (css) {
-   if (css == &root->css || css_tryget(css))
-   memcg = mem_cgroup_from_css(css);
-   } else
-   id = 0;
-   rcu_read_unlock();
+   /*
+* Root

[patch v2 4/6] memcg: simplify mem_cgroup_iter

2012-11-26 Thread Michal Hocko

Current implementation of mem_cgroup_iter has to consider both css and
memcg to find out whether no group has been found (css==NULL - aka the
loop is completed) and that no memcg is associated with the found node
(!memcg - aka css_tryget failed because the group is no longer alive).
This leads to awkward tweaks like tests for css && !memcg to skip the
current node.

It will be much easier if we got rid off css variable altogether and
only rely on memcg. In order to do that the iteration part has to skip
dead nodes. This sounds natural to me and as a nice side effect we will
get a simple invariant that memcg is always alive when non-NULL and all
nodes have been visited otherwise.

We could get rid of the surrounding while loop but keep it in for now to
make review easier. It will go away in the following patch.

Signed-off-by: Michal Hocko 
---
 mm/memcontrol.c |   56 +++
 1 file changed, 27 insertions(+), 29 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6bcc97b..d1bc0e8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1086,7 +1086,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
*root,
rcu_read_lock();
while (!memcg) {
struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
-   struct cgroup_subsys_state *css = NULL;
 
if (reclaim) {
int nid = zone_to_nid(reclaim->zone);
@@ -1112,53 +,52 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
*root,
 * explicit visit.
 */
if (!last_visited) {
-   css = &root->css;
+   memcg = root;
} else {
struct cgroup *prev_cgroup, *next_cgroup;
 
prev_cgroup = (last_visited == root) ? NULL
: last_visited->css.cgroup;
-   next_cgroup = cgroup_next_descendant_pre(prev_cgroup,
-   root->css.cgroup);
-   if (next_cgroup)
-   css = cgroup_subsys_state(next_cgroup,
-   mem_cgroup_subsys_id);
-   }
+skip_node:
+   next_cgroup = cgroup_next_descendant_pre(
+   prev_cgroup, root->css.cgroup);
 
-   /*
-* Even if we found a group we have to make sure it is alive.
-* css && !memcg means that the groups should be skipped and
-* we should continue the tree walk.
-* last_visited css is safe to use because it is protected by
-* css_get and the tree walk is rcu safe.
-*/
-   if (css == &root->css || (css && css_tryget(css)))
-   memcg = mem_cgroup_from_css(css);
+   /*
+* Even if we found a group we have to make sure it is
+* alive. css && !memcg means that the groups should be
+* skipped and we should continue the tree walk.
+* last_visited css is safe to use because it is
+* protected by css_get and the tree walk is rcu safe.
+*/
+   if (next_cgroup) {
+   struct mem_cgroup *mem = mem_cgroup_from_cont(
+   next_cgroup);
+   if (css_tryget(&mem->css))
+   memcg = mem;
+   else {
+   prev_cgroup = next_cgroup;
+   goto skip_node;
+   }
+   }
+   }
 
if (reclaim) {
-   struct mem_cgroup *curr = memcg;
-
if (last_visited)
css_put(&last_visited->css);
 
-   if (css && !memcg)
-   curr = mem_cgroup_from_css(css);
-
/* make sure that the cached memcg is not removed */
-   if (curr)
-   css_get(&curr->css);
-   iter->last_visited = curr;
+   if (memcg)
+   css_get(&memcg->css);
+   iter->last_visited = memcg;
 
-   if (!css)
+   if (!memcg)
iter->generation++;
else if (!prev && memcg)
reclaim->generation = iter->generation;
spin_unlock(&ite

Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-26 Thread Michal Hocko

On Mon 26-11-12 13:24:21, Johannes Weiner wrote:
> On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote:
> > On Mon 26-11-12 12:46:22, Johannes Weiner wrote:
[...]
> > > I think global oom already handles this in a much better way: invoke
> > > the OOM killer, sleep for a second, then return to userspace to
> > > relinquish all kernel resources and locks.  The only reason why we
> > > can't simply change from an endless retry loop is because we don't
> > > want to return VM_FAULT_OOM and invoke the global OOM killer.
> > 
> > Exactly.
> > 
> > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and
> > > just restart the pagefault.  Return -ENOMEM to the buffered IO syscall
> > > respectively.  This way, the memcg OOM killer is invoked as it should
> > > but nobody gets stuck anywhere livelocking with the exiting task.
> > 
> > Hmm, we would still have a problem with oom disabled (aka user space OOM
> > killer), right? All processes but those in mem_cgroup_handle_oom are
> > risky to be killed.
> 
> Could we still let everybody get stuck in there when the OOM killer is
> disabled and let userspace take care of it?

I am not sure what exactly you mean by "userspace take care of it" but
if those processes are stuck and holding the lock then it is usually
hard to find that out. Well if somebody is familiar with internal then
it is doable but this makes the interface really unusable for regular
usage.

> > Other POV might be, why we should trigger an OOM killer from those paths
> > in the first place. Write or read (or even readahead) are all calls that
> > should rather fail than cause an OOM killer in my opinion.
> 
> Readahead is arguable, but we kill globally for read() and write() and
> I think we should do the same for memcg.

Fair point but the global case is little bit easier than memcg in this
case because nobody can hook on OOM killer and provide a userspace
implementation for it which is one of the cooler feature of memcg...
I am all open to any suggestions but we should somehow fix this (and
backport it to stable trees as this is there for quite some time. The
current report shows that the problem is not that hard to trigger).

> The OOM killer is there to resolve a problem that comes from
> overcommitting the machine but the overuse does not have to be from
> the application that pushes the machine over the edge, that's why we
> don't just kill the allocating task but actually go look for the best
> candidate.  If you have one memory hog that overuses the resources,
> attempted memory consumption in a different program should invoke the
> OOM killer.  

> It does not matter if this is a page fault (would still happen with
> your patch) or a bufferd read/write (would no longer happen).

true and it is sad that mmap then behaves slightly different than
read/write which should I've mentioned in the changelog. As I said I am
open to other suggestions.

Thanks
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-26 Thread Michal Hocko

On Mon 26-11-12 14:29:41, Johannes Weiner wrote:
> On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote:
> > On Mon 26-11-12 13:24:21, Johannes Weiner wrote:
> > > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote:
> > > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote:
> > [...]
> > > > > I think global oom already handles this in a much better way: invoke
> > > > > the OOM killer, sleep for a second, then return to userspace to
> > > > > relinquish all kernel resources and locks.  The only reason why we
> > > > > can't simply change from an endless retry loop is because we don't
> > > > > want to return VM_FAULT_OOM and invoke the global OOM killer.
> > > > 
> > > > Exactly.
> > > > 
> > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and
> > > > > just restart the pagefault.  Return -ENOMEM to the buffered IO syscall
> > > > > respectively.  This way, the memcg OOM killer is invoked as it should
> > > > > but nobody gets stuck anywhere livelocking with the exiting task.
> > > > 
> > > > Hmm, we would still have a problem with oom disabled (aka user space OOM
> > > > killer), right? All processes but those in mem_cgroup_handle_oom are
> > > > risky to be killed.
> > > 
> > > Could we still let everybody get stuck in there when the OOM killer is
> > > disabled and let userspace take care of it?
> > 
> > I am not sure what exactly you mean by "userspace take care of it" but
> > if those processes are stuck and holding the lock then it is usually
> > hard to find that out. Well if somebody is familiar with internal then
> > it is doable but this makes the interface really unusable for regular
> > usage.
> 
> If oom_kill_disable is set, then all processes get stuck all the way
> down in the charge stack.  Whatever resource they pin, you may
> deadlock on if you try to touch it while handling the problem from
> userspace.

OK, I guess I am getting what you are trying to say. So what you are
suggesting is to just let mem_cgroup_out_of_memory send the signal and
move on without retry (or with few charge retries without further OOM
killing) and fail the charge with your new FAULT_OOM_HANDLED (resp.
something like FAULT_RETRY) error code resp. ENOMEM depending on the
caller.  OOM disabled case would be "you are on your own" because this
has been dangerous anyway. Correct?
I do agree that the current endless retry loop is far from being ideal
and can see some updates but I am quite nervous about any potential
regressions in this area (e.g. too aggressive OOM etc...). I have to
think about it some more.
Anyway if you have some more specific ideas I would be happy to review
patches.

[...]

Thanks
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications

2012-11-26 Thread Michal Hocko

[Sorry to jump in that late]

On Tue 20-11-12 10:02:45, David Rientjes wrote:
> On Mon, 19 Nov 2012, Glauber Costa wrote:
> 
> > >> In the case I outlined below, for backwards compatibility. What I
> > >> actually mean is that memcg *currently* allows arbitrary notifications.
> > >> One way to merge those, while moving to a saner 3-point notification, is
> > >> to still allow the old writes and fit them in the closest bucket.
> > >>
> > > 
> > > Yeah, but I'm wondering why three is the right answer.
> > > 
> > 
> > This is unrelated to what I am talking about.
> > I am talking about pre-defined values with a specific event meaning (in
> > his patchset, 3) vs arbitrary numbers valued in bytes.
> > 
> 
> Right, and I don't see how you can map the memcg thresholds onto Anton's 
> scheme that heavily relies upon reclaim activity; what bucket does a 
> threshold of 48MB in a memcg with a limit of 64MB fit into?
> Perhaps you have some formula in mind that would do this, but I don't
> see how it works correctly without factoring in configuration options
> (memory compaction), type of allocation (GFP_ATOMIC won't trigger
> Anton's reclaim scheme like GFP_KERNEL), altered min_free_kbytes, etc.
> 
> This begs the question of whether the new cgroup should be considered as a 
> replacement for memory thresholds within memcg in the first place; 
> certainly both can coexist just fine.

Absolutely agreed. Yes those two things are inherently different.
Information that "you have passed half of your limit" is something
totally different than "you should slow down". Although I am not
entirely sure what the first is one good for (to be honest), but I
believe there are users out there.

I do not think that mixing those two makes much sense. They have
different usecases and until we have users for the thresholds one we
should keep it.

[...]

Thanks
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-26 Thread Michal Hocko

On Mon 26-11-12 15:19:18, Johannes Weiner wrote:
> On Mon, Nov 26, 2012 at 09:08:48PM +0100, Michal Hocko wrote:
[...]
> > OK, I guess I am getting what you are trying to say. So what you are
> > suggesting is to just let mem_cgroup_out_of_memory send the signal and
> > move on without retry (or with few charge retries without further OOM
> > killing) and fail the charge with your new FAULT_OOM_HANDLED (resp.
> > something like FAULT_RETRY) error code resp. ENOMEM depending on the
> > caller.  OOM disabled case would be "you are on your own" because this
> > has been dangerous anyway. Correct?
> 
> Yes.
> 
> > I do agree that the current endless retry loop is far from being ideal
> > and can see some updates but I am quite nervous about any potential
> > regressions in this area (e.g. too aggressive OOM etc...). I have to
> > think about it some more.
> 
> Agreed on all points.  Maybe we can keep a couple of the oom retry
> iterations or something like that, which is still much more than what
> global does and I don't think the global OOM killer is overly eager.

Yes we can offer less blood and more confort

> 
> Testing will show more.
> 
> > Anyway if you have some more specific ideas I would be happy to review
> > patches.
> 
> Okay, I just wanted to check back with you before going down this
> path.  What are we going to do short term, though?  Do you want to
> push the disable-oom-for-pagecache for now or should we put the
> VM_FAULT_OOM_HANDLED fix in the next version and do stable backports?
> 
> This issue has been around for a while so frankly I don't think it's
> urgent enough to rush things.

Yes, but now we have a real usecase where this hurts AFAIU. Unless
we come up with a fix/reasonable workaround I would rather go with
something simpler for starter and more sofisticated later.

I have to double check other places where we do charging but the last
time I've checked we don't hold page locks on already visible pages (we
do precharge in __do_fault f.e.), mem_map for reading in the page fault
path is also safe (with oom enabled) and I guess that tmpfs is ok as
well. Then we have a page cache and that one should be covered by my
patch. So we should be covered.

But I like your idea long term.

Thanks!
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: mm/vmemmap: fix wrong use of virt_to_page

2012-11-27 Thread Michal Hocko

On Tue 27-11-12 10:17:13, Jianguo Wu wrote:
> I enable CONFIG_DEBUG_VIRTUAL and CONFIG_SPARSEMEM_VMEMMAP, when doing memory 
> hotremove,
> there is a kernel BUG at arch/x86/mm/physaddr.c:20.
> 
> It is caused by free_section_usemap()->virt_to_page(),
> virt_to_page() is only used for kernel direct mapping address,
> but sparse-vmemmap uses vmemmap address, so it is going wrong here.
> 
> [  517.727381] [ cut here ]
> [  517.728851] kernel BUG at arch/x86/mm/physaddr.c:20!
> [  517.728851] invalid opcode:  [#1] SMP
> [  517.740170] Modules linked in: acpihp_drv acpihp_slot edd 
> cpufreq_conservativ
> e cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse vfat fat loop 
> dm_m
> od coretemp kvm crc32c_intel ipv6 ixgbe igb iTCO_wdt i7core_edac edac_core 
> pcspk
> r iTCO_vendor_support ioatdma microcode joydev sr_mod i2c_i801 dca lpc_ich 
> mfd_c
> ore mdio tpm_tis i2c_core hid_generic tpm cdrom sg tpm_bios rtc_cmos button 
> ext3
>  jbd mbcache usbhid hid uhci_hcd ehci_hcd usbcore usb_common sd_mod 
> crc_t10dif p
> rocessor thermal_sys hwmon scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac 
> scsi_dh_emc s
> csi_dh ata_generic ata_piix libata megaraid_sas scsi_mod
> [  517.740170] CPU 39
> [  517.740170] Pid: 6454, comm: sh Not tainted 3.7.0-rc1-acpihp-final+ #45 
> QCI Q
> SSC-S4R/QSSC-S4R
> [  517.740170] RIP: 0010:[]  [] 
> __phys_addr+
> 0x88/0x90
> [  517.740170] RSP: 0018:8804440d7c08  EFLAGS: 00010006
> [  517.740170] RAX: 0006 RBX: ea001200 RCX: 
> 002c
> 
> [  517.740170] RDX: 62001200 RSI:  RDI: 
> ea001200
> 
> [  517.740170] RBP: 8804440d7c08 R08: 00700400 R09: 
> 00488000
> 
> [  517.740170] R10: 0091 R11: 0001 R12: 
> 88047fb87800
> 
> [  517.740170] R13: ea00 R14: 88047ffb3440 R15: 
> 0048
> 
> [  517.740170] FS:  7f0462b49700() GS:8804570c() 
> knlGS:0
> 000
> [  517.740170] CS:  0010 DS:  ES:  CR0: 8005003b
> [  517.740170] CR2: 7f006dc5fd14 CR3: 000440e85000 CR4: 
> 07e0
> 
> [  517.740170] DR0: 0000 DR1:  DR2: 
> 
> 
> [  517.896799] DR3:  DR6
> 
> Signed-off-by: Jianguo Wu 
> Signed-off-by: Jiang Liu 

Reviewed-by: Michal Hocko 

Hmm, this is quite old. AFAIU this has been introduced by 0c0a4a51
(memory hotplug: free memmaps allocated by bootmem) introduced in 2.6.26
so this should be marked for stable.

Thanks!

> ---
>  mm/sparse.c |   10 --
>  1 files changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index fac95f2..a83de2f 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -617,7 +617,7 @@ static void __kfree_section_memmap(struct page *memmap, 
> unsigned long nr_pages)
>  {
>   return; /* XXX: Not implemented yet */
>  }
> -static void free_map_bootmem(struct page *page, unsigned long nr_pages)
> +static void free_map_bootmem(struct page *memmap, unsigned long nr_pages)
>  {
>  }
>  #else
> @@ -658,10 +658,11 @@ static void __kfree_section_memmap(struct page *memmap, 
> unsigned long nr_pages)
>  get_order(sizeof(struct page) * nr_pages));
>  }
>  
> -static void free_map_bootmem(struct page *page, unsigned long nr_pages)
> +static void free_map_bootmem(struct page *memmap, unsigned long nr_pages)
>  {
>   unsigned long maps_section_nr, removing_section_nr, i;
>   unsigned long magic;
> + struct page *page = virt_to_page(memmap);
>  
>   for (i = 0; i < nr_pages; i++, page++) {
>   magic = (unsigned long) page->lru.next;
> @@ -710,13 +711,10 @@ static void free_section_usemap(struct page *memmap, 
> unsigned long *usemap)
>*/
>  
>   if (memmap) {
> - struct page *memmap_page;
> - memmap_page = virt_to_page(memmap);
> -
>   nr_pages = PAGE_ALIGN(PAGES_PER_SECTION * sizeof(struct page))
>   >> PAGE_SHIFT;
>  
> - free_map_bootmem(memmap_page, nr_pages);
> + free_map_bootmem(memmap, nr_pages);
>   }
>  }
>  
> -- 
> 1.7.1
> 
> 

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-27 Thread Michal Hocko

On Tue 27-11-12 09:05:30, KAMEZAWA Hiroyuki wrote:
[...]
> As a short term fix, I think this patch will work enough and seems simple 
> enough.
> Acked-by: KAMEZAWA Hiroyuki 

Thanks!
If Johannes is also ok with this for now I will resubmit the patch to
Andrew after I hear back from the reporter.
 
> Reading discussion between you and Johannes, to release locks, I understand
> the memcg need to return "RETRY" for a long term fix. Thinking a little,
> it will be simple to return "RETRY" to all processes waited on oom kill queue
> of a memcg and it can be done by a small fixes to memory.c.

I wouldn't call it simple but it is doable.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-27 Thread Michal Hocko

On Tue 27-11-12 14:48:13, Johannes Weiner wrote:
[...]
> > >diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > >index 10e667f..aac9b21 100644
> > >--- a/include/linux/gfp.h
> > >+++ b/include/linux/gfp.h
> > >@@ -152,6 +152,9 @@ struct vm_area_struct;
> > >  /* 4GB DMA on some platforms */
> > >  #define GFP_DMA32__GFP_DMA32
> > >
> > >+/* memcg oom killer is not allowed */
> > >+#define GFP_MEMCG_NO_OOM  __GFP_NORETRY
> 
> Could we leave this within memcg, please?  An extra flag to
> mem_cgroup_cache_charge() or the like.  GFP flags are about
> controlling the page allocator, this seems abusive.  We have an oom
> flag down in try_charge, maybe just propagate this up the stack?

OK, what about the patch bellow?
I have dropped Kame's Acked-by because it has been reworked. The patch
is the same in principle.

> > >diff --git a/mm/filemap.c b/mm/filemap.c
> > >index 83efee7..ef14351 100644
> > >--- a/mm/filemap.c
> > >+++ b/mm/filemap.c
> > >@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, 
> > >struct address_space *mapping,
> > >   VM_BUG_ON(!PageLocked(page));
> > >   VM_BUG_ON(PageSwapBacked(page));
> > >
> > >-  error = mem_cgroup_cache_charge(page, current->mm,
> > >+  /*
> > >+   * Cannot trigger OOM even if gfp_mask would allow that normally
> > >+   * because we might be called from a locked context and that
> > >+   * could lead to deadlocks if the killed process is waiting for
> > >+   * the same lock.
> > >+   */
> > >+  error = mem_cgroup_cache_charge_no_oom(page, current->mm,
> > >   gfp_mask & GFP_RECLAIM_MASK);
> > >   if (error)
> > >   goto out;
> 
> Shmem does not use this function but also charges under the i_mutex in
> the write path and fallocate at least.

Right you are
---
>From 60cc8a184490d277eb24fca551b114f1e2234ce0 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents other task to
terminate because it is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncate it).

Process A
[] do_truncate+0x58/0xa0  # takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0x

Process B
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0   # takes 
->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0x

This is not a hard deadlock though because administrator can still
intervene and increase the limit on the group which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_ro_page_cache_locked). mem_cgroup_cache_charge grows oom
argument which is pushed down the call chain.

As a possibly visible result add_to_page_cache_lru might fail more often
with ENOMEM but this is to be expected if the limit is set and it is
preferable than OOM killer IMO.

Changes since v1
- do not abuse gfp_flags and rather use oom parameter directly as per
  Johannes
- handle also shmem write fauls resp. fallocate properly as per Johannes

Reported-by: azurIt 
Signed-off-by: Michal Hocko 
---
 include/linux/memcontrol.h |5 +++--
 mm/filemap.c   |9 +++--
 mm/memcontrol.c|9 -
 mm/shmem.c |   14 +++---
 4 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..8f48d5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
 extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
 
 extern int mem_cgroup_cache_

Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-27 Thread Michal Hocko

Sorry, forgot to about one shmem charge:
---
>From 7ae29927d24471c1b1a6ceb021219c592c1ef518 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Tue, 27 Nov 2012 21:53:13 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents other task to
terminate because it is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncate it).

Process A
[] do_truncate+0x58/0xa0  # takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0x

Process B
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0   # takes 
->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0x

This is not a hard deadlock though because administrator can still
intervene and increase the limit on the group which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_ro_page_cache_locked). mem_cgroup_cache_charge grows oom
argument which is pushed down the call chain.

As a possibly visible result add_to_page_cache_lru might fail more often
with ENOMEM but this is to be expected if the limit is set and it is
preferable than OOM killer IMO.

Changes since v1
- do not abuse gfp_flags and rather use oom parameter directly as per
  Johannes
- handle also shmem write fauls resp. fallocate properly as per Johannes

Reported-by: azurIt 
Signed-off-by: Michal Hocko 
---
 include/linux/memcontrol.h |5 +++--
 mm/filemap.c   |9 +++--
 mm/memcontrol.c|9 -
 mm/shmem.c |   15 ---
 4 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..8f48d5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
 extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-   gfp_t gfp_mask);
+   gfp_t gfp_mask, bool oom);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page 
*page,
 }
 
 static inline int mem_cgroup_cache_charge(struct page *page,
-   struct mm_struct *mm, gfp_t gfp_mask)
+   struct mm_struct *mm, gfp_t gfp_mask,
+   bool oom)
 {
return 0;
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ef8fbd5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct 
address_space *mapping,
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageSwapBacked(page));
 
-   error = mem_cgroup_cache_charge(page, current->mm,
-   gfp_mask & GFP_RECLAIM_MASK);
+   /*
+* Cannot trigger OOM even if gfp_mask would allow that normally
+* because we might be called from a locked context and that
+* could lead to deadlocks if the killed process is waiting for
+* the same lock.
+*/
+   error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false);
if (error)
goto out;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ee2f7..26690d6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3709,11 +3709,10 @@ out:
  * < 0 if the cgroup is over its limit
  */
 static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
-   gfp_t gfp_mask, enum charge_type ctype)
+   gfp_t gfp_mask, enum charge_type ctype, bool 
oom)
 {
struct mem_cgroup *memcg = NULL;
unsigned int nr_pages = 1;
-   bool oom = true;
in

Re: [patch v2 3/6] memcg: rework mem_cgroup_iter to use cgroup iterators

2012-11-28 Thread Michal Hocko

On Wed 28-11-12 17:47:59, KAMEZAWA Hiroyuki wrote:
> (2012/11/27 3:47), Michal Hocko wrote:
[...]
> > +   /*
> > +* Even if we found a group we have to make sure it is alive.
> > +* css && !memcg means that the groups should be skipped and
> > +* we should continue the tree walk.
> > +* last_visited css is safe to use because it is protected by
> > +* css_get and the tree walk is rcu safe.
> > +*/
> > +   if (css == &root->css || (css && css_tryget(css)))
> > +   memcg = mem_cgroup_from_css(css);
> 
> Could you note that this iterator will never visit dangling(removed)
> memcg, somewhere ?

OK, I can add it to the function comment but the behavior hasn't changed
so I wouldn't like to confuse anybody.

> Hmm, I'm not sure but it may be trouble at shrkinking dangling
> kmem_cache(slab).

We do not shrink slab at all. Those objects that are in a dead memcg
wait for their owner tho release them which will make the dangling group
eventually go away

> 
> Costa, how do you think ?
> 
> I guess there is no problem with swap and not against the way you go.

Yes, swap should be OK. Pages charged against removed memcg will
fallback to the the current's mm (try_get_mem_cgroup_from_page and
__mem_cgroup_try_charge_swapin)

[...]
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch v2 3/6] memcg: rework mem_cgroup_iter to use cgroup iterators

2012-11-28 Thread Michal Hocko

On Wed 28-11-12 13:23:57, Glauber Costa wrote:
> On 11/28/2012 01:17 PM, Michal Hocko wrote:
> > On Wed 28-11-12 17:47:59, KAMEZAWA Hiroyuki wrote:
> >> (2012/11/27 3:47), Michal Hocko wrote:
> > [...]
> >>> + /*
> >>> +  * Even if we found a group we have to make sure it is alive.
> >>> +  * css && !memcg means that the groups should be skipped and
> >>> +  * we should continue the tree walk.
> >>> +  * last_visited css is safe to use because it is protected by
> >>> +  * css_get and the tree walk is rcu safe.
> >>> +  */
> >>> + if (css == &root->css || (css && css_tryget(css)))
> >>> + memcg = mem_cgroup_from_css(css);
> >>
> >> Could you note that this iterator will never visit dangling(removed)
> >> memcg, somewhere ?
> > 
> > OK, I can add it to the function comment but the behavior hasn't changed
> > so I wouldn't like to confuse anybody.
> > 
> >> Hmm, I'm not sure but it may be trouble at shrkinking dangling
> >> kmem_cache(slab).
> > 
> > We do not shrink slab at all. 
> 
> yet. However...
> 
> > Those objects that are in a dead memcg
> > wait for their owner tho release them which will make the dangling group
> > eventually go away
> > 
> >>
> >> Costa, how do you think ?
> >>
> 
> In general, I particularly believe it is a good idea to skip dead memcgs
> in the iterator. I don't anticipate any problems with shrinking at all.

We even cannot iterate dead ones because their cgroups are gone and so
you do not have any way to iterate. So either make them alive by css_get
or we cannot iterate them.

> Basically, we will only ever actively shrink when the memcg is dying, at
> which point it is still alive.
> After this, it's better to just leave it to vmscan. Whenever there is
> pressure, it will go away.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-28 Thread Michal Hocko

On Wed 28-11-12 10:26:31, Johannes Weiner wrote:
> On Tue, Nov 27, 2012 at 09:59:44PM +0100, Michal Hocko wrote:
> > @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct 
> > mm_struct *mm,
> > return 0;
> >  
> > if (!PageSwapCache(page))
> > -   ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
> > +   ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
> > else { /* page is swapcache/shmem */
> > ret = __mem_cgroup_try_charge_swapin(mm, page,
> >  gfp_mask, &memcg);
> 
> I think you need to pass it down the swapcache path too, as that is
> what happens when the shmem page written to is in swap and has been
> read into swapcache by the time of charging.

You are right, of course. I shouldn't send patches late in the evening
after staring to a crashdump for a good part of the day. /me ashamed.

> > @@ -1152,8 +1152,16 @@ repeat:
> > goto failed;
> > }
> >  
> > +/*
> > +  * Cannot trigger OOM even if gfp_mask would allow that
> > +  * normally because we might be called from a locked
> > +  * context (i_mutex held) if this is a write lock or
> > +  * fallocate and that could lead to deadlocks if the
> > +  * killed process is waiting for the same lock.
> > + */
> 
> Indentation broken?

c&p

> > error = mem_cgroup_cache_charge(page, current->mm,
> > -   gfp & GFP_RECLAIM_MASK);
> > +   gfp & GFP_RECLAIM_MASK,
> > +   sgp < SGP_WRITE);
> 
> The code tests for read-only paths a bunch of times using
> 
>   sgp != SGP_WRITE && sgp != SGP_FALLOC
> 
> Would probably be more consistent and more robust to use this here as
> well?

Yes my laziness. I was considering that but it was really long so I've
chosen the simpler way. But you are right that consistency is probably
better here

> > @@ -1209,7 +1217,8 @@ repeat:
> > SetPageSwapBacked(page);
> > __set_page_locked(page);
> > error = mem_cgroup_cache_charge(page, current->mm,
> > -   gfp & GFP_RECLAIM_MASK);
> > +   gfp & GFP_RECLAIM_MASK,
> > +       sgp < SGP_WRITE);
> 
> Same.
> 
> Otherwise, the patch looks good to me, thanks for persisting :)

Thanks for the throughout review.
Here we go with the fixed version.
---
>From 5000bf32c9c02fcd31d18e615300d8e7e7ef94a5 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Wed, 28 Nov 2012 16:49:46 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents other task to
terminate because it is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncate it).

Process A
[] do_truncate+0x58/0xa0  # takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0x

Process B
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0   # takes 
->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0x

This is not a hard deadlock though because administrator can still
intervene and increase the limit on the group which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_ro_page_cache_locked). mem_cgroup_cache_charge grows oom
argument which is pushed down the call chain.

As a possibly visible result add_to_page_cache_lru might fail more often
with ENOMEM but this is to be expected if the limit is s

Re: [RFC] Add mempressure cgroup

2012-11-28 Thread Michal Hocko

ck);
> +
> + if (mpc->eventfd)
> + ret = -EBUSY;

The current cgroup's core doesn't allow pre_destroy to fail anymore. The
code is marked for 3.8

[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 48550c6..430d8a5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1877,6 +1877,8 @@ restart:
>   shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  sc, LRU_ACTIVE_ANON);
>  
> + vmpressure(sc->nr_scanned - nr_scanned, nr_reclaimed);
> +

I think this should already report to a proper group otherwise all the
global reclaim would go to a group where kswapd sits rather than to the
target group as I mentioned above (so it at least wouldn't work with a
co-mounted cases).

>   /* reclaim/compaction might need reclaim to continue */
>   if (should_continue_reclaim(lruvec, nr_reclaimed,
>   sc->nr_scanned - nr_scanned, sc))
> @@ -2099,6 +2101,7 @@ static unsigned long do_try_to_free_pages(struct 
> zonelist *zonelist,
>   count_vm_event(ALLOCSTALL);
>  
>   do {
> + vmpressure_prio(sc->priority);

Shouldn't this go into shrink_lruvec or somewhere at that level to catch
also kswapd low priorities? If you insist on the direct reclaim then you
should hook into __zone_reclaim as well.

>   sc->nr_scanned = 0;
>   aborted_reclaim = shrink_zones(zonelist, sc);

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-28 Thread Michal Hocko

On Wed 28-11-12 11:37:36, Johannes Weiner wrote:
> On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote:
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 095d2b4..5abe441 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, 
> > struct mm_struct *mm,
> > gfp_t gfp_mask);
> >  /* for swap handling */
> >  extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> > -   struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> > +   struct page *page, gfp_t mask, struct mem_cgroup **memcgp,
> > +   bool oom);
> 
> Ok, now I feel almost bad for asking, but why the public interface,
> too?

Would it work out if I tell it was to double check that your review
quality is not decreased after that many revisions? :P

Incremental update and the full patch in the reply
---
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5abe441..8f48d5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -57,8 +57,7 @@ extern int mem_cgroup_newpage_charge(struct page *page, 
struct mm_struct *mm,
gfp_t gfp_mask);
 /* for swap handling */
 extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-   struct page *page, gfp_t mask, struct mem_cgroup **memcgp,
-   bool oom);
+   struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
 extern void mem_cgroup_commit_charge_swapin(struct page *page,
struct mem_cgroup *memcg);
 extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
@@ -218,8 +217,7 @@ static inline int mem_cgroup_cache_charge(struct page *page,
 }
 
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-   struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp,
-   bool oom)
+   struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
 {
return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02a6d70..3c9b1c5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3789,8 +3789,7 @@ charge_cur_mm:
 }
 
 int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
-gfp_t gfp_mask, struct mem_cgroup **memcgp,
-bool oom)
+gfp_t gfp_mask, struct mem_cgroup **memcgp)
 {
*memcgp = NULL;
if (mem_cgroup_disabled())
@@ -3804,12 +3803,12 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, 
struct page *page,
if (!PageSwapCache(page)) {
int ret;
 
-   ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, oom);
+   ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, true);
if (ret == -EINTR)
ret = 0;
return ret;
}
-   return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, oom);
+   return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true);
 }
 
 void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
diff --git a/mm/memory.c b/mm/memory.c
index afad903..6891d3b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
}
}
 
-   if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) {
+   if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
ret = VM_FAULT_OOM;
goto out_page;
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8ec511e..2f8e429 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
int ret = 1;
 
if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
-GFP_KERNEL, &memcg, true)) {
+GFP_KERNEL, &memcg)) {
ret = -ENOMEM;
goto out_nolock;
}
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-28 Thread Michal Hocko

On Wed 28-11-12 17:46:40, Michal Hocko wrote:
> On Wed 28-11-12 11:37:36, Johannes Weiner wrote:
> > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote:
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 095d2b4..5abe441 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page 
> > > *page, struct mm_struct *mm,
> > >   gfp_t gfp_mask);
> > >  /* for swap handling */
> > >  extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> > > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> > > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp,
> > > + bool oom);
> > 
> > Ok, now I feel almost bad for asking, but why the public interface,
> > too?
> 
> Would it work out if I tell it was to double check that your review
> quality is not decreased after that many revisions? :P
> 
> Incremental update and the full patch in the reply
---
>From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Wed, 28 Nov 2012 17:46:32 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents other task to
terminate because it is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncate it).

Process A
[] do_truncate+0x58/0xa0  # takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0x

Process B
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0   # takes 
->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0x

This is not a hard deadlock though because administrator can still
intervene and increase the limit on the group which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_ro_page_cache_locked). mem_cgroup_cache_charge grows oom
argument which is pushed down the call chain.

As a possibly visible result add_to_page_cache_lru might fail more often
with ENOMEM but this is to be expected if the limit is set and it is
preferable than OOM killer IMO.

Changes since v1
- do not abuse gfp_flags and rather use oom parameter directly as per
  Johannes
- handle also shmem write fauls resp. fallocate properly as per Johannes

Reported-by: azurIt 
Signed-off-by: Michal Hocko 
---
 include/linux/memcontrol.h |5 +++--
 mm/filemap.c   |9 +++--
 mm/memcontrol.c|   20 ++--
 mm/shmem.c |   17 ++---
 4 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..8f48d5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
 extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-   gfp_t gfp_mask);
+   gfp_t gfp_mask, bool oom);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page 
*page,
 }
 
 static inline int mem_cgroup_cache_charge(struct page *page,
-   struct mm_struct *mm, gfp_t gfp_mask)
+   struct mm_struct *mm, gfp_t gfp_mask,
+   bool oom)
 {
return 0;
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ef8fbd5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct 
ad

[PATCH] memcg: do not check for mm in mem_cgroup_count_vm_event disabled

2012-11-29 Thread Michal Hocko

On Wed 28-11-12 15:29:30, Hugh Dickins wrote:
> On Wed, 21 Nov 2012, Michal Hocko wrote:
> > On Tue 20-11-12 13:49:32, Andrew Morton wrote:
> > > On Mon, 19 Nov 2012 17:44:34 -0800 (PST)
> > > David Rientjes  wrote:
[...]
> > > > -void mem_cgroup_count_vm_event(struct mm_struct *mm, enum 
> > > > vm_event_item idx);
> > > > +void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum 
> > > > vm_event_item idx);
> > > > +static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
> > > > +enum vm_event_item idx)
> > > > +{
> > > > +   if (mem_cgroup_disabled() || !mm)
> > > > +   return;
> > > > +   __mem_cgroup_count_vm_event(mm, idx);
> > > > +}
> > > 
> > > Does the !mm case occur frequently enough to justify inlining it, or
> > > should that test remain out-of-line?
> > 
> > Now that you've asked about it I started looking around and I cannot see
> > how mm can ever be NULL. The condition is there since the very beginning
> > (456f998e memcg: add the pagefault count into memcg stats) but all the
> > callers are page fault handlers and those shouldn't have mm==NULL.
> > Or is there anything obvious I am missing?
> > 
> > Ying, the whole thread starts https://lkml.org/lkml/2012/11/19/545 but
> > the primary question is why we need !mm test for mem_cgroup_count_vm_event
> > at all.
> 
> Here's a guess: as Ying's 456f998e patch started out in akpm's tree,
> shmem.c was calling mem_cgroup_count_vm_event(current->mm, PGMAJFAULT).
> 
> Then I insisted that was inconsistent with how we usually account when
> one task touches another's address space, and rearranged it to work on
> vma->vm_mm instead.

Thanks Hugh!
 
> Done the original way, if the touching task were a kernel daemon (KSM's
> ksmd comes to my mind), then the current->mm could well have been NULL.
> 
> I agree with you that it looks redundant now.

Andrew could you please pick this up?
---
>From 619b1ab26c3e96944f6c60256cf7920671bafa5b Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Thu, 29 Nov 2012 14:20:58 +0100
Subject: [PATCH] memcg: do not check for mm in mem_cgroup_count_vm_event

mm given to mem_cgroup_count_vm_event cannot be NULL because the
function is either called from the page fault path or vma->vm_mm is
used. So the check can be dropped.

The check has been introduced by 456f998e (memcg: add the pagefault
count into memcg stats) because the originally proposed patch used
current->mm for shmem but this has been changed to vma->vm_mm later on
without the check being removed (thanks to Hugh for this recollection).

Signed-off-by: Michal Hocko 
---
 include/linux/memcontrol.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..0108a56 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -188,7 +188,7 @@ void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum 
vm_event_item idx);
 static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
 enum vm_event_item idx)
 {
-   if (mem_cgroup_disabled() || !mm)
+   if (mem_cgroup_disabled())
return;
__mem_cgroup_count_vm_event(mm, idx);
 }
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-29 Thread Michal Hocko

On Wed 28-11-12 12:20:44, Hugh Dickins wrote:
[...]
> Sorry, Michal, you've laboured hard on this: but I dislike it so much
> that I'm here overcoming my dread of entering an OOM-killer discussion,
> and the resultant deluge of unwelcome CCs for eternity afterwards.
> 
> I had been relying on Johannes to repeat his "This issue has been
> around for a while so frankly I don't think it's urgent enough to
> rush things", but it looks like I have to be the one to repeat it.

Well, the idea was to use this only as a temporal fix and come up with a
better solution without any hurry.

> Your analysis of azurIt's traces may well be correct, and this patch
> may indeed ameliorate the situation, and it's fine as something for
> azurIt to try and report on and keep in his tree; but I hope that
> it does not go upstream and to stable.
> 
> Why do I dislike it so much?  I suppose because it's both too general
> and too limited at the same time.
> 
> Too general in that it changes the behaviour on OOM for a large set
> of memcg charges, all those that go through add_to_page_cache_locked(),
> when only a subset of those have the i_mutex issue.

This is a fair point but the real fix which we were discussing with
Johannes would be even more risky for stable.

> If you're going to be that general, why not go further?  Leave the
> mem_cgroup_cache_charge() interface as is, make it not-OOM internally,
> no need for SGP_WRITE,SGP_FALLOC distinctions in mm/shmem.c.  No other
> filesystem gets the benefit of those distinctions: isn't it better to
> keep it simple?  (And I can see a partial truncation case where shmem
> uses SGP_READ under i_mutex; and the change to shmem_unuse behaviour
> is a non-issue, since swapoff invites itself to be killed anyway.)
> 
> Too limited in that i_mutex is just the held resource which azurIt's
> traces have led you to, but it's a general problem that the OOM-killed
> task might be waiting for a resource that the OOM-killing task holds.
> 
> I suspect that if we try hard enough (I admit I have not), we can find
> an example of such a potential deadlock for almost every memcg charge
> site.  mmap_sem? not as easy to invent a case with that as I thought,
> since it needs a down_write, and the typical page allocations happen
> with down_read, and I can't think of a process which does down_write
> on another's mm.
> 
> But i_mutex is always good, once you remember the case of write to
> file from userspace page which got paged out, so the fault path has
> to read it back in, while i_mutex is still held at the outer level.
> An unusual case?  Well, normally yes, but we're considering
> out-of-memory conditions, which may converge upon cases like this.
> 
> Wouldn't it be nice if I could be constructive?  But I'm sceptical
> even of Johannes's faith in what the global OOM killer would do:
> how does __alloc_pages_slowpath() get out of its "goto restart"
> loop, excepting the trivial case when the killer is the killed?

I am not sure I am following you here but the Johannes's idea was to
break out of the charge after a signal has been sent and the charge
still fails and either retry the fault or fail the allocation. I think
this should work but I am afraid that this needs some tuning (number of
retries f.e.) to prevent from too aggressive OOM or too many failurs.

Do we have any other possibilities to solve this issue? Or do you think
we should ignore the problem just because nobody complained for such a
long time?
Dunno, I think we should fix this with something less risky for now and
come up with a real fix after it sees sufficient testing.

> I wonder why this issue has hit azurIt and no other reporter?
> No swap plays a part in it, but that's not so unusual.
> 
> Yours glOOMily,
> Hugh

[...]
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch v2 6/6] cgroup: remove css_get_next

2012-11-30 Thread Michal Hocko

On Fri 30-11-12 13:12:29, KAMEZAWA Hiroyuki wrote:
> (2012/11/27 3:47), Michal Hocko wrote:
> > Now that we have generic and well ordered cgroup tree walkers there is
> > no need to keep css_get_next in the place.
> > 
> > Signed-off-by: Michal Hocko 
> 
> Hm, then, the next think will be css_is_ancestor() etc..

Good point I thought that the only remaining part is swap accounting. It
is not directly with the iterators so I will send a separate patch.
Tejun said he has a replacement that could be used for the swap
accounting and then the whole css_id can be buried.

> 
> Acked-by: KAMEZAWA Hiroyuki 

Thanks for the review of the whole patchset, Kame!
I would like to hear back from Johannes and then resubmit the series to
Andrew.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

2012-11-30 Thread Michal Hocko

On Fri 30-11-12 12:21:32, KAMEZAWA Hiroyuki wrote:
> (2012/11/29 6:34), Tejun Heo wrote:
> > Hello, guys.
> > 
> > Depending on cgroup core locking - cgroup_mutex - is messy and makes
> > cgroup prone to locking dependency problems.  The current code already
> > has lock dependency loop - memcg nests get_online_cpus() inside
> > cgroup_mutex.  cpuset the other way around.
> > 
> > Regardless of the locking details, whatever is protecting cgroup has
> > inherently to be something outer to most other locking constructs.
> > cgroup calls into a lot of major subsystems which in turn have to
> > perform subsystem-specific locking.  Trying to nest cgroup
> > synchronization inside other locks isn't something which can work
> > well.
> > 
> > cgroup now has enough API to allow subsystems to implement their own
> > locking and cgroup_mutex is scheduled to be made private to cgroup
> > core.  This patchset makes cpuset implement its own locking instead of
> > relying on cgroup_mutex.
> > 
> > cpuset is rather nasty in this respect.  Some of it seems to have come
> > from the implementation history - cgroup core grew out of cpuset - but
> > big part stems from cpuset's need to migrate tasks to an ancestor
> > cgroup when an hotunplug event makes a cpuset empty (w/o any cpu or
> > memory).
> > 
> > This patchset decouples cpuset locking from cgroup_mutex.  After the
> > patchset, cpuset uses cpuset-specific cpuset_mutex instead of
> > cgroup_mutex.  This also removes the lockdep warning triggered during
> > cpu offlining (see 0009).
> > 
> > Note that this leaves memcg as the only external user of cgroup_mutex.
> > Michal, Kame, can you guys please convert memcg to use its own locking
> > too?
> > 
> 
> Hmm. let me seeat quick glance cgroup_lock() is used at
>   hierarchy policy change
>   kmem_limit
>   migration policy change
>   swapiness change
>   oom control
> 
> Because all aboves takes care of changes in hierarchy,
> Having a new memcg's mutex in ->create() may be a way.
> 
> Ah, hm, Costa is mentioning task-attach. is the task-attach problem in
> memcg ?

Yes because we do not want to leak charges if we race with one of the
above hierarchy operation. Swappiness and oom control are not a big
deal. Same applies to migration policy change.
Those could be solved by using the same memcg lock in the attach hook.
Hierarchy policy change would be a bigger issue because the task is
already linked to the group when the callback is called. Same applies to
kmem_limit. Sorry I didn't have time to look into this deeper so I
cannot offer any solution right now.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

2012-11-30 Thread Michal Hocko

On Fri 30-11-12 13:00:36, Glauber Costa wrote:
> On 11/30/2012 07:21 AM, Kamezawa Hiroyuki wrote:
> > (2012/11/29 6:34), Tejun Heo wrote:
> >> Hello, guys.
> >>
> >> Depending on cgroup core locking - cgroup_mutex - is messy and makes
> >> cgroup prone to locking dependency problems.  The current code already
> >> has lock dependency loop - memcg nests get_online_cpus() inside
> >> cgroup_mutex.  cpuset the other way around.
> >>
> >> Regardless of the locking details, whatever is protecting cgroup has
> >> inherently to be something outer to most other locking constructs.
> >> cgroup calls into a lot of major subsystems which in turn have to
> >> perform subsystem-specific locking.  Trying to nest cgroup
> >> synchronization inside other locks isn't something which can work
> >> well.
> >>
> >> cgroup now has enough API to allow subsystems to implement their own
> >> locking and cgroup_mutex is scheduled to be made private to cgroup
> >> core.  This patchset makes cpuset implement its own locking instead of
> >> relying on cgroup_mutex.
> >>
> >> cpuset is rather nasty in this respect.  Some of it seems to have come
> >> from the implementation history - cgroup core grew out of cpuset - but
> >> big part stems from cpuset's need to migrate tasks to an ancestor
> >> cgroup when an hotunplug event makes a cpuset empty (w/o any cpu or
> >> memory).
> >>
> >> This patchset decouples cpuset locking from cgroup_mutex.  After the
> >> patchset, cpuset uses cpuset-specific cpuset_mutex instead of
> >> cgroup_mutex.  This also removes the lockdep warning triggered during
> >> cpu offlining (see 0009).
> >>
> >> Note that this leaves memcg as the only external user of cgroup_mutex.
> >> Michal, Kame, can you guys please convert memcg to use its own locking
> >> too?
> >>
> > 
> > Hmm. let me seeat quick glance cgroup_lock() is used at
> >   hierarchy policy change
> >   kmem_limit
> >   migration policy change
> >   swapiness change
> >   oom control
> > 
> > Because all aboves takes care of changes in hierarchy,
> > Having a new memcg's mutex in ->create() may be a way.
> > 
> > Ah, hm, Costa is mentioning task-attach. is the task-attach problem in 
> > memcg ?
> > 
> 
> We disallow the kmem limit to be set if a task already exists in the
> cgroup. So we can't allow a new task to attach if we are setting the limit.

This is racy without additional locking, isn't it?

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHSET cgroup/for-3.8] cpuset: decouple cpuset locking from cgroup core

2012-11-30 Thread Michal Hocko

On Fri 30-11-12 13:42:28, Glauber Costa wrote:
[...]
> Speaking of it: Tejun's tree still lacks the kmem bits. How hard would
> it be for you to merge his branch into a temporary branch of your tree?

review-cpuset-locking is based on a post merge window merges so I cannot
merge it as is. I could cherry-pick the series after it is settled. I
have no idea how much conflicts this would bring, though.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch v2 5/6] memcg: further simplify mem_cgroup_iter

2012-11-30 Thread Michal Hocko

On Fri 30-11-12 13:08:35, Glauber Costa wrote:
> On 11/26/2012 10:47 PM, Michal Hocko wrote:
> > The code would be much more easier to follow if we move the iteration
> > outside of the function (to __mem_cgrou_iter_next) so the distinction
> > is more clear.
> totally nit: Why is it call __mem_cgrou ?
> 
> What happened to your p ?

It was a fight against p as a source of all evil but the fight is over
and we can put it back :p 

Thanks for noticing
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] Drivers: hv: balloon: Support 2M page allocations for ballooning

2013-03-18 Thread Michal Hocko

On Sat 16-03-13 14:42:05, K. Y. Srinivasan wrote:
> While ballooning memory out of the guest, attempt 2M allocations first.
> If 2M allocations fail, then go for 4K allocations. In cases where we
> have performed 2M allocations, split this 2M page so that we can free this
> page at 4K granularity (when the host returns the memory).

Maybe I am missing something but what is the advantage of 2M allocation
when you split it up immediately so you are not using it as a huge page?

[...]
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] Drivers: hv: balloon: Support 2M page allocations for ballooning

2013-03-18 Thread Michal Hocko

On Mon 18-03-13 11:52:57, Michal Hocko wrote:
> On Sat 16-03-13 14:42:05, K. Y. Srinivasan wrote:
> > While ballooning memory out of the guest, attempt 2M allocations first.
> > If 2M allocations fail, then go for 4K allocations. In cases where we
> > have performed 2M allocations, split this 2M page so that we can free this
> > page at 4K granularity (when the host returns the memory).
> 
> Maybe I am missing something but what is the advantage of 2M allocation
> when you split it up immediately so you are not using it as a huge page?

OK, it seems that Patch 1/2 says some advantages. That description
should be part of this patch IMO.
You are saying that a) it reduces fragmentation on the host b) it makes
ballooning more effective.

Have you measured any effects on the ability to allocate huge pages on
the host? What prevents from depleting high order pages on the host with
many guests? Also have you measured any effect on THP allocation success
rate on the host?

Could you clarify how this helps ballooning when you split the page
right after you allocate it?
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/2] mm: Export split_page()

2013-03-18 Thread Michal Hocko

On Sat 16-03-13 14:42:04, K. Y. Srinivasan wrote:
> The split_page() function will be very useful for balloon drivers. On Hyper-V,
> it will be very efficient to use 2M allocations in the guest as this (a) makes
> the ballooning protocol with the host that much more efficient and (b) moving
> memory in 2M chunks minimizes fragmentation in the host. Export the 
> split_page()
> function to let the guest allocations be in 2M chunks while the host is free 
> to
> return this memory at arbitrary granularity.
> 
> Signed-off-by: K. Y. Srinivasan 

I do not have any objections to exporting the symbol (at least we
prevent drivers code from inventing their own split_page) but the
Hyper-V specific description should go into Hyper-V patch IMO.

So for the export with a short note that the symbol will be used by
Hyper-V
Acked-by: Michal Hocko 

> ---
>  mm/page_alloc.c |1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6cacfee..7e0ead6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1404,6 +1404,7 @@ void split_page(struct page *page, unsigned int order)
>   for (i = 1; i < (1 << order); i++)
>   set_page_refcounted(page + i);
>  }
> +EXPORT_SYMBOL_GPL(split_page);
>  
>  static int __isolate_free_page(struct page *page, unsigned int order)
>  {
> -- 
> 1.7.4.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] Drivers: hv: balloon: Support 2M page allocations for ballooning

2013-03-18 Thread Michal Hocko

On Mon 18-03-13 13:44:05, KY Srinivasan wrote:
> 
> 
> > -Original Message-
> > From: Michal Hocko [mailto:mho...@suse.cz]
> > Sent: Monday, March 18, 2013 6:53 AM
> > To: KY Srinivasan
> > Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
> > a...@firstfloor.org; a...@linux-foundation.org; linux...@kvack.org;
> > kamezawa.hiroy...@gmail.com; han...@cmpxchg.org; ying...@google.com
> > Subject: Re: [PATCH 2/2] Drivers: hv: balloon: Support 2M page allocations 
> > for
> > ballooning
> > 
> > On Sat 16-03-13 14:42:05, K. Y. Srinivasan wrote:
> > > While ballooning memory out of the guest, attempt 2M allocations first.
> > > If 2M allocations fail, then go for 4K allocations. In cases where we
> > > have performed 2M allocations, split this 2M page so that we can free this
> > > page at 4K granularity (when the host returns the memory).
> > 
> > Maybe I am missing something but what is the advantage of 2M allocation
> > when you split it up immediately so you are not using it as a huge page?
> 
> The Hyper-V ballooning protocol specifies the pages being ballooned as
> page ranges - start_pfn: number_of_pfns. So, when the guest balloon
> is inflating and I am able to allocate 2M pages, I will be able to
> represent 512 contiguous pages in one 64 bit entry and this makes the
> ballooning operation that much more efficient. The reason I split the
> page is that the host does not guarantee that when it returns the
> memory to the guest, it will return in any particular granularity and
> so I have to be able to free this memory in 4K granularity. This is
> the corner case that I will have to handle.

Thanks for the clarification. I think this information would be valuable
in the changelog.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/9] migrate: add migrate_entry_wait_huge()

2013-03-18 Thread Michal Hocko

On Thu 21-02-13 14:41:40, Naoya Horiguchi wrote:
[...]
> diff --git v3.8.orig/mm/migrate.c v3.8/mm/migrate.c
> index 2fd8b4a..7d84f4c 100644
> --- v3.8.orig/mm/migrate.c
> +++ v3.8/mm/migrate.c
> @@ -236,6 +236,30 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t 
> *pmd,
>   pte_unmap_unlock(ptep, ptl);
>  }
>  
> +void migration_entry_wait_huge(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long address)
> +{
> + spinlock_t *ptl = pte_lockptr(mm, pmd);
> + pte_t pte;
> + swp_entry_t entry;
> + struct page *page;
> +
> + spin_lock(ptl);
> + pte = huge_ptep_get((pte_t *)pmd);
> + if (!is_hugetlb_entry_migration(pte))
> + goto out;
> + entry = pte_to_swp_entry(pte);
> + page = migration_entry_to_page(entry);
> + if (!get_page_unless_zero(page))
> + goto out;
> + spin_unlock(ptl);
> + wait_on_page_locked(page);
> + put_page(page);
> + return;
> +out:
> + spin_unlock(ptl);
> +}

This duplicates a lot of code from migration_entry_wait. Can we just
teach the generic one to be HugePage aware instead?
All it takes is just opencoding pte_offset_map_lock and calling
huge_ptep_get ofr HugePage and pte_offset_map otherwise.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/9] migrate: make core migration code aware of hugepage

2013-03-18 Thread Michal Hocko

On Thu 21-02-13 14:41:41, Naoya Horiguchi wrote:
[...]
> diff --git v3.8.orig/include/linux/mempolicy.h v3.8/include/linux/mempolicy.h
> index 0d7df39..2e475b5 100644
> --- v3.8.orig/include/linux/mempolicy.h
> +++ v3.8/include/linux/mempolicy.h
> @@ -173,7 +173,7 @@ extern int mpol_to_str(char *buffer, int maxlen, struct 
> mempolicy *pol);
>  /* Check if a vma is migratable */
>  static inline int vma_migratable(struct vm_area_struct *vma)
>  {
> - if (vma->vm_flags & (VM_IO | VM_HUGETLB | VM_PFNMAP))
> + if (vma->vm_flags & (VM_IO | VM_PFNMAP))
>   return 0;

Is this safe? At least check_*_range don't seem to be hugetlb aware.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/9] migrate: make core migration code aware of hugepage

2013-03-18 Thread Michal Hocko

On Mon 18-03-13 16:22:24, Michal Hocko wrote:
> On Thu 21-02-13 14:41:41, Naoya Horiguchi wrote:
> [...]
> > diff --git v3.8.orig/include/linux/mempolicy.h 
> > v3.8/include/linux/mempolicy.h
> > index 0d7df39..2e475b5 100644
> > --- v3.8.orig/include/linux/mempolicy.h
> > +++ v3.8/include/linux/mempolicy.h
> > @@ -173,7 +173,7 @@ extern int mpol_to_str(char *buffer, int maxlen, struct 
> > mempolicy *pol);
> >  /* Check if a vma is migratable */
> >  static inline int vma_migratable(struct vm_area_struct *vma)
> >  {
> > -   if (vma->vm_flags & (VM_IO | VM_HUGETLB | VM_PFNMAP))
> > +   if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> > return 0;
> 
> Is this safe? At least check_*_range don't seem to be hugetlb aware.

Ohh, they become in 5/9. Should that one be reordered then?
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/9] migrate: enable migrate_pages() to migrate hugepage

2013-03-18 Thread Michal Hocko

On Thu 21-02-13 14:41:44, Naoya Horiguchi wrote:
> This patch extends check_range() to handle vma with VM_HUGETLB set.
> With this changes, we can migrate hugepage with migrate_pages(2).
> Note that for larger hugepages (covered by pud entries, 1GB for
> x86_64 for example), we simply skip it now.
> 
> Signed-off-by: Naoya Horiguchi 
> ---
>  include/linux/hugetlb.h |  6 --
>  mm/hugetlb.c| 10 ++
>  mm/mempolicy.c  | 46 ++
>  3 files changed, 48 insertions(+), 14 deletions(-)
> 
> diff --git v3.8.orig/include/linux/hugetlb.h v3.8/include/linux/hugetlb.h
> index 8f87115..eb33df5 100644
> --- v3.8.orig/include/linux/hugetlb.h
> +++ v3.8/include/linux/hugetlb.h
> @@ -69,6 +69,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long 
> offset, long freed);
>  int dequeue_hwpoisoned_huge_page(struct page *page);
>  void putback_active_hugepage(struct page *page);
>  void putback_active_hugepages(struct list_head *l);
> +void migrate_hugepage_add(struct page *page, struct list_head *list);
>  void copy_huge_page(struct page *dst, struct page *src);
>  
>  extern unsigned long hugepages_treat_as_movable;
> @@ -88,8 +89,8 @@ struct page *follow_huge_pmd(struct mm_struct *mm, unsigned 
> long address,
>   pmd_t *pmd, int write);
>  struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
>   pud_t *pud, int write);
> -int pmd_huge(pmd_t pmd);
> -int pud_huge(pud_t pmd);
> +extern int pmd_huge(pmd_t pmd);
> +extern int pud_huge(pud_t pmd);

extern is not needed here.

>  unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>   unsigned long address, unsigned long end, pgprot_t newprot);
>  
> @@ -134,6 +135,7 @@ static inline int dequeue_hwpoisoned_huge_page(struct 
> page *page)
>  
>  #define putback_active_hugepage(p) 0
>  #define putback_active_hugepages(l) 0
> +#define migrate_hugepage_add(p, l) 0
>  static inline void copy_huge_page(struct page *dst, struct page *src)
>  {
>  }
> diff --git v3.8.orig/mm/hugetlb.c v3.8/mm/hugetlb.c
> index cb9d43b8..86ffcb7 100644
> --- v3.8.orig/mm/hugetlb.c
> +++ v3.8/mm/hugetlb.c
> @@ -3202,3 +3202,13 @@ void putback_active_hugepages(struct list_head *l)
>   list_for_each_entry_safe(page, page2, l, lru)
>   putback_active_hugepage(page);
>  }
> +
> +void migrate_hugepage_add(struct page *page, struct list_head *list)
> +{
> + VM_BUG_ON(!PageHuge(page));
> + get_page(page);
> + spin_lock(&hugetlb_lock);

Why hugetlb_lock? Comment for this lock says that it protects
hugepage_freelists, nr_huge_pages, and free_huge_pages.

> + list_move_tail(&page->lru, list);
> + spin_unlock(&hugetlb_lock);
> + return;
> +}
> diff --git v3.8.orig/mm/mempolicy.c v3.8/mm/mempolicy.c
> index e2df1c1..8627135 100644
> --- v3.8.orig/mm/mempolicy.c
> +++ v3.8/mm/mempolicy.c
> @@ -525,6 +525,27 @@ static int check_pte_range(struct vm_area_struct *vma, 
> pmd_t *pmd,
>   return addr != end;
>  }
>  
> +static void check_hugetlb_pmd_range(struct vm_area_struct *vma, pmd_t *pmd,
> + const nodemask_t *nodes, unsigned long flags,
> + void *private)
> +{
> +#ifdef CONFIG_HUGETLB_PAGE
> + int nid;
> + struct page *page;
> +
> + spin_lock(&vma->vm_mm->page_table_lock);
> + page = pte_page(huge_ptep_get((pte_t *)pmd));
> + spin_unlock(&vma->vm_mm->page_table_lock);

I am a bit confused why page_table_lock is used here and why it doesn't
cover the page usage.

> + nid = page_to_nid(page);
> + if (node_isset(nid, *nodes) != !!(flags & MPOL_MF_INVERT)
> + && ((flags & MPOL_MF_MOVE && page_mapcount(page) == 1)
> + || flags & MPOL_MF_MOVE_ALL))
> + migrate_hugepage_add(page, private);
> +#else
> + BUG();
> +#endif
> +}
> +
>  static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
>   unsigned long addr, unsigned long end,
>   const nodemask_t *nodes, unsigned long flags,
> @@ -536,6 +557,11 @@ static inline int check_pmd_range(struct vm_area_struct 
> *vma, pud_t *pud,
>   pmd = pmd_offset(pud, addr);
>   do {
>   next = pmd_addr_end(addr, end);
> + if (pmd_huge(*pmd) && is_vm_hugetlb_page(vma)) {

Why an explicit check for is_vm_hugetlb_page here? Isn't pmd_huge()
sufficient?

> + check_hugetlb_pmd_range(vma, pmd, nodes,
> + flags, private);
> +

Re: [PATCH 9/9] remove /proc/sys/vm/hugepages_treat_as_movable

2013-03-18 Thread Michal Hocko

 * have no page reserves. This check ensures that reservations are
> @@ -558,7 +557,7 @@ static struct page *dequeue_huge_page_vma(struct hstate 
> *h,
>  
>   for_each_zone_zonelist_nodemask(zone, z, zonelist,
>   MAX_NR_ZONES - 1, nodemask) {
> - if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
> + if (cpuset_zone_allowed_softwall(zone, GFP_HIGHUSER_MOVABLE)) {
>   page = dequeue_huge_page_node(h, zone_to_nid(zone));
>   if (page) {
>   if (!avoid_reserve)
> @@ -698,7 +697,7 @@ static struct page *alloc_fresh_huge_page_node(struct 
> hstate *h, int nid)
>   return NULL;
>  
>   page = alloc_pages_exact_node(nid,
> - htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
> + GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_THISNODE|
>   __GFP_REPEAT|__GFP_NOWARN,
>   huge_page_order(h));
>   if (page) {
> @@ -909,12 +908,12 @@ static struct page *alloc_buddy_huge_page(struct hstate 
> *h, int nid)
>   spin_unlock(&hugetlb_lock);
>  
>   if (nid == NUMA_NO_NODE)
> - page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
> + page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|
>  __GFP_REPEAT|__GFP_NOWARN,
>  huge_page_order(h));
>   else
>   page = alloc_pages_exact_node(nid,
> - htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
> + GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_THISNODE|
>   __GFP_REPEAT|__GFP_NOWARN, huge_page_order(h));
>  
>   if (page && arch_prepare_hugepage(page)) {
> @@ -2078,18 +2077,6 @@ int hugetlb_mempolicy_sysctl_handler(struct ctl_table 
> *table, int write,
>  }
>  #endif /* CONFIG_NUMA */
>  
> -int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
> - void __user *buffer,
> - size_t *length, loff_t *ppos)
> -{
> - proc_dointvec(table, write, buffer, length, ppos);
> - if (hugepages_treat_as_movable)
> - htlb_alloc_mask = GFP_HIGHUSER_MOVABLE;
> - else
> - htlb_alloc_mask = GFP_HIGHUSER;
> - return 0;
> -}
> -
>  int hugetlb_overcommit_handler(struct ctl_table *table, int write,
>   void __user *buffer,
>   size_t *length, loff_t *ppos)
> -- 
> 1.7.11.7
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 8/9] memory-hotplug: enable memory hotplug to handle hugepage

2013-03-18 Thread Michal Hocko

for (pfn = start_pfn; pfn < end_pfn; pfn += step)
> + dissolve_free_huge_page(pfn_to_page(pfn));
> +}
> +
>  static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
>  {
>   struct page *page;
> @@ -3158,6 +3182,25 @@ static int is_hugepage_on_freelist(struct page *hpage)
>   return 0;
>  }
>  
> +/* Returns true for head pages of in-use hugepages, otherwise returns false. 
> */
> +int is_hugepage_movable(struct page *hpage)
> +{
> + struct page *page;
> + struct page *tmp;
> + struct hstate *h = page_hstate(hpage);
> + int ret = 0;
> +
> + VM_BUG_ON(!PageHuge(hpage));
> + if (PageTail(hpage))
> + return 0;
> + spin_lock(&hugetlb_lock);
> + list_for_each_entry_safe(page, tmp, &h->hugepage_activelist, lru)
> + if (page == hpage)
> + ret = 1;
> + spin_unlock(&hugetlb_lock);
> + return ret;
> +}
> +
>  /*
>   * This function is called from memory failure code.
>   * Assume the caller holds page lock of the head page.
> diff --git v3.8.orig/mm/memory_hotplug.c v3.8/mm/memory_hotplug.c
> index d04ed87..6418de2 100644
> --- v3.8.orig/mm/memory_hotplug.c
> +++ v3.8/mm/memory_hotplug.c
> @@ -29,6 +29,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -985,10 +986,12 @@ static int test_pages_in_a_zone(unsigned long 
> start_pfn, unsigned long end_pfn)
>  }
>  
>  /*
> - * Scanning pfn is much easier than scanning lru list.
> - * Scan pfn from start to end and Find LRU page.
> + * Scan pfn range [start,end) to find movable/migratable pages (LRU pages
> + * and hugepages). We scan pfn because it's much easier than scanning over
> + * linked list. This function returns the pfn of the first found movable
> + * page if it's found, otherwise 0.
>   */
> -static unsigned long scan_lru_pages(unsigned long start, unsigned long end)
> +static unsigned long scan_movable_pages(unsigned long start, unsigned long 
> end)
>  {
>   unsigned long pfn;
>   struct page *page;
> @@ -997,6 +1000,12 @@ static unsigned long scan_lru_pages(unsigned long 
> start, unsigned long end)
>   page = pfn_to_page(pfn);
>   if (PageLRU(page))
>   return pfn;
> + if (PageHuge(page)) {
> + if (is_hugepage_movable(page))
> + return pfn;
> + else
> + pfn += (1 << compound_order(page)) - 1;
> + }

scan_lru_pages's name gets really confusing after this change because
hugetlb pages are not on the LRU. Maybe it would be good to rename it.

>   }
>   }
>   return 0;
> @@ -1019,6 +1028,30 @@ do_migrate_range(unsigned long start_pfn, unsigned 
> long end_pfn)
>   page = pfn_to_page(pfn);
>   if (!get_page_unless_zero(page))
>   continue;

All tail pages have 0 reference count (according to prep_compound_page)
so they would be skipped anyway. This makes the below pfn tweaks
pointless.

> + if (PageHuge(page)) {
> + /*
> +  * Larger hugepage (1GB for x86_64) is larger than
> +  * memory block, so pfn scan can start at the tail
> +  * page of larger hugepage. In such case,
> +  * we simply skip the hugepage and move the cursor
> +  * to the last tail page.
> +  */
> + if (PageTail(page)) {
> + struct page *head = compound_head(page);
> + pfn = page_to_pfn(head) +
> + (1 << compound_order(head)) - 1;
> + put_page(page);
> +     continue;
> + }
> + pfn = (1 << compound_order(page)) - 1;
> + if (huge_page_size(page_hstate(page)) != PMD_SIZE) {
> + put_page(page);
> + continue;
> + }

There might be other hugepage sizes which fit into memblock so this test
doesn't seem right.

> + list_move_tail(&page->lru, &source);
> + move_pages -= 1 << compound_order(page);
> + continue;
> + }
>   /*
>* We can skip free pages. And we can only deal with pages on
>* LRU.
[...]
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] Drivers: hv: balloon: Support 2M page allocations for ballooning

2013-03-18 Thread Michal Hocko

On Mon 18-03-13 15:08:36, KY Srinivasan wrote:
> What is your recommendation with regards which tree the mm patch needs
> to go through; the Hyper-V balloon driver patch will go through Greg's
> tree.

I would say via Andrew but there are dependencies between those two so a
single tree would be less confusing /me thinks.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/9] migrate: enable migrate_pages() to migrate hugepage

2013-03-19 Thread Michal Hocko

On Mon 18-03-13 20:07:16, Naoya Horiguchi wrote:
> On Mon, Mar 18, 2013 at 04:40:57PM +0100, Michal Hocko wrote:
> > On Thu 21-02-13 14:41:44, Naoya Horiguchi wrote:
[...]
> > > @@ -3202,3 +3202,13 @@ void putback_active_hugepages(struct list_head *l)
> > >   list_for_each_entry_safe(page, page2, l, lru)
> > >   putback_active_hugepage(page);
> > >  }
> > > +
> > > +void migrate_hugepage_add(struct page *page, struct list_head *list)
> > > +{
> > > + VM_BUG_ON(!PageHuge(page));
> > > + get_page(page);
> > > + spin_lock(&hugetlb_lock);
> > 
> > Why hugetlb_lock? Comment for this lock says that it protects
> > hugepage_freelists, nr_huge_pages, and free_huge_pages.
> 
> I think that this comment is out of date and hugepage_activelists,
> which was introduced recently, should be protected because this
> patchset adds is_hugepage_movable() which runs through the list.
> So I'll update the comment in the next post.
> 
> > > + list_move_tail(&page->lru, list);
> > > + spin_unlock(&hugetlb_lock);
> > > + return;
> > > +}
> > > diff --git v3.8.orig/mm/mempolicy.c v3.8/mm/mempolicy.c
> > > index e2df1c1..8627135 100644
> > > --- v3.8.orig/mm/mempolicy.c
> > > +++ v3.8/mm/mempolicy.c
> > > @@ -525,6 +525,27 @@ static int check_pte_range(struct vm_area_struct 
> > > *vma, pmd_t *pmd,
> > >   return addr != end;
> > >  }
> > >  
> > > +static void check_hugetlb_pmd_range(struct vm_area_struct *vma, pmd_t 
> > > *pmd,
> > > + const nodemask_t *nodes, unsigned long flags,
> > > + void *private)
> > > +{
> > > +#ifdef CONFIG_HUGETLB_PAGE
> > > + int nid;
> > > + struct page *page;
> > > +
> > > + spin_lock(&vma->vm_mm->page_table_lock);
> > > + page = pte_page(huge_ptep_get((pte_t *)pmd));
> > > + spin_unlock(&vma->vm_mm->page_table_lock);
> > 
> > I am a bit confused why page_table_lock is used here and why it doesn't
> > cover the page usage.
> 
> I expected this function to do the same for pmd as check_pte_range() does
> for pte, but the above code didn't do it. I should've put spin_unlock
> below migrate_hugepage_add(). Sorry for the confusion.

OK, I see. So you want to prevent from racing with pmd unmap.

> > > + nid = page_to_nid(page);
> > > + if (node_isset(nid, *nodes) != !!(flags & MPOL_MF_INVERT)
> > > + && ((flags & MPOL_MF_MOVE && page_mapcount(page) == 1)
> > > + || flags & MPOL_MF_MOVE_ALL))
> > > + migrate_hugepage_add(page, private);
> > > +#else
> > > + BUG();
> > > +#endif
> > > +}
> > > +
> > >  static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
> > >   unsigned long addr, unsigned long end,
> > >   const nodemask_t *nodes, unsigned long flags,
> > > @@ -536,6 +557,11 @@ static inline int check_pmd_range(struct 
> > > vm_area_struct *vma, pud_t *pud,
> > >   pmd = pmd_offset(pud, addr);
> > >   do {
> > >   next = pmd_addr_end(addr, end);
> > > + if (pmd_huge(*pmd) && is_vm_hugetlb_page(vma)) {
> > 
> > Why an explicit check for is_vm_hugetlb_page here? Isn't pmd_huge()
> > sufficient?
> 
> I think we need both check here because if we use only pmd_huge(),
> pmd for thp goes into this branch wrongly. 

Bahh. You are right. I thought that pmd_huge is hugetlb thingy but it
obviously checks only _PAGE_PSE same as pmd_large() which is really
unfortunate and confusing. Can we make it hugetlb specific?

> 
> Thanks,
> Naoya
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop

2013-03-19 Thread Michal Hocko

On Tue 19-03-13 11:08:23, Simon Jeons wrote:
> Hi Mel,
> On 03/17/2013 09:04 PM, Mel Gorman wrote:
> >kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX
> >pages have been reclaimed or the pgdat is considered balanced. It then
> >rechecks if it needs to restart at DEF_PRIORITY and whether high-order
> >reclaim needs to be reset. This is not wrong per-se but it is confusing
> 
> per-se is short for what?
> 
> >to follow and forcing kswapd to stay at DEF_PRIORITY may require several
> >restarts before it has scanned enough pages to meet the high watermark even
> >at 100% efficiency. This patch irons out the logic a bit by controlling
> >when priority is raised and removing the "goto loop_again".
> >
> >This patch has kswapd raise the scanning priority until it is scanningmm: 
> >vmscan: Flatten kswapd priority loop
> >enough pages that it could meet the high watermark in one shrink of the
> >LRU lists if it is able to reclaim at 100% efficiency. It will not raise
> 
> Which kind of reclaim can be treated as 100% efficiency?

nr_scanned == nr_reclaimed
 
> >the scanning prioirty higher unless it is failing to reclaim any pages.
> >
> >To avoid infinite looping for high-order allocation requests kswapd will
> >not reclaim for high-order allocations when it has reclaimed at least
> >twice the number of pages as the allocation request.
> >
> >Signed-off-by: Mel Gorman 
[...]
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] memcg: do not check for do_swap_account in mem_cgroup_{read,write,reset}

2013-03-19 Thread Michal Hocko

since 2d11085e (memcg: do not create memsw files if swap accounting
is disabled) memsw files are created only if memcg swap accounting is
enabled so there doesn't make any sense to check for it explicitely in
mem_cgroup_read, mem_cgroup_write and mem_cgroup_reset.

Signed-off-by: Michal Hocko 
---
 mm/memcontrol.c |9 -
 1 file changed, 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2e01167..f608546 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5091,9 +5091,6 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, 
struct cftype *cft,
type = MEMFILE_TYPE(cft->private);
name = MEMFILE_ATTR(cft->private);
 
-   if (!do_swap_account && type == _MEMSWAP)
-   return -EOPNOTSUPP;
-
switch (type) {
case _MEM:
if (name == RES_USAGE)
@@ -5329,9 +5326,6 @@ static int mem_cgroup_write(struct cgroup *cont, struct 
cftype *cft,
type = MEMFILE_TYPE(cft->private);
name = MEMFILE_ATTR(cft->private);
 
-   if (!do_swap_account && type == _MEMSWAP)
-   return -EOPNOTSUPP;
-
switch (name) {
case RES_LIMIT:
if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
@@ -5408,9 +5402,6 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned 
int event)
type = MEMFILE_TYPE(event);
name = MEMFILE_ATTR(event);
 
-   if (!do_swap_account && type == _MEMSWAP)
-   return -EOPNOTSUPP;
-
switch (name) {
case RES_MAX_USAGE:
if (type == _MEM)
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH V2 1/3] mm: Export split_page()

2013-03-19 Thread Michal Hocko

On Mon 18-03-13 13:51:36, K. Y. Srinivasan wrote:
> This symbol would be used in the Hyper-V balloon driver to support 2M
> allocations.
> 
> In this version of the patch, based on feedback from Michal Hocko
> , I have updated the patch description.

I guess this part is not necessary ;)

> 
> Signed-off-by: K. Y. Srinivasan 

Anyway
Acked-by: Michal Hocko 

> ---
>  mm/page_alloc.c |1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6cacfee..7e0ead6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1404,6 +1404,7 @@ void split_page(struct page *page, unsigned int order)
>   for (i = 1; i < (1 << order); i++)
>   set_page_refcounted(page + i);
>  }
> +EXPORT_SYMBOL_GPL(split_page);
>  
>  static int __isolate_free_page(struct page *page, unsigned int order)
>  {
> -- 
> 1.7.4.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH V2 2/3] Drivers: hv: balloon: Support 2M page allocations for ballooning

2013-03-19 Thread Michal Hocko

On Mon 18-03-13 13:51:37, K. Y. Srinivasan wrote:
> On Hyper-V it will be very efficient to use 2M allocations in the guest as 
> this
> makes the ballooning protocol with the host that much more efficient. Hyper-V
> uses page ranges (start pfn : number of pages) to specify memory being moved
> around and with 2M pages this encoding can be very efficient. However, when
> memory is returned to the guest, the host does not guarantee any granularity.
> To deal with this issue, split the page soon after a successful 2M allocation
> so that this memory can potentially be freed as 4K pages.

How many pages are requested usually?

> If 2M allocations fail, we revert to 4K allocations.
> 
> In this version of the patch, based on the feedback from Michal Hocko
> , I have added some additional commentary to the patch
> description. 
> 
> Signed-off-by: K. Y. Srinivasan 

I am not going to ack the patch because I am still not entirely
convinced that big allocations are worth it. But that is up to you and
hyper-V users.

> ---
>  drivers/hv/hv_balloon.c |   18 --
>  1 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
> index 2cf7d4e..71655b4 100644
> --- a/drivers/hv/hv_balloon.c
> +++ b/drivers/hv/hv_balloon.c
> @@ -997,6 +997,14 @@ static int  alloc_balloon_pages(struct hv_dynmem_device 
> *dm, int num_pages,
>  
>   dm->num_pages_ballooned += alloc_unit;
>  
> + /*
> +  * If we allocatted 2M pages; split them so we
> +  * can free them in any order we get.
> +  */
> +
> + if (alloc_unit != 1)
> + split_page(pg, get_order(alloc_unit << PAGE_SHIFT));
> +
>   bl_resp->range_count++;
>   bl_resp->range_array[i].finfo.start_page =
>   page_to_pfn(pg);

I would suggest also using __GFP_NO_KSWAPD (or basically use
GFP_TRANSHUGE for alloc_unit>0) for the allocation to be as least
disruptive as possible.

> @@ -1023,9 +1031,10 @@ static void balloon_up(struct work_struct *dummy)
>  
>  
>   /*
> -  * Currently, we only support 4k allocations.
> +  * We will attempt 2M allocations. However, if we fail to
> +  * allocate 2M chunks, we will go back to 4k allocations.
>*/
> - alloc_unit = 1;
> + alloc_unit = 512;
>  
>   while (!done) {
>   bl_resp = (struct dm_balloon_response *)send_buffer;
> @@ -1041,6 +1050,11 @@ static void balloon_up(struct work_struct *dummy)
>   bl_resp, alloc_unit,
>&alloc_error);
>  

You should handle alloc_balloon_pages returns 0 && !alloc_error which
happens when num_pages < alloc_unit.

> + if ((alloc_error) && (alloc_unit != 1)) {
> + alloc_unit = 1;
> + continue;
> + }
> +
>   if ((alloc_error) || (num_ballooned == num_pages)) {
>   bl_resp->more_pages = 0;
>   done = true;
> -- 
> 1.7.4.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [bugfix] mm: zone_end_pfn is too small

2013-03-19 Thread Michal Hocko

Adding Andrew

On Mon 18-03-13 10:37:05, Russ Anderson wrote:
> Booting with 32 TBytes memory hits BUG at mm/page_alloc.c:552! (output below).
> 
> The key hint is "page 4294967296 outside zone".
> 4294967296 = 0x1 (bit 32 is set).
> 
> The problem is in include/linux/mmzone.h:
> 
> 530 static inline unsigned zone_end_pfn(const struct zone *zone)
> 531 {
> 532 return zone->zone_start_pfn + zone->spanned_pages;
> 533 }
> 
> zone_end_pfn is "unsigned" (32 bits).  Changing it to 
> "unsigned long" (64 bits) fixes the problem.
> 
> zone_end_pfn() was added recently in commit 
> 108bcc96ef7047c02cad4d229f04da38186a3f3f.
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/include/linux/mmzone.h?id=108bcc96ef7047c02cad4d229f04da38186a3f3f
> 
> 
> Output from the failure.
> 
>   No AGP bridge found
>   page 4294967296 outside zone [ 4294967296 - 4327469056 ]
>   [ cut here ]
>   kernel BUG at mm/page_alloc.c:552!
>   invalid opcode:  [#1] SMP 
>   Modules linked in:
>   CPU 0 
>   Pid: 0, comm: swapper Not tainted 3.9.0-rc2.dtp+ #10  
>   RIP: 0010:[]  [] 
> free_one_page+0x382/0x430
>   RSP: :81943d98  EFLAGS: 00010002
>   RAX: 0001 RBX: ea40 RCX: b3d9
>   RDX:  RSI: 0086 RDI: 0046
>   RBP: 81943df8 R08: 0040 R09: 0023
>   R10: 34bf R11: 34bf R12: ea40
>   R13: 981efefd9d80 R14: 0006 R15: 0006
>   FS:  () GS:c900() knlGS:
>   CS:  0010 DS:  ES:  CR0: 80050033
>   CR2: c7defefff000 CR3: 0194e000 CR4: 000406b0
>   DR0:  DR1:  DR2: 
>   DR3:  DR6: 0ff0 DR7: 0400
>   Process swapper (pid: 0, threadinfo 81942000, task 81955420)
>   Stack:
>00ff c90100a0 c90100d0 0040
> 81148809 0097 ea40
>0006 0002 000101e82bc0 ea401000
>   Call Trace:
>[] __free_pages_ok+0x96/0xb0
>[] __free_pages+0x25/0x50
>[] __free_pages_bootmem+0x8a/0x8c
>[] __free_memory_core+0xea/0x131
>[] free_low_memory_core_early+0x4a/0x98
>[] free_all_bootmem+0x45/0x47
>[] mem_init+0x7b/0x14c
>[] start_kernel+0x216/0x433
>[] ? repair_env_string+0x5b/0x5b
>[] x86_64_start_reservations+0x2a/0x2c
>[] x86_64_start_kernel+0x144/0x153
>   Code: 89 f1 ba 01 00 00 00 31 f6 d3 e2 4c 89 ef e8 66 a4 01 00 e9 2c fe ff 
> ff 0f 0b eb fe 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 eb f3 <0f> 0b eb fe 0f 
> 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe 49 
>   RIP  [] free_one_page+0x382/0x430
>RSP 
>   ---[ end trace a7919e7f17c0a725 ]---
>   Kernel panic - not syncing: Attempted to kill the idle task!
> 
> Signed-off-by: Russ Anderson 
> Reported-by: George Beshers 
> Acked-by: Hedi Berriche 

makes sense and all the users already are unsingle long AFAICS

Acked-by: Michal Hocko 

> 
> ---
>  include/linux/mmzone.h |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux/include/linux/mmzone.h
> ===
> --- linux.orig/include/linux/mmzone.h 2013-03-18 10:06:59.744082190 -0500
> +++ linux/include/linux/mmzone.h  2013-03-18 10:23:27.374031648 -0500
> @@ -527,7 +527,7 @@ static inline int zone_is_oom_locked(con
>   return test_bit(ZONE_OOM_LOCKED, &zone->flags);
>  }
>  
> -static inline unsigned zone_end_pfn(const struct zone *zone)
> +static inline unsigned long zone_end_pfn(const struct zone *zone)
>  {
>   return zone->zone_start_pfn + zone->spanned_pages;
>  }
> -- 
> Russ Anderson, OS RAS/Partitioning Project Lead  
> SGI - Silicon Graphics Inc  r...@sgi.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/9] migrate: enable migrate_pages() to migrate hugepage

2013-03-20 Thread Michal Hocko

On Wed 20-03-13 02:12:54, Naoya Horiguchi wrote:
> On Tue, Mar 19, 2013 at 08:11:13AM +0100, Michal Hocko wrote:
> > On Mon 18-03-13 20:07:16, Naoya Horiguchi wrote:
> > > On Mon, Mar 18, 2013 at 04:40:57PM +0100, Michal Hocko wrote:
> > > > On Thu 21-02-13 14:41:44, Naoya Horiguchi wrote:
> ...
> > > > > @@ -536,6 +557,11 @@ static inline int check_pmd_range(struct 
> > > > > vm_area_struct *vma, pud_t *pud,
> > > > >   pmd = pmd_offset(pud, addr);
> > > > >   do {
> > > > >   next = pmd_addr_end(addr, end);
> > > > > + if (pmd_huge(*pmd) && is_vm_hugetlb_page(vma)) {
> > > > 
> > > > Why an explicit check for is_vm_hugetlb_page here? Isn't pmd_huge()
> > > > sufficient?
> > > 
> > > I think we need both check here because if we use only pmd_huge(),
> > > pmd for thp goes into this branch wrongly. 
> > 
> > Bahh. You are right. I thought that pmd_huge is hugetlb thingy but it
> > obviously checks only _PAGE_PSE same as pmd_large() which is really
> > unfortunate and confusing. Can we make it hugetlb specific?
> 
> I agree that we had better fix this confusion.
> 
> What pmd_huge() (or pmd_large() in some architectures) does is just
> checking whether a given pmd is pointing to huge/large page or not.
> It does not say which type of hugepage it is.
> So it shouldn't be used to decide whether the hugepage are hugetlbfs or not.
> I think it would be better to introduce pmd_hugetlb() which has pmd and vma
> as arguments and returns true only for hugetlbfs pmd.
> Checking pmd_hugetlb() should come before checking pmd_trans_huge() because
> pmd_trans_huge() implicitly assumes that the vma which covers the virtual
> address of a given pmd is not hugetlbfs vma.
> 
> I'm interested in this cleanup, so will work on it after this patchset.

pnd_huge is used only at few places so it shouldn't be very big. On the
other hand you do not have vma always available so it is getting tricky.

Thanks
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 8/9] memory-hotplug: enable memory hotplug to handle hugepage

2013-03-20 Thread Michal Hocko

On Tue 19-03-13 23:55:33, Naoya Horiguchi wrote:
> On Mon, Mar 18, 2013 at 05:07:37PM +0100, Michal Hocko wrote:
> > On Thu 21-02-13 14:41:47, Naoya Horiguchi wrote:
[...]
> > > As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
> > > over them because it's larger than memory block. So we now simply leave
> > > it to fail as it is.
> > 
> > What we could do is to check whether there is a free gb huge page on
> > other node and migrate there.
> 
> Correct, and 1GB page migration needs more code in migration core code
> (mainly it's related to migration entry in pud) and enough testing,
> so I want to do it in separate patchset.

Sure, this was just a note that it is achievable not that it has to be
done in the patchset.

[...]
> > > diff --git v3.8.orig/mm/hugetlb.c v3.8/mm/hugetlb.c
> > > index ccf9995..c28e6c9 100644
> > > --- v3.8.orig/mm/hugetlb.c
> > > +++ v3.8/mm/hugetlb.c
> > > @@ -843,6 +843,30 @@ static int free_pool_huge_page(struct hstate *h, 
> > > nodemask_t *nodes_allowed,
> > >   return ret;
> > >  }
> > >  
> > > +/* Dissolve a given free hugepage into free pages. */
> > > +void dissolve_free_huge_page(struct page *page)
> > > +{
> > > + if (PageHuge(page) && !page_count(page)) {
> > 
> > Could you clarify why you are cheking page_count here? I assume it is to
> > make sure the page is free but what prevents it being increased before
> > you take hugetlb_lock?
> 
> There's nothing to prevent it, so it's not safe to check refcount outside
> hugetlb_lock.

OK, so the lock has to be moved up.

[...]
> > > diff --git v3.8.orig/mm/memory_hotplug.c v3.8/mm/memory_hotplug.c
> > > index d04ed87..6418de2 100644
> > > --- v3.8.orig/mm/memory_hotplug.c
> > > +++ v3.8/mm/memory_hotplug.c
> > > @@ -29,6 +29,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  
> > >  #include 
> > >  
> > > @@ -985,10 +986,12 @@ static int test_pages_in_a_zone(unsigned long 
> > > start_pfn, unsigned long end_pfn)
> > >  }
> > >  
> > >  /*
> > > - * Scanning pfn is much easier than scanning lru list.
> > > - * Scan pfn from start to end and Find LRU page.
> > > + * Scan pfn range [start,end) to find movable/migratable pages (LRU pages
> > > + * and hugepages). We scan pfn because it's much easier than scanning 
> > > over
> > > + * linked list. This function returns the pfn of the first found movable
> > > + * page if it's found, otherwise 0.
> > >   */
> > > -static unsigned long scan_lru_pages(unsigned long start, unsigned long 
> > > end)
> > > +static unsigned long scan_movable_pages(unsigned long start, unsigned 
> > > long end)
> > >  {
> > >   unsigned long pfn;
> > >   struct page *page;
> > > @@ -997,6 +1000,12 @@ static unsigned long scan_lru_pages(unsigned long 
> > > start, unsigned long end)
> > >   page = pfn_to_page(pfn);
> > >   if (PageLRU(page))
> > >   return pfn;
> > > + if (PageHuge(page)) {
> > > + if (is_hugepage_movable(page))
> > > + return pfn;
> > > + else
> > > + pfn += (1 << compound_order(page)) - 1;
> > > + }
> > 
> > scan_lru_pages's name gets really confusing after this change because
> > hugetlb pages are not on the LRU. Maybe it would be good to rename it.
> 
> Yes, and that's done in right above chunk.

bahh, I am blind. I got confused by the name in the hunk header. Sorry
about that.

> 
> > 
> > >   }
> > >   }
> > >   return 0;
> > > @@ -1019,6 +1028,30 @@ do_migrate_range(unsigned long start_pfn, unsigned 
> > > long end_pfn)
> > >   page = pfn_to_page(pfn);
> > >   if (!get_page_unless_zero(page))
> > >   continue;
> > 
> > All tail pages have 0 reference count (according to prep_compound_page)
> > so they would be skipped anyway. This makes the below pfn tweaks
> > pointless.
> 
> I was totally mistaken about what we should do here, sorry. If we call
> do_migrate_range() for 1GB hugepage, we should return with error (maybe 
> -EBUSY)
> instead of just skipping it, otherwise the caller __o

Re: [patch] mm, hugetlb: include hugepages in meminfo

2013-03-20 Thread Michal Hocko

On Tue 19-03-13 17:18:12, David Rientjes wrote:
> Particularly in oom conditions, it's troublesome that hugetlb memory is 
> not displayed.  All other meminfo that is emitted will not add up to what 
> is expected, and there is no artifact left in the kernel log to show that 
> a potentially significant amount of memory is actually allocated as 
> hugepages which are not available to be reclaimed.

Yes, I like the idea. It's bitten me already in the past.

The only objection I have is that you print only default_hstate. You
just need to wrap your for_each_node_state by for_each_hstate to do
that.  With that applied, feel free to add my
Acked-by: Michal Hocko 

Thanks

> Booting with hugepages=8192 on the command line, this memory is now shown 
> in oom conditions.  For example, with echo m > /proc/sysrq-trigger:
> 
> Node 0 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 
> hugepages_size=2048kB
> Node 1 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 
> hugepages_size=2048kB
> Node 2 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 
> hugepages_size=2048kB
> Node 3 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 
> hugepages_size=2048kB
> 
> Signed-off-by: David Rientjes 
> ---
>  include/linux/hugetlb.h |  4 
>  mm/hugetlb.c| 14 ++
>  mm/page_alloc.c |  3 +++
>  3 files changed, 21 insertions(+)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -58,6 +58,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct 
> vm_area_struct *vma,
>  int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
>  void hugetlb_report_meminfo(struct seq_file *);
>  int hugetlb_report_node_meminfo(int, char *);
> +void hugetlb_show_meminfo(void);
>  unsigned long hugetlb_total_pages(void);
>  int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   unsigned long address, unsigned int flags);
> @@ -114,6 +115,9 @@ static inline void hugetlb_report_meminfo(struct seq_file 
> *m)
>  {
>  }
>  #define hugetlb_report_node_meminfo(n, buf)  0
> +static inline void hugetlb_show_meminfo(void)
> +{
> +}
>  #define follow_huge_pmd(mm, addr, pmd, write)NULL
>  #define follow_huge_pud(mm, addr, pud, write)NULL
>  #define prepare_hugepage_range(file, addr, len)  (-EINVAL)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2121,6 +2121,20 @@ int hugetlb_report_node_meminfo(int nid, char *buf)
>   nid, h->surplus_huge_pages_node[nid]);
>  }
>  
> +void hugetlb_show_meminfo(void)
> +{
> + struct hstate *h = &default_hstate;
> + int nid;
> +
> + for_each_node_state(nid, N_MEMORY)
> + pr_info("Node %d hugepages_total=%u hugepages_free=%u 
> hugepages_surp=%u hugepages_size=%lukB\n",
> + nid,
> + h->nr_huge_pages_node[nid],
> + h->free_huge_pages_node[nid],
> + h->surplus_huge_pages_node[nid],
> + 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
> +}
> +
>  /* Return the number pages of memory we physically have, in PAGE_SIZE units. 
> */
>  unsigned long hugetlb_total_pages(void)
>  {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -58,6 +58,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include 
> @@ -3105,6 +3106,8 @@ void show_free_areas(unsigned int filter)
>   printk("= %lukB\n", K(total));
>   }
>  
> + hugetlb_show_meminfo();
> +
>   printk("%ld total pagecache pages\n", global_page_state(NR_FILE_PAGES));
>  
>   show_swap_cache_info();
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-03-20 Thread Michal Hocko

On Sun 17-03-13 13:04:07, Mel Gorman wrote:
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 88c5fed..4835a7a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2593,6 +2593,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int 
> order, long remaining,
>  }
>  
>  /*
> + * kswapd shrinks the zone by the number of pages required to reach
> + * the high watermark.
> + */
> +static void kswapd_shrink_zone(struct zone *zone,
> +struct scan_control *sc,
> +unsigned long lru_pages)
> +{
> + unsigned long nr_slab;
> + struct reclaim_state *reclaim_state = current->reclaim_state;
> + struct shrink_control shrink = {
> + .gfp_mask = sc->gfp_mask,
> + };
> +
> + /* Reclaim above the high watermark. */
> + sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));

OK, so the cap is at high watermark which sounds OK to me, although I
would expect balance_gap being considered here. Is it not used
intentionally or you just wanted to have a reasonable upper bound?

I am not objecting to that it just hit my eyes.

> + shrink_zone(zone, sc);
> +
> + reclaim_state->reclaimed_slab = 0;
> + nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> + sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> +
> + if (nr_slab == 0 && !zone_reclaimable(zone))
> + zone->all_unreclaimable = 1;
> +}
> +
> +/*
>   * For kswapd, balance_pgdat() will work across all this node's zones until
>   * they are all at high_wmark_pages(zone).
>   *
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] mm: page_alloc: Avoid marking zones full prematurely after zone_reclaim()

2013-03-20 Thread Michal Hocko

On Wed 20-03-13 18:19:57, Mel Gorman wrote:
> The following problem was reported against a distribution kernel when
> zone_reclaim was enabled but the same problem applies to the mainline
> kernel. The reproduction case was as follows
> 
> 1. Run numactl -m +0 dd if=largefile of=/dev/null
>This allocates a large number of clean pages in node 0
> 
> 2. numactl -N +0 memhog 0.5*Mg
>This start a memory-using application in node 0.
> 
> The expected behaviour is that the clean pages get reclaimed and the
> application uses node 0 for its memory. The observed behaviour was that
> the memory for the memhog application was allocated off-node since commits
> cd38b11 (mm: page allocator: initialise ZLC for first zone eligible for
> zone_reclaim) and commit 76d3fbf (mm: page allocator: reconsider zones
> for allocation after direct reclaim).
> 
> The assumption of those patches was that it was always preferable to
> allocate quickly than stall for long periods of time and they were
> meant to take care that the zone was only marked full when necessary but
> an important case was missed.
> 
> In the allocator fast path, only the low watermarks are checked. If the
> zones free pages are between the low and min watermark then allocations
> from the allocators slow path will succeed. However, zone_reclaim
> will only reclaim SWAP_CLUSTER_MAX or 1< guarantee that this will meet the low watermark causing the zone to be
> marked full prematurely.
> 
> This patch will only mark the zone full after zone_reclaim if it the min
> watermarks are checked or if page reclaim failed to make sufficient
> progress.
> 
> Reported-and-tested-by: Hedi Berriche 
> Signed-off-by: Mel Gorman 

Reviewed-by: Michal Hocko 

> ---
>  mm/page_alloc.c | 17 -
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8fcced7..adce823 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1940,9 +1940,24 @@ zonelist_scan:
>   continue;
>   default:
>   /* did we reclaim enough */
> - if (!zone_watermark_ok(zone, order, mark,
> + if (zone_watermark_ok(zone, order, mark,
>   classzone_idx, alloc_flags))
> + goto try_this_zone;
> +
> + /*
> +  * Failed to reclaim enough to meet watermark.
> +  * Only mark the zone full if checking the min
> +  * watermark or if we failed to reclaim just
> +  * 1< +  * fastpath will prematurely mark zones full
> +  * when the watermark is between the low and
> +  * min watermarks.
> +  */
> + if ((alloc_flags & ALLOC_WMARK_MIN) ||
> +     ret == ZONE_RECLAIM_SOME)
>   goto this_zone_full;
> +
> + continue;
>   }
>   }
>  

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] mm, hugetlb: include hugepages in meminfo

2013-03-20 Thread Michal Hocko

On Wed 20-03-13 11:46:12, David Rientjes wrote:
> On Wed, 20 Mar 2013, Michal Hocko wrote:
> 
> > On Tue 19-03-13 17:18:12, David Rientjes wrote:
> > > Particularly in oom conditions, it's troublesome that hugetlb memory is 
> > > not displayed.  All other meminfo that is emitted will not add up to what 
> > > is expected, and there is no artifact left in the kernel log to show that 
> > > a potentially significant amount of memory is actually allocated as 
> > > hugepages which are not available to be reclaimed.
> > 
> > Yes, I like the idea. It's bitten me already in the past.
> > 
> > The only objection I have is that you print only default_hstate. You
> > just need to wrap your for_each_node_state by for_each_hstate to do
> > that.  With that applied, feel free to add my
> > Acked-by: Michal Hocko 
> > 
> 
> I didn't do this because it isn't already exported in /proc/meminfo and 
> since we've made an effort to reduce the amount of information emitted by 
> the oom killer at oom kill time to avoid spamming the kernel log, I only 
> print the default hstate.

I do not see how this would make the output too much excessive. If
you do not want to have too many lines in the output then the hstate
loop can be pushed inside the node loop and have only per-node number
of lines same as you are proposing except you would have a complete
information.
Besides that we are talking about handful of hstates.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] mm, hugetlb: include hugepages in meminfo

2013-03-20 Thread Michal Hocko

On Wed 20-03-13 11:58:20, David Rientjes wrote:
> On Wed, 20 Mar 2013, Michal Hocko wrote:
> 
> > > I didn't do this because it isn't already exported in /proc/meminfo and 
> > > since we've made an effort to reduce the amount of information emitted by 
> > > the oom killer at oom kill time to avoid spamming the kernel log, I only 
> > > print the default hstate.
> > 
> > I do not see how this would make the output too much excessive. If
> > you do not want to have too many lines in the output then the hstate
> > loop can be pushed inside the node loop and have only per-node number
> > of lines same as you are proposing except you would have a complete
> > information.
> > Besides that we are talking about handful of hstates.
> > 
> 
> Sigh.  Because nobody is going to be mapping non-default hstates and then 
> not know about them at oom time;

If you are under control of the machine then you are right. But I was
already handling issues where getting any piece of information was
challenging and having this kind of information in the log would save me
a lot of time.

> 1GB hugepages on x86 with pse must be reserved at boot and never
> freed, for example.  I'll add them but it's just a waste of time.

If you feel it is the waste of _your_ time then I am OK to create a folow
up patch. I really do not see any reason to limit this output,
especially when it doesn't cost us much.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch v2] mm, hugetlb: include hugepages in meminfo

2013-03-20 Thread Michal Hocko

On Wed 20-03-13 12:07:29, David Rientjes wrote:
> mm, hugetlb: include hugepages in meminfo
> 
> Particularly in oom conditions, it's troublesome that hugetlb memory is 
> not displayed.  All other meminfo that is emitted will not add up to what 
> is expected, and there is no artifact left in the kernel log to show that 
> a potentially significant amount of memory is actually allocated as 
> hugepages which are not available to be reclaimed.
> 
> Booting with hugepages=8192 on the command line, this memory is now shown 
> in oom conditions.  For example, with echo m > /proc/sysrq-trigger:
> 
> Node 0 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 
> hugepages_size=2048kB
> Node 1 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 
> hugepages_size=2048kB
> Node 2 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 
> hugepages_size=2048kB
> Node 3 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 
> hugepages_size=2048kB
> 
> Acked-by: Michal Hocko 
> Signed-off-by: David Rientjes 

Thank you!

> ---
>  include/linux/hugetlb.h |  4 
>  mm/hugetlb.c| 15 +++
>  mm/page_alloc.c |  3 +++
>  3 files changed, 22 insertions(+)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -58,6 +58,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct 
> vm_area_struct *vma,
>  int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
>  void hugetlb_report_meminfo(struct seq_file *);
>  int hugetlb_report_node_meminfo(int, char *);
> +void hugetlb_show_meminfo(void);
>  unsigned long hugetlb_total_pages(void);
>  int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   unsigned long address, unsigned int flags);
> @@ -114,6 +115,9 @@ static inline void hugetlb_report_meminfo(struct seq_file 
> *m)
>  {
>  }
>  #define hugetlb_report_node_meminfo(n, buf)  0
> +static inline void hugetlb_show_meminfo(void)
> +{
> +}
>  #define follow_huge_pmd(mm, addr, pmd, write)NULL
>  #define follow_huge_pud(mm, addr, pud, write)NULL
>  #define prepare_hugepage_range(file, addr, len)  (-EINVAL)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2121,6 +2121,21 @@ int hugetlb_report_node_meminfo(int nid, char *buf)
>   nid, h->surplus_huge_pages_node[nid]);
>  }
>  
> +void hugetlb_show_meminfo(void)
> +{
> + struct hstate *h;
> + int nid;
> +
> + for_each_node_state(nid, N_MEMORY)
> + for_each_hstate(h)
> + pr_info("Node %d hugepages_total=%u hugepages_free=%u 
> hugepages_surp=%u hugepages_size=%lukB\n",
> + nid,
> + h->nr_huge_pages_node[nid],
> + h->free_huge_pages_node[nid],
> + h->surplus_huge_pages_node[nid],
> + 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
> +}
> +
>  /* Return the number pages of memory we physically have, in PAGE_SIZE units. 
> */
>  unsigned long hugetlb_total_pages(void)
>  {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -58,6 +58,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include 
> @@ -3105,6 +3106,8 @@ void show_free_areas(unsigned int filter)
>   printk("= %lukB\n", K(total));
>   }
>  
> + hugetlb_show_meminfo();
> +
>   printk("%ld total pagecache pages\n", global_page_state(NR_FILE_PAGES));
>  
>   show_swap_cache_info();
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] mm: page_alloc: Avoid marking zones full prematurely after zone_reclaim()

2013-03-21 Thread Michal Hocko

On Thu 21-03-13 10:33:07, Simon Jeons wrote:
> Hi Mel,
> On 03/21/2013 02:19 AM, Mel Gorman wrote:
> >The following problem was reported against a distribution kernel when
> >zone_reclaim was enabled but the same problem applies to the mainline
> >kernel. The reproduction case was as follows
> >
> >1. Run numactl -m +0 dd if=largefile of=/dev/null
> >This allocates a large number of clean pages in node 0
> 
> I confuse why this need allocate a large number of clean pages?

It reads from file and puts pages into the page cache. The pages are not
modified so they are clean. Output file is /dev/null so no pages are
written. dd doesn't call fadvise(POSIX_FADV_DONTNEED) on the input file
by default so pages from the file stay in the page cache
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] mm: page_alloc: Avoid marking zones full prematurely after zone_reclaim()

2013-03-21 Thread Michal Hocko

On Thu 21-03-13 16:32:03, Simon Jeons wrote:
> Hi Michal,
> On 03/21/2013 04:19 PM, Michal Hocko wrote:
> >On Thu 21-03-13 10:33:07, Simon Jeons wrote:
> >>Hi Mel,
> >>On 03/21/2013 02:19 AM, Mel Gorman wrote:
> >>>The following problem was reported against a distribution kernel when
> >>>zone_reclaim was enabled but the same problem applies to the mainline
> >>>kernel. The reproduction case was as follows
> >>>
> >>>1. Run numactl -m +0 dd if=largefile of=/dev/null
> >>>This allocates a large number of clean pages in node 0
> >>I confuse why this need allocate a large number of clean pages?
> >It reads from file and puts pages into the page cache. The pages are not
> >modified so they are clean. Output file is /dev/null so no pages are
> >written. dd doesn't call fadvise(POSIX_FADV_DONTNEED) on the input file
> >by default so pages from the file stay in the page cache
> 
> Thanks for your clarify Michal.

This is getting off-topic.

> dd will use page cache instead of direct IO?

no by default. You can use direct option. Refer to man dd for more
information.

> Where can I got dd source codes?

dd is part of coreutils: http://www.gnu.org/software/coreutils/
Please do not be afraid to use google. Most of these answers are there
already...

> One offline question, when should use page cache and when should use
> direct IO?

And this is really off-topic. The simplest answer would be. Use direct
IO when you want to prevent from caching because you are doing it
yourselvef. Please try to search the web it is full of more specific
examples.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] memcg: fix memcg_cache_name() to use cgroup_name()

2013-03-21 Thread Michal Hocko

On Thu 21-03-13 09:22:21, Li Zefan wrote:
> As cgroup supports rename, it's unsafe to dereference dentry->d_name
> without proper vfs locks. Fix this by using cgroup_name().
> 
> Signed-off-by: Li Zefan 
> ---
> 
> This patch depends on "cgroup: fix cgroup_path() vs rename() race",
> which has been queued for 3.10.
> 
> ---
>  mm/memcontrol.c | 15 +++
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 53b8201..72be5c9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3217,17 +3217,16 @@ void mem_cgroup_destroy_cache(struct kmem_cache 
> *cachep)
>  static char *memcg_cache_name(struct mem_cgroup *memcg, struct kmem_cache *s)
>  {
>   char *name;
> - struct dentry *dentry;
> +
> + name = (char *)__get_free_page(GFP_TEMPORARY);

Ouch. Can we use a static temporary buffer instead? This is called from
workqueue context so we do not have to be afraid of the deep call chain.
It is also not a hot path AFAICS.

Even GFP_ATOMIC for kasprintf would be an improvement IMO.

> + if (!name)
> + return NULL;
>  
>   rcu_read_lock();
> - dentry = rcu_dereference(memcg->css.cgroup->dentry);
> + snprintf(name, PAGE_SIZE, "%s(%d:%s)", s->name, memcg_cache_id(memcg),
> +  cgroup_name(memcg->css.cgroup));
>   rcu_read_unlock();
>  
> - BUG_ON(dentry == NULL);
> -
> - name = kasprintf(GFP_KERNEL, "%s(%d:%s)", s->name,
> -  memcg_cache_id(memcg), dentry->d_name.name);
> -
>   return name;
>  }
>  
> @@ -3247,7 +3246,7 @@ static struct kmem_cache *kmem_cache_dup(struct 
> mem_cgroup *memcg,
>   if (new)
>   new->allocflags |= __GFP_KMEMCG;
>  
> - kfree(name);
> + free_page((unsigned long)name);
>   return new;
>  }
>  
> -- 
> 1.8.0.2
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] memcg: fix memcg_cache_name() to use cgroup_name()

2013-03-21 Thread Michal Hocko

On Thu 21-03-13 10:08:49, Michal Hocko wrote:
> On Thu 21-03-13 09:22:21, Li Zefan wrote:
> > As cgroup supports rename, it's unsafe to dereference dentry->d_name
> > without proper vfs locks. Fix this by using cgroup_name().
> > 
> > Signed-off-by: Li Zefan 
> > ---
> > 
> > This patch depends on "cgroup: fix cgroup_path() vs rename() race",
> > which has been queued for 3.10.
> > 
> > ---
> >  mm/memcontrol.c | 15 +++
> >  1 file changed, 7 insertions(+), 8 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 53b8201..72be5c9 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3217,17 +3217,16 @@ void mem_cgroup_destroy_cache(struct kmem_cache 
> > *cachep)
> >  static char *memcg_cache_name(struct mem_cgroup *memcg, struct kmem_cache 
> > *s)
> >  {
> > char *name;
> > -   struct dentry *dentry;
> > +
> > +   name = (char *)__get_free_page(GFP_TEMPORARY);
> 
> Ouch. Can we use a static temporary buffer instead?

> This is called from workqueue context so we do not have to be afraid
> of the deep call chain.

Bahh, I was thinking about two things at the same time and that is how
it ends... I meant a temporary buffer on the stack. But a separate
allocation sounds even easier.

> It is also not a hot path AFAICS.
> 
> Even GFP_ATOMIC for kasprintf would be an improvement IMO.

What about the following (not even compile tested because I do not have
cgroup_name in my tree yet):
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f608546..ede0382 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3370,13 +3370,18 @@ static char *memcg_cache_name(struct mem_cgroup *memcg, 
struct kmem_cache *s)
struct dentry *dentry;
 
rcu_read_lock();
-   dentry = rcu_dereference(memcg->css.cgroup->dentry);
+   name = kasprintf(GFP_ATOMIC, "%s(%d:%s)", s->name,
+memcg_cache_id(memcg), cgroup_name(memcg->css.cgroup));
rcu_read_unlock();
 
-   BUG_ON(dentry == NULL);
-
-   name = kasprintf(GFP_KERNEL, "%s(%d:%s)", s->name,
-memcg_cache_id(memcg), dentry->d_name.name);
+   if (!name) {
+   name = kmalloc(PAGE_SIZE, GFP_KERNEL);
+   rcu_read_lock();
+   name = snprintf(name, PAGE_SIZE, "%s(%d:%s)", s->name,
+   memcg_cache_id(memcg),
+   cgroup_name(memcg->css.cgroup));
+   rcu_read_unlock();
+   }
 
return name;
 }
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] mm: speedup in __early_pfn_to_nid

2013-03-21 Thread Michal Hocko

On Thu 21-03-13 11:55:16, Ingo Molnar wrote:
> 
> * Russ Anderson  wrote:
> 
> > When booting on a large memory system, the kernel spends
> > considerable time in memmap_init_zone() setting up memory zones.
> > Analysis shows significant time spent in __early_pfn_to_nid().
> > 
> > The routine memmap_init_zone() checks each PFN to verify the
> > nid is valid.  __early_pfn_to_nid() sequentially scans the list of
> > pfn ranges to find the right range and returns the nid.  This does
> > not scale well.  On a 4 TB (single rack) system there are 308
> > memory ranges to scan.  The higher the PFN the more time spent
> > sequentially spinning through memory ranges.
> > 
> > Since memmap_init_zone() increments pfn, it will almost always be
> > looking for the same range as the previous pfn, so check that
> > range first.  If it is in the same range, return that nid.
> > If not, scan the list as before.
> > 
> > A 4 TB (single rack) UV1 system takes 512 seconds to get through
> > the zone code.  This performance optimization reduces the time
> > by 189 seconds, a 36% improvement.
> > 
> > A 2 TB (single rack) UV2 system goes from 212.7 seconds to 99.8 seconds,
> > a 112.9 second (53%) reduction.
> 
> Nice speedup!
> 
> A minor nit, in addition to Andrew's suggestion about wrapping 
> __early_pfn_to_nid():
> 
> > Index: linux/mm/page_alloc.c
> > ===
> > --- linux.orig/mm/page_alloc.c  2013-03-18 10:52:11.510988843 -0500
> > +++ linux/mm/page_alloc.c   2013-03-18 10:52:14.214931348 -0500
> > @@ -4161,10 +4161,19 @@ int __meminit __early_pfn_to_nid(unsigne
> >  {
> > unsigned long start_pfn, end_pfn;
> > int i, nid;
> > +   static unsigned long last_start_pfn, last_end_pfn;
> > +   static int last_nid;
> 
> Please move these globals out of function local scope, to make it more 
> apparent that they are not on-stack. I only noticed it in the second pass.

Wouldn't this just add more confision with other _pfn variables? (e.g.
{min,max}_low_pfn and others)

IMO the local scope is more obvious as this is and should only be used
for caching purposes.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC][PATCH 0/9] extend hugepage migration

2013-03-21 Thread Michal Hocko

On Thu 21-03-13 07:49:48, Simon Jeons wrote:
[...]
> When I hacking arch/x86/mm/hugetlbpage.c like this,
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index ae1aa71..87f34ee 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -354,14 +354,13 @@ hugetlb_get_unmapped_area(struct file *file,
> unsigned long addr,
> 
> #endif /*HAVE_ARCH_HUGETLB_UNMAPPED_AREA*/
> 
> -#ifdef CONFIG_X86_64
> static __init int setup_hugepagesz(char *opt)
> {
> unsigned long ps = memparse(opt, &opt);
> if (ps == PMD_SIZE) {
> hugetlb_add_hstate(PMD_SHIFT - PAGE_SHIFT);
> - } else if (ps == PUD_SIZE && cpu_has_gbpages) {
> - hugetlb_add_hstate(PUD_SHIFT - PAGE_SHIFT);
> + } else if (ps == PUD_SIZE) {
> + hugetlb_add_hstate(PMD_SHIFT - PAGE_SHIFT+4);
> } else {
> printk(KERN_ERR "hugepagesz: Unsupported page size %lu M\n",
> ps >> 20);
> 
> I set boot=hugepagesz=1G hugepages=10, then I got 10 32MB huge pages.
> What's the difference between these pages which I hacking and normal
> huge pages?

How is this related to the patch set?
Please _stop_ distracting discussion to unrelated topics!

Nothing personal but this is just wasting our time.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-03-21 Thread Michal Hocko

On Thu 21-03-13 09:47:13, Mel Gorman wrote:
> On Wed, Mar 20, 2013 at 05:18:47PM +0100, Michal Hocko wrote:
> > On Sun 17-03-13 13:04:07, Mel Gorman wrote:
> > [...]
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 88c5fed..4835a7a 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2593,6 +2593,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, 
> > > int order, long remaining,
> > >  }
> > >  
> > >  /*
> > > + * kswapd shrinks the zone by the number of pages required to reach
> > > + * the high watermark.
> > > + */
> > > +static void kswapd_shrink_zone(struct zone *zone,
> > > +struct scan_control *sc,
> > > +unsigned long lru_pages)
> > > +{
> > > + unsigned long nr_slab;
> > > + struct reclaim_state *reclaim_state = current->reclaim_state;
> > > + struct shrink_control shrink = {
> > > + .gfp_mask = sc->gfp_mask,
> > > + };
> > > +
> > > + /* Reclaim above the high watermark. */
> > > + sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> > 
> > OK, so the cap is at high watermark which sounds OK to me, although I
> > would expect balance_gap being considered here. Is it not used
> > intentionally or you just wanted to have a reasonable upper bound?
> > 
> 
> It's intentional. The balance_gap is taken into account before the
> decision to shrink but not afterwards. As the watermark check after
> shrinking is based on just the high watermark, I decided to have
> shrink_zone reclaim on that basis.

OK, it makes sense. Thanks both you and Rik for clarification.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

2013-03-21 Thread Michal Hocko

 min = nr[l];
> +
> + /* Normalise the scan counts so kswapd scans proportionally */
> + for_each_evictable_lru(l)
> + nr[l] -= min;
> + }

It looked scary at first glance but it makes sense. Every round (after we
have reclaimed enough) one LRU is pulled out and others are
proportionally inhibited.

> +}
> +
>  /*
>   * This is a basic per-zone page freer.  Used by both kswapd and direct 
> reclaim.
>   */
> @@ -1841,17 +1880,8 @@ static void shrink_lruvec(struct lruvec *lruvec, 
> struct scan_control *sc)
>   lruvec, sc);
>   }
>   }
> - /*
> -  * On large memory systems, scan >> priority can become
> -  * really large. This is fine for the starting priority;
> -  * we want to put equal scanning pressure on each zone.
> -  * However, if the VM has a harder time of freeing pages,
> -  * with multiple processes reclaiming pages, the total
> -  * freeing target can get unreasonably large.
> -  */
> - if (nr_reclaimed >= nr_to_reclaim &&
> - sc->priority < DEF_PRIORITY)
> - break;
> +
> + recalculate_scan_count(nr_reclaimed, nr_to_reclaim, nr);
>   }
>   blk_finish_plug(&plug);
>   sc->nr_reclaimed += nr_reclaimed;
> -- 
> 1.8.1.4
> 

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop

2013-03-21 Thread Michal Hocko

then recheck the watermarks at order-0 to
> +  * prevent kswapd reclaiming excessively. Assume that a
> +  * process requested a high-order can direct reclaim/compact.
>*/
> - if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> - break;
> - } while (--sc.priority >= 0);
> + if (order && sc.nr_reclaimed >= 2UL << order)
> + order = sc.order = 0;
>  
> -out:
> - if (!pgdat_is_balanced) {
> - cond_resched();
> + /* Check if kswapd should be suspending */
> + if (try_to_freeze() || kthread_should_stop())
> + break;
>  
> - try_to_freeze();
> + /* If no reclaim progress then increase scanning priority */
> + if (sc.nr_reclaimed - nr_reclaimed == 0)
> + raise_priority = true;
>  
>   /*
> -  * Fragmentation may mean that the system cannot be
> -  * rebalanced for high-order allocations in all zones.
> -  * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
> -  * it means the zones have been fully scanned and are still
> -  * not balanced. For high-order allocations, there is
> -  * little point trying all over again as kswapd may
> -  * infinite loop.
> -  *
> -  * Instead, recheck all watermarks at order-0 as they
> -  * are the most important. If watermarks are ok, kswapd will go
> -  * back to sleep. High-order users can still perform direct
> -  * reclaim if they wish.
> +  * Raise priority if scanning rate is too low or there was no
> +  * progress in reclaiming pages
>*/
> - if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> - order = sc.order = 0;
> -
> - goto loop_again;
> - }
> + if (raise_priority || sc.nr_reclaimed - nr_reclaimed == 0)

(sc.nr_reclaimed - nr_reclaimed == 0) is redundant because you already
set raise_priority above in that case.

> + sc.priority--;
> + } while (sc.priority >= 0 &&
> +  !pgdat_balanced(pgdat, order, *classzone_idx));
>  
>   /*
>    * If kswapd was reclaiming at a higher order, it has the option of
> @@ -2907,6 +2904,7 @@ out:
>   compact_pgdat(pgdat, order);
>   }
>  
> +out:
>   /*
>* Return the order we were reclaiming at so prepare_kswapd_sleep()
>* makes a decision on the order we were last reclaiming at. However,

It looks OK otherwise but I have to think some more as balance_pgdat is
still tricky, albeit less then it was before so this is definitely
progress.

Thanks!
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

2013-03-21 Thread Michal Hocko

On Thu 21-03-13 14:31:15, Mel Gorman wrote:
> On Thu, Mar 21, 2013 at 03:01:54PM +0100, Michal Hocko wrote:
> > On Sun 17-03-13 13:04:08, Mel Gorman wrote:
> > > Simplistically, the anon and file LRU lists are scanned proportionally
> > > depending on the value of vm.swappiness although there are other factors
> > > taken into account by get_scan_count().  The patch "mm: vmscan: Limit
> > > the number of pages kswapd reclaims" limits the number of pages kswapd
> > > reclaims but it breaks this proportional scanning and may evenly shrink
> > > anon/file LRUs regardless of vm.swappiness.
> > > 
> > > This patch preserves the proportional scanning and reclaim. It does mean
> > > that kswapd will reclaim more than requested but the number of pages will
> > > be related to the high watermark.
> > > 
> > > Signed-off-by: Mel Gorman 
> > > ---
> > >  mm/vmscan.c | 52 +---
> > >  1 file changed, 41 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 4835a7a..182ff15 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1815,6 +1815,45 @@ out:
> > >   }
> > >  }
> > >  
> > > +static void recalculate_scan_count(unsigned long nr_reclaimed,
> > > + unsigned long nr_to_reclaim,
> > > + unsigned long nr[NR_LRU_LISTS])
> > > +{
> > > + enum lru_list l;
> > > +
> > > + /*
> > > +  * For direct reclaim, reclaim the number of pages requested. Less
> > > +  * care is taken to ensure that scanning for each LRU is properly
> > > +  * proportional. This is unfortunate and is improper aging but
> > > +  * minimises the amount of time a process is stalled.
> > > +  */
> > > + if (!current_is_kswapd()) {
> > > + if (nr_reclaimed >= nr_to_reclaim) {
> > > + for_each_evictable_lru(l)
> > > + nr[l] = 0;
> > > + }
> > > + return;
> > 
> > Heh, this is nicely cryptically said what could be done in shrink_lruvec
> > as
> > if (!current_is_kswapd()) {
> > if (nr_reclaimed >= nr_to_reclaim)
> > break;
> > }
> > 
> 
> Pretty much. At one point during development, this function was more
> complex and it evolved into this without me rechecking if splitting it
> out still made sense.
> 
> > Besides that this is not memcg aware which I think it would break
> > targeted reclaim which is kind of direct reclaim but it still would be
> > good to stay proportional because it starts with DEF_PRIORITY.
> > 
> 
> This does break memcg because it's a special sort of direct reclaim.
> 
> > I would suggest moving this back to shrink_lruvec and update the test as
> > follows:
> 
> I also noticed that we check whether the scan counts need to be
> normalised more than once

I didn't mind this because it "disqualified" at least one LRU every
round which sounds reasonable to me because all LRUs would be scanned
proportionally. E.g. if swappiness is 0 then nr[anon] would be 0 and
then the active/inactive aging would break? Or am I missing something?

> and this reshuffling checks nr_reclaimed twice. How about this?
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 182ff15..320a2f4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1815,45 +1815,6 @@ out:
>   }
>  }
>  
> -static void recalculate_scan_count(unsigned long nr_reclaimed,
> - unsigned long nr_to_reclaim,
> - unsigned long nr[NR_LRU_LISTS])
> -{
> - enum lru_list l;
> -
> - /*
> -  * For direct reclaim, reclaim the number of pages requested. Less
> -  * care is taken to ensure that scanning for each LRU is properly
> -  * proportional. This is unfortunate and is improper aging but
> -  * minimises the amount of time a process is stalled.
> -  */
> - if (!current_is_kswapd()) {
> - if (nr_reclaimed >= nr_to_reclaim) {
> - for_each_evictable_lru(l)
> - nr[l] = 0;
> - }
> - return;
> - }
> -
> - /*
> -  * For kswapd, reclaim at least the number of pages requested.
> -  * However, ensure that LRUs shrink by the proportion requested
> -  * by get_scan_count() so vm.swappiness is obeyed.
> -  */
> - if (nr_reclaimed >= nr_to_reclaim) {
> - unsig

Re: [PATCH 04/10] mm: vmscan: Decide whether to compact the pgdat based on reclaim progress

2013-03-21 Thread Michal Hocko

 and
> -  * that the congestion flags are cleared. The congestion flag must
> -  * be cleared as kswapd is the only mechanism that clears the flag
> -  * and it is potentially going to sleep here.
> -  */
> - if (order) {
> - int zones_need_compaction = 1;
> -
> - for (i = 0; i <= end_zone; i++) {
> - struct zone *zone = pgdat->node_zones + i;
> -
> - if (!populated_zone(zone))
> - continue;
> -
> - /* Check if the memory needs to be defragmented. */
> - if (zone_watermark_ok(zone, order,
> -     low_wmark_pages(zone), *classzone_idx, 0))
> - zones_need_compaction = 0;
> - }
> -
> - if (zones_need_compaction)
> - compact_pgdat(pgdat, order);
> - }
> -
>  out:
>   /*
>* Return the order we were reclaiming at so prepare_kswapd_sleep()
> -- 
> 1.8.1.4
> 

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop

2013-03-21 Thread Michal Hocko

On Thu 21-03-13 15:26:02, Mel Gorman wrote:
> On Thu, Mar 21, 2013 at 03:54:58PM +0100, Michal Hocko wrote:
> > > Signed-off-by: Mel Gorman 
> > > ---
> > >  mm/vmscan.c | 86 
> > > ++---
> > >  1 file changed, 42 insertions(+), 44 deletions(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 182ff15..279d0c2 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2625,8 +2625,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, 
> > > int order, long remaining,
> > >  /*
> > >   * kswapd shrinks the zone by the number of pages required to reach
> > >   * the high watermark.
> > > + *
> > > + * Returns true if kswapd scanned at least the requested number of
> > > + * pages to reclaim.
> > 
> > Maybe move the comment about not rising priority in such case here to be
> > clear what the return value means. Without that the return value could
> > be misinterpreted that kswapd_shrink_zone succeeded in shrinking might
> > be not true.
> 
> I moved the comment.

Thanks

> > Or maybe even better, leave the void there and add bool *raise_priority
> > argument here so the decision and raise_priority are at the same place.
> > 
> 
> The priority is raised if kswapd failed to reclaim from any of the unbalanced
> zone. If raise_priority is moved inside kswapd_shrink_zone then it can
> only take one zone into account.

Right you are. I am blind.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 05/10] mm: vmscan: Do not allow kswapd to scan at maximum priority

2013-03-21 Thread Michal Hocko

On Sun 17-03-13 13:04:11, Mel Gorman wrote:
> Page reclaim at priority 0 will scan the entire LRU as priority 0 is
> considered to be a near OOM condition. Kswapd can reach priority 0 quite
> easily if it is encountering a large number of pages it cannot reclaim
> such as pages under writeback. When this happens, kswapd reclaims very
> aggressively even though there may be no real risk of allocation failure
> or OOM.
> 
> This patch prevents kswapd reaching priority 0 and trying to reclaim
> the world. Direct reclaimers will still reach priority 0 in the event
> of an OOM situation.

OK, it should work. raise_priority should prevent from pointless
lowerinng the priority and if there is really nothing to reclaim then
relying on the direct reclaim is probably a better idea.

> Signed-off-by: Mel Gorman 

Reviewed-by: Michal Hocko 

> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7513bd1..af3bb6f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2891,7 +2891,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, 
> int order,
>*/
>   if (raise_priority || !this_reclaimed)
>   sc.priority--;
> - } while (sc.priority >= 0 &&
> + } while (sc.priority >= 1 &&
>        !pgdat_balanced(pgdat, order, *classzone_idx));
>  
>  out:
> -- 
> 1.8.1.4
> 

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 04/10] mm: vmscan: Decide whether to compact the pgdat based on reclaim progress

2013-03-21 Thread Michal Hocko

On Thu 21-03-13 15:47:31, Mel Gorman wrote:
> On Thu, Mar 21, 2013 at 04:32:31PM +0100, Michal Hocko wrote:
> > On Sun 17-03-13 13:04:10, Mel Gorman wrote:
> > > In the past, kswapd makes a decision on whether to compact memory after 
> > > the
> > > pgdat was considered balanced. This more or less worked but it is late to
> > > make such a decision and does not fit well now that kswapd makes a 
> > > decision
> > > whether to exit the zone scanning loop depending on reclaim progress.
> > > 
> > > This patch will compact a pgdat if at least  the requested number of pages
> > > were reclaimed from unbalanced zones for a given priority. If any zone is
> > > currently balanced, kswapd will not call compaction as it is expected the
> > > necessary pages are already available.
> > > 
> > > Signed-off-by: Mel Gorman 
> > > ---
> > >  mm/vmscan.c | 52 +---
> > >  1 file changed, 21 insertions(+), 31 deletions(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 279d0c2..7513bd1 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2694,8 +2694,11 @@ static unsigned long balance_pgdat(pg_data_t 
> > > *pgdat, int order,
> > >  
> > >   do {
> > >   unsigned long lru_pages = 0;
> > > + unsigned long nr_to_reclaim = 0;
> > >   unsigned long nr_reclaimed = sc.nr_reclaimed;
> > > + unsigned long this_reclaimed;
> > >   bool raise_priority = true;
> > > + bool pgdat_needs_compaction = true;
> > 
> > I am confused. We don't want to compact for order == 0, do we?
> > 
> 
> No, but an order check is made later which I felt it was clearer.  You are
> the second person to bring it up so I'll base the initialisation on order.

Dohh. Yes compact_pgdat is called only if order != 0. I was so focused
on pgdat_needs_compaction that I've missed it. Both checks use (order &&
pgdat_needs_compaction) so initialization based on order would be
probably better for readability.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 07/10 -v2r1] mm: vmscan: Block kswapd if it is encountering pages under writeback

2013-03-21 Thread Michal Hocko

Here is what you have in your mm-vmscan-limit-reclaim-v2r1 branch:
> commit 0dae7d4be56e6a7fe3f128284679f5efc0cc2383
> Author: Mel Gorman 
> Date:   Tue Mar 12 10:33:31 2013 +
> 
> mm: vmscan: Block kswapd if it is encountering pages under writeback
> 
> Historically, kswapd used to congestion_wait() at higher priorities if it
> was not making forward progress. This made no sense as the failure to make
> progress could be completely independent of IO. It was later replaced by
> wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't
> wait on congested zones in balance_pgdat()) as it was duplicating logic
> in shrink_inactive_list().
> 
> This is problematic. If kswapd encounters many pages under writeback and
> it continues to scan until it reaches the high watermark then it will
> quickly skip over the pages under writeback and reclaim clean young
> pages or push applications out to swap.
> 
> The use of wait_iff_congested() is not suited to kswapd as it will only
> stall if the underlying BDI is really congested or a direct reclaimer was
> unable to write to the underlying BDI. kswapd bypasses the BDI congestion
> as it sets PF_SWAPWRITE but even if this was taken into account then it
> would cause direct reclaimers to stall on writeback which is not 
> desirable.
> 
> This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
> encountering too many pages under writeback. If this flag is set and
> kswapd encounters a PageReclaim page under writeback then it'll assume
> that the LRU lists are being recycled too quickly before IO can complete
> and block waiting for some IO to complete.
> 
> Signed-off-by: Mel Gorman 

Looks reasonable to me.
Reviewed-by: Michal Hocko 

> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index afedd1d..dd0d266 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -499,6 +499,9 @@ typedef enum {
>* many dirty file pages at the tail
>* of the LRU.
>*/
> + ZONE_WRITEBACK, /* reclaim scanning has recently found
> +  * many pages under writeback
> +  */
>  } zone_flags_t;
>  
>  static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
> @@ -526,6 +529,11 @@ static inline int zone_is_reclaim_dirty(const struct 
> zone *zone)
>   return test_bit(ZONE_TAIL_LRU_DIRTY, &zone->flags);
>  }
>  
> +static inline int zone_is_reclaim_writeback(const struct zone *zone)
> +{
> + return test_bit(ZONE_WRITEBACK, &zone->flags);
> +}
> +
>  static inline int zone_is_reclaim_locked(const struct zone *zone)
>  {
>   return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a8b94fa..e87de90 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -723,25 +723,51 @@ static unsigned long shrink_page_list(struct list_head 
> *page_list,
>   may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
>   (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
>  
> + /*
> +  * If a page at the tail of the LRU is under writeback, there
> +  * are three cases to consider.
> +  *
> +  * 1) If reclaim is encountering an excessive number of pages
> +  *under writeback and this page is both under writeback and
> +  *PageReclaim then it indicates that pages are being queued
> +  *for IO but are being recycled through the LRU before the
> +  *IO can complete. In this case, wait on the IO to complete
> +  *and then clear the ZONE_WRITEBACK flag to recheck if the
> +  *condition exists.
> +  *
> +  * 2) Global reclaim encounters a page, memcg encounters a
> +  *page that is not marked for immediate reclaim or
> +  *the caller does not have __GFP_IO. In this case mark
> +  *the page for immediate reclaim and continue scanning.
> +  *
> +  *__GFP_IO is checked  because a loop driver thread might
> +  *enter reclaim, and deadlock if it waits on a page for
> +  *which it is needed to do the write (loop masks off
> +  *__GFP_IO|__GFP_FS for this reason); but more thought
> +  *would probably show mo

Re: [PATCH] memcg: stop warning on memcg_propagate_kmem

2013-02-03 Thread Michal Hocko

On Sun 03-02-13 20:29:01, Hugh Dickins wrote:
> Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
> I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
> "mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
> used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.
> 
> Signed-off-by: Hugh Dickins 

Acked-by: Michal Hocko 

Hmm, if you are not too tired then moving the function downwards to
where it is called (memcg_init_kmem) will reduce the number of ifdefs.
But this can wait for a bigger clean up which is getting due:
git grep "def.*CONFIG_MEMCG_KMEM" mm/memcontrol.c | wc -l
12

Thanks
> ---
> 
>  mm/memcontrol.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- 3.8-rc6/mm/memcontrol.c   2012-12-22 09:43:27.628015582 -0800
> +++ linux/mm/memcontrol.c 2013-02-02 16:56:06.188325771 -0800
> @@ -4969,6 +4969,7 @@ out:
>   return ret;
>  }
>  
> +#ifdef CONFIG_MEMCG_KMEM
>  static int memcg_propagate_kmem(struct mem_cgroup *memcg)
>  {
>   int ret = 0;
> @@ -4977,7 +4978,6 @@ static int memcg_propagate_kmem(struct m
>   goto out;
>  
>   memcg->kmem_account_flags = parent->kmem_account_flags;
> -#ifdef CONFIG_MEMCG_KMEM
>   /*
>* When that happen, we need to disable the static branch only on those
>* memcgs that enabled it. To achieve this, we would be forced to
> @@ -5003,10 +5003,10 @@ static int memcg_propagate_kmem(struct m
>   mutex_lock(&set_limit_mutex);
>   ret = memcg_update_cache_sizes(memcg);
>   mutex_unlock(&set_limit_mutex);
> -#endif
>  out:
>   return ret;
>  }
> +#endif /* CONFIG_MEMCG_KMEM */
>  
>  /*
>   * The user of this function is...

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] memcg: stop warning on memcg_propagate_kmem

2013-02-04 Thread Michal Hocko

On Mon 04-02-13 12:04:06, Glauber Costa wrote:
> On 02/04/2013 11:57 AM, Michal Hocko wrote:
> > On Sun 03-02-13 20:29:01, Hugh Dickins wrote:
> >> Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
> >> I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
> >> "mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
> >> used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.
> >>
> >> Signed-off-by: Hugh Dickins 
> > 
> > Acked-by: Michal Hocko 
> > 
> > Hmm, if you are not too tired then moving the function downwards to
> > where it is called (memcg_init_kmem) will reduce the number of ifdefs.
> > But this can wait for a bigger clean up which is getting due:
> > git grep "def.*CONFIG_MEMCG_KMEM" mm/memcontrol.c | wc -l
> > 12
> > 
> 
> The problem is that I was usually keeping things in clearly separated
> blocks, like this :
> 
> #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
> struct tcp_memcontrol tcp_mem;
> #endif
> #if defined(CONFIG_MEMCG_KMEM)
> /* analogous to slab_common's slab_caches list. per-memcg */
> struct list_head memcg_slab_caches;
> /* Not a spinlock, we can take a lot of time walking the list */
> struct mutex slab_caches_mutex;
> /* Index in the kmem_cache->memcg_params->memcg_caches array */
> int kmemcg_id;
> #endif
> 
> If it would be preferable to everybody, this could be easily rewritten as:
> 
> #if defined(CONFIG_MEMCG_KMEM)
> #if defined(CONFIG_INET)
> struct tcp_memcontrol tcp_mem;
> #endif
> /* analogous to slab_common's slab_caches list. per-memcg */
> struct list_head memcg_slab_caches;
> /* Not a spinlock, we can take a lot of time walking the list */
> struct mutex slab_caches_mutex;
> /* Index in the kmem_cache->memcg_params->memcg_caches array */
> int kmemcg_id;
> #endif

I was rather interested in reducing CONFIG_KMEM block, the above example
doesn't bother me that much.
 
> This would allow us to collapse some blocks a bit down as well.
> 
> It doesn't bother me *that* much, though.

Yes and a quick attempt shows that a clean up would bring a lot of
churn.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-05 Thread Michal Hocko

On Fri 25-01-13 17:31:30, Michal Hocko wrote:
> On Fri 25-01-13 16:07:23, azurIt wrote:
> > Any news? Thnx!
> 
> Sorry, but I didn't get to this one yet.

Sorry, to get back to this that late but I was busy as hell since the
beginning of the year.

Has the issue repeated since then?

You said you didn't apply other than the above mentioned patch. Could
you apply also debugging part of the patches I have sent?
In case you don't have it handy then it should be this one:
---
>From 1623420d964e7e8bc88e2a6239563052df891bf7 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 3 Dec 2012 16:16:01 +0100
Subject: [PATCH] more debugging

---
 mm/huge_memory.c |6 +++---
 mm/memcontrol.c  |1 +
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 470cbb4..01a11f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -671,7 +671,7 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif
 
-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct 
*vma,
+noinline int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
   unsigned long address, pmd_t *pmd,
   unsigned int flags)
 {
@@ -790,7 +790,7 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
return pgtable;
 }
 
-static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+static noinline int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address,
pmd_t *pmd, pmd_t orig_pmd,
@@ -883,7 +883,7 @@ out_free_pages:
goto out;
 }
 
-int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct 
*vma,
unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
 {
int ret = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..1986c65 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,6 +2397,7 @@ done:
return 0;
 nomem:
*ptr = NULL;
+   __WARN_printf("gfp_mask:%u nr_pages:%u oom:%d ret:%d\n", gfp_mask, 
nr_pages, oom, ret);
return -ENOMEM;
 bypass:
    *ptr = NULL;
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-05 Thread Michal Hocko

On Tue 05-02-13 15:49:47, azurIt wrote:
[...]
> Just to be sure - am i supposed to apply this two patches?
> http://watchdog.sk/lkml/patches/

5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I
mentioned in a follow up email. Here is the full patch:
---
>From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents other task to
terminate because it is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncate it).

Process A
[] do_truncate+0x58/0xa0  # takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0x

Process B
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0   # takes 
->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0x

This is not a hard deadlock though because administrator can still
intervene and increase the limit on the group which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_ro_page_cache_locked). mem_cgroup_cache_charge_no_oom helper
function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which
then tells mem_cgroup_charge_common that OOM is not allowed for the
charge. No OOM from this path, except for fixing the bug, also make some
sense as we really do not want to cause an OOM because of a page cache
usage.
As a possibly visible result add_to_page_cache_lru might fail more often
with ENOMEM but this is to be expected if the limit is set and it is
preferable than OOM killer IMO.

__GFP_NORETRY is abused for this memcg specific flag because no user
accounted allocation use this flag except for THP which have memcg oom
disabled already.

Reported-by: azurIt 
Signed-off-by: Michal Hocko 
---
 include/linux/gfp.h|3 +++
 include/linux/memcontrol.h |   13 +
 mm/filemap.c   |8 +++-
 mm/memcontrol.c|   10 ++
 4 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..806fb54 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -146,6 +146,9 @@ struct vm_area_struct;
 /* 4GB DMA on some platforms */
 #define GFP_DMA32  __GFP_DMA32

+/* memcg oom killer is not allowed */
+#define GFP_MEMCG_NO_OOM   __GFP_NORETRY
+
 /* Convert GFP flags to their corresponding migrate type */
 static inline int allocflags_to_migratetype(gfp_t gfp_flags)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 81572af..bf0e575 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct 
mem_cgroup *ptr);

 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
+
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+   struct mm_struct *mm, gfp_t gfp_mask)
+{
+   return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM);
+}
+
 extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
@@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page 
*page,
return 0;
 }

+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+   struct mm_struct *mm, gfp_t gfp_mask)
+{
+   return 0;
+}
+
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 556858c..ef182a9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct 
address_space *mapping,

[PATCH 1/3] memcg: move mem_cgroup_soft_limit_tree_init to mem_cgroup_init

2013-02-05 Thread Michal Hocko

Per-node-zone soft limit tree is currently initialized when the root
cgroup is created which is OK but it pointlessly pollutes memcg
allocation code with something that can be called when the memcg
subsystem is initialized by mem_cgroup_init along with other controller
specific parts.

While we are at it let's make mem_cgroup_soft_limit_tree_init void
because it doesn't make much sense to report memory failure because if
we fail to allocate memory that early during the boot then we are
screwed anyway (this saves some code).

Signed-off-by: Michal Hocko 
---
 mm/memcontrol.c |   19 +++
 1 file changed, 3 insertions(+), 16 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2382fe9..b0d3339 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6228,7 +6228,7 @@ struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup 
*memcg)
 }
 EXPORT_SYMBOL(parent_mem_cgroup);
 
-static int mem_cgroup_soft_limit_tree_init(void)
+static void __init mem_cgroup_soft_limit_tree_init(void)
 {
struct mem_cgroup_tree_per_node *rtpn;
struct mem_cgroup_tree_per_zone *rtpz;
@@ -6239,8 +6239,7 @@ static int mem_cgroup_soft_limit_tree_init(void)
if (!node_state(node, N_NORMAL_MEMORY))
tmp = -1;
rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
-   if (!rtpn)
-   goto err_cleanup;
+   BUG_ON(!rtpn);
 
soft_limit_tree.rb_tree_per_node[node] = rtpn;
 
@@ -6250,17 +6249,6 @@ static int mem_cgroup_soft_limit_tree_init(void)
spin_lock_init(&rtpz->lock);
}
}
-   return 0;
-
-err_cleanup:
-   for_each_node(node) {
-   if (!soft_limit_tree.rb_tree_per_node[node])
-   break;
-   kfree(soft_limit_tree.rb_tree_per_node[node]);
-   soft_limit_tree.rb_tree_per_node[node] = NULL;
-   }
-   return 1;
-
 }
 
 static struct cgroup_subsys_state * __ref
@@ -6282,8 +6270,6 @@ mem_cgroup_css_alloc(struct cgroup *cont)
if (cont->parent == NULL) {
int cpu;
 
-   if (mem_cgroup_soft_limit_tree_init())
-   goto free_out;
root_mem_cgroup = memcg;
for_each_possible_cpu(cpu) {
struct memcg_stock_pcp *stock =
@@ -7027,6 +7013,7 @@ static int __init mem_cgroup_init(void)
 {
hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
enable_swap_cgroup();
+   mem_cgroup_soft_limit_tree_init();
return 0;
 }
 subsys_initcall(mem_cgroup_init);
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 3/3] memcg: cleanup mem_cgroup_init comment

2013-02-05 Thread Michal Hocko

We should encourage all memcg controller initialization independent on
a specific mem_cgroup to be done here rather than exploit css_alloc
callback and assume that nothing happens before root cgroup is created.

Signed-off-by: Michal Hocko 
---
 mm/memcontrol.c |   10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e9c1690..b97008c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7007,10 +7007,12 @@ static void __init enable_swap_cgroup(void)
 #endif
 
 /*
- * The rest of init is performed during ->css_alloc() for root css which
- * happens before initcalls.  hotcpu_notifier() can't be done together as
- * it would introduce circular locking by adding cgroup_lock -> cpu hotplug
- * dependency.  Do it from a subsys_initcall().
+ * subsys_initcall() for memory controller.
+ *
+ * Some parts like hotcpu_notifier() have to be initialized from this context
+ * because of lock dependencies (cgroup_lock -> cpu hotplug) but basically
+ * everything that doesn't depend on a specific mem_cgroup structure should
+ * be initialized from here.
  */
 static int __init mem_cgroup_init(void)
 {
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 0/3] cleanup memcg controller initialization

2013-02-05 Thread Michal Hocko

Hi,
this is just a small cleanup I promised some time ago[1]. It just moves
all memcg controller initialization code independant on mem_cgroup into
subsystem initialization code.

There are no functional changes.

Diffstat even says that we have saved some lines.
 mm/memcontrol.c |   49 +
 1 file changed, 21 insertions(+), 28 deletions(-)

Shortlog says:
Michal Hocko (3):
  memcg: move mem_cgroup_soft_limit_tree_init to mem_cgroup_init
  memcg: move memcg_stock initialization to mem_cgroup_init
  memcg: cleanup mem_cgroup_init comment
---
[1] https://lkml.org/lkml/2012/12/18/256
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/3] memcg: move memcg_stock initialization to mem_cgroup_init

2013-02-05 Thread Michal Hocko

memcg_stock are currently initialized during the root cgroup allocation
which is OK but it pointlessly pollutes memcg allocation code with
something that can be called when the memcg subsystem is initialized by
mem_cgroup_init along with other controller specific parts.

This patch wrappes the current memcg_stock initialization code into a
helper calls it from the controller subsystem initialization code.

Signed-off-by: Michal Hocko 
---
 mm/memcontrol.c |   20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b0d3339..e9c1690 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2362,6 +2362,17 @@ static void drain_local_stock(struct work_struct *dummy)
clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 }
 
+static void __init memcg_stock_init(void)
+{
+   int cpu;
+
+   for_each_possible_cpu(cpu) {
+   struct memcg_stock_pcp *stock =
+   &per_cpu(memcg_stock, cpu);
+   INIT_WORK(&stock->work, drain_local_stock);
+   }
+}
+
 /*
  * Cache charges(val) which is from res_counter, to local per_cpu area.
  * This will be consumed by consume_stock() function, later.
@@ -6268,15 +6279,7 @@ mem_cgroup_css_alloc(struct cgroup *cont)
 
/* root ? */
if (cont->parent == NULL) {
-   int cpu;
-
root_mem_cgroup = memcg;
-   for_each_possible_cpu(cpu) {
-   struct memcg_stock_pcp *stock =
-   &per_cpu(memcg_stock, cpu);
-   INIT_WORK(&stock->work, drain_local_stock);
-   }
-
res_counter_init(&memcg->res, NULL);
res_counter_init(&memcg->memsw, NULL);
res_counter_init(&memcg->kmem, NULL);
@@ -7014,6 +7017,7 @@ static int __init mem_cgroup_init(void)
hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
enable_swap_cgroup();
mem_cgroup_soft_limit_tree_init();
+   memcg_stock_init();
return 0;
 }
 subsys_initcall(mem_cgroup_init);
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-05 Thread Michal Hocko

On Tue 05-02-13 15:49:47, azurIt wrote:
[...]
> I have another old problem which is maybe also related to this. I
> wasn't connecting it with this before but now i'm not sure. Two of our
> servers, which are affected by this cgroup problem, are also randomly
> freezing completely (few times per month). These are the symptoms:
>  - servers are answering to ping
>  - it is possible to connect via SSH but connection is freezed after
>  sending the password
>  - it is possible to login via console but it is freezed after typeing
>  the login
> These symptoms are very similar to HDD problems or HDD overload (but
> there is no overload for sure). The only way to fix it is, probably,
> hard rebooting the server (didn't find any other way). What do you
> think? Can this be related?

This is hard to tell without further information.

> Maybe HDDs are locked in the similar way the cgroups are - we already
> found out that cgroup freezeing is related also to HDD activity. Maybe
> there is a little chance that the whole HDD subsystem ends in
> deadlock?

"HDD subsystem" whatever that means cannot be blocked by memcg being
stuck. Certain access to soem files might be an issue because those
could have locks held but I do not see other relations.

I would start by checking the HW, trying to focus on reducing elements
that could contribute - aka try to nail down to the minimum set which
reproduces the issue. I cannot help you much with that I am afraid.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-05 Thread Michal Hocko

On Tue 05-02-13 08:48:23, Greg Thelen wrote:
> On Tue, Feb 05 2013, Michal Hocko wrote:
> 
> > On Tue 05-02-13 15:49:47, azurIt wrote:
> > [...]
> >> Just to be sure - am i supposed to apply this two patches?
> >> http://watchdog.sk/lkml/patches/
> >
> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I
> > mentioned in a follow up email. Here is the full patch:
> > ---
> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko 
> > Date: Mon, 26 Nov 2012 11:47:57 +0100
> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> >
> > memcg oom killer might deadlock if the process which falls down to
> > mem_cgroup_handle_oom holds a lock which prevents other task to
> > terminate because it is blocked on the very same lock.
> > This can happen when a write system call needs to allocate a page but
> > the allocation hits the memcg hard limit and there is nothing to reclaim
> > (e.g. there is no swap or swap limit is hit as well and all cache pages
> > have been reclaimed already) and the process selected by memcg OOM
> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
> >
> > Process A
> > [] do_truncate+0x58/0xa0  # takes i_mutex
> > [] do_last+0x250/0xa30
> > [] path_openat+0xd7/0x440
> > [] do_filp_open+0x49/0xa0
> > [] do_sys_open+0x106/0x240
> > [] sys_open+0x20/0x30
> > [] system_call_fastpath+0x18/0x1d
> > [] 0x
> >
> > Process B
> > [] mem_cgroup_handle_oom+0x241/0x3b0
> > [] T.1146+0x5ab/0x5c0
> > [] mem_cgroup_cache_charge+0xbe/0xe0
> > [] add_to_page_cache_locked+0x4c/0x140
> > [] add_to_page_cache_lru+0x22/0x50
> > [] grab_cache_page_write_begin+0x8b/0xe0
> > [] ext3_write_begin+0x88/0x270
> > [] generic_file_buffered_write+0x116/0x290
> > [] __generic_file_aio_write+0x27c/0x480
> > [] generic_file_aio_write+0x76/0xf0   # takes 
> > ->i_mutex
> > [] do_sync_write+0xea/0x130
> > [] vfs_write+0xf3/0x1f0
> > [] sys_write+0x51/0x90
> > [] system_call_fastpath+0x18/0x1d
> > [] 0x
> 
> It looks like grab_cache_page_write_begin() passes __GFP_FS into
> __page_cache_alloc() and mem_cgroup_cache_charge().  Which makes me
> think that this deadlock is also possible in the page allocator even
> before getting to add_to_page_cache_lru.  no?

I am not that familiar with VFS but i_mutex is a high level lock AFAIR
and it shouldn't be called from the pageout path so __page_cache_alloc
should be safe.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-05 Thread Michal Hocko

On Tue 05-02-13 10:09:57, Greg Thelen wrote:
> On Tue, Feb 05 2013, Michal Hocko wrote:
> 
> > On Tue 05-02-13 08:48:23, Greg Thelen wrote:
> >> On Tue, Feb 05 2013, Michal Hocko wrote:
> >> 
> >> > On Tue 05-02-13 15:49:47, azurIt wrote:
> >> > [...]
> >> >> Just to be sure - am i supposed to apply this two patches?
> >> >> http://watchdog.sk/lkml/patches/
> >> >
> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I
> >> > mentioned in a follow up email. Here is the full patch:
> >> > ---
> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
> >> > From: Michal Hocko 
> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100
> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> >> >
> >> > memcg oom killer might deadlock if the process which falls down to
> >> > mem_cgroup_handle_oom holds a lock which prevents other task to
> >> > terminate because it is blocked on the very same lock.
> >> > This can happen when a write system call needs to allocate a page but
> >> > the allocation hits the memcg hard limit and there is nothing to reclaim
> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages
> >> > have been reclaimed already) and the process selected by memcg OOM
> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
> >> >
> >> > Process A
> >> > [] do_truncate+0x58/0xa0   # takes i_mutex
> >> > [] do_last+0x250/0xa30
> >> > [] path_openat+0xd7/0x440
> >> > [] do_filp_open+0x49/0xa0
> >> > [] do_sys_open+0x106/0x240
> >> > [] sys_open+0x20/0x30
> >> > [] system_call_fastpath+0x18/0x1d
> >> > [] 0x
> >> >
> >> > Process B
> >> > [] mem_cgroup_handle_oom+0x241/0x3b0
> >> > [] T.1146+0x5ab/0x5c0
> >> > [] mem_cgroup_cache_charge+0xbe/0xe0
> >> > [] add_to_page_cache_locked+0x4c/0x140
> >> > [] add_to_page_cache_lru+0x22/0x50
> >> > [] grab_cache_page_write_begin+0x8b/0xe0
> >> > [] ext3_write_begin+0x88/0x270
> >> > [] generic_file_buffered_write+0x116/0x290
> >> > [] __generic_file_aio_write+0x27c/0x480
> >> > [] generic_file_aio_write+0x76/0xf0   # takes 
> >> > ->i_mutex
> >> > [] do_sync_write+0xea/0x130
> >> > [] vfs_write+0xf3/0x1f0
> >> > [] sys_write+0x51/0x90
> >> > [] system_call_fastpath+0x18/0x1d
> >> > [] 0x
> >> 
> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into
> >> __page_cache_alloc() and mem_cgroup_cache_charge().  Which makes me
> >> think that this deadlock is also possible in the page allocator even
> >> before getting to add_to_page_cache_lru.  no?
> >
> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR
> > and it shouldn't be called from the pageout path so __page_cache_alloc
> > should be safe.
> 
> I wasn't clear, sorry.  My concern is not that pageout() grabs i_mutex.
> My concern is that __page_cache_alloc() will invoke the oom killer and
> select a victim which wants i_mutex.  This victim will deadlock because
> the oom killer caller already holds i_mutex.  

That would be true for the memcg oom because that one is blocking but
the global oom just puts the allocator into sleep for a while and then
the allocator should back off eventually (unless this is NOFAIL
allocation). I would need to look closer whether this is really the case
- I haven't seen that allocator code path for a while...

> The wild accusation I am making is that anyone who invokes the oom
> killer and waits on the victim to die is essentially grabbing all of
> the locks that any of the oom killer victims may grab (e.g. i_mutex).

True.

> To avoid deadlock the oom killer can only be called is while holding
> no locks that the oom victim demands.  I think some locks are grabbed
> in a way that allows the lock request to fail if the task has a fatal
> signal pending, so they are safe.  But any locks acquisitions that
> cannot fail (e.g. mutex_lock) will deadlock with the oom killing
> process.  So the oom killing process cannot hold any such locks which
> the victim will attempt to grab.  Hopefully I'm missing something.

Agreed.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] da9030_battery: include notifier.h

2013-02-06 Thread Michal Hocko

randconfig complains about:
drivers/power/da9030_battery.c:113: error: field ‘nb’ has incomplete type

because there is no direct include for notifier.h which defines
struct notifier_block.

Signed-off-by: Michal Hocko 
---
 drivers/power/da9030_battery.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/power/da9030_battery.c b/drivers/power/da9030_battery.c
index 94762e6..e8c5a39 100644
--- a/drivers/power/da9030_battery.c
+++ b/drivers/power/da9030_battery.c
@@ -22,6 +22,7 @@
 
 #include 
 #include 
+#include 
 
 #define DA9030_FAULT_LOG   0x0a
 #define DA9030_FAULT_LOG_OVER_TEMP (1 << 7)
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-06 Thread Michal Hocko

On Wed 06-02-13 02:17:21, azurIt wrote:
> >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I
> >mentioned in a follow up email. Here is the full patch:
> 
> 
> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
> http://www.watchdog.sk/lkml/oom_mysqld6

[...]
WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
Hardware name: S5000VSA
gfp_mask:4304 nr_pages:1 oom:0 ret:2
Pid: 3545, comm: apache2 Tainted: GW3.2.37-grsec #1
Call Trace:
 [] warn_slowpath_common+0x7a/0xb0
 [] warn_slowpath_fmt+0x46/0x50
 [] ? mem_cgroup_margin+0x73/0xa0
 [] T.1149+0x2d9/0x610
 [] ? blk_finish_plug+0x18/0x50
 [] mem_cgroup_cache_charge+0xc4/0xf0
 [] add_to_page_cache_locked+0x4f/0x140
 [] add_to_page_cache_lru+0x22/0x50
 [] filemap_fault+0x252/0x4f0
 [] __do_fault+0x78/0x5a0
 [] handle_pte_fault+0x84/0x940
 [] ? vma_prio_tree_insert+0x30/0x50
 [] ? vma_link+0x88/0xe0
 [] handle_mm_fault+0x138/0x260
 [] do_page_fault+0x13d/0x460
 [] ? do_mmap_pgoff+0x3dc/0x430
 [] page_fault+0x1f/0x30
---[ end trace 8817670349022007 ]---
apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
apache2 cpuset=uid mems_allowed=0
Pid: 3545, comm: apache2 Tainted: GW3.2.37-grsec #1
Call Trace:
 [] dump_header+0x7e/0x1e0
 [] ? find_lock_task_mm+0x2f/0x70
 [] oom_kill_process+0x85/0x2a0
 [] out_of_memory+0xe5/0x200
 [] pagefault_out_of_memory+0xbd/0x110
 [] mm_fault_error+0xb6/0x1a0
 [] do_page_fault+0x3ee/0x460
 [] ? do_mmap_pgoff+0x3dc/0x430
 [] page_fault+0x1f/0x30

The first trace comes from the debugging WARN and it clearly points to
a file fault path. __do_fault pre-charges a page in case we need to
do CoW (copy-on-write) for the returned page. This one falls back to
memcg OOM and never returns ENOMEM as I have mentioned earlier. 
However, the fs fault handler (filemap_fault here) can fallback to
page_cache_read if the readahead (do_sync_mmap_readahead) fails
to get page to the page cache. And we can see this happening in
the first trace. page_cache_read then calls add_to_page_cache_lru
and eventually gets to add_to_page_cache_locked which calls
mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
happen. This ENOMEM gets to the fault handler and kaboom.

So the fix is really much more complex than I thought. Although
add_to_page_cache_locked sounded like a good place it turned out to be
not in fact.

We need something more clever appaerently. One way would be not misusing
__GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32
bits for those flags in gfp_t so there should be some room there. 
Or we could do this per task flag, same we do for NO_IO in the current
-mm tree.
The later one seems easier wrt. gfp_mask passing horror - e.g.
__generic_file_aio_write doesn't pass flags and it can be called from
unlocked contexts as well.

I have to think about it some more.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] da9030_battery: include notifier.h

2013-02-06 Thread Michal Hocko

Ohh, I have just noticed that this could be introduced by "mm: break
circular include from linux/mmzone.h" in mm tree.
Adding Andrew to CC.

On Wed 06-02-13 10:14:58, Michal Hocko wrote:
> randconfig complains about:
> drivers/power/da9030_battery.c:113: error: field ‘nb’ has incomplete type
> 
> because there is no direct include for notifier.h which defines
> struct notifier_block.
> 
> Signed-off-by: Michal Hocko 
> ---
>  drivers/power/da9030_battery.c |1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/power/da9030_battery.c b/drivers/power/da9030_battery.c
> index 94762e6..e8c5a39 100644
> --- a/drivers/power/da9030_battery.c
> +++ b/drivers/power/da9030_battery.c
> @@ -22,6 +22,7 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  #define DA9030_FAULT_LOG 0x0a
>  #define DA9030_FAULT_LOG_OVER_TEMP   (1 << 7)
> -- 
> 1.7.10.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-06 Thread Michal Hocko

On Wed 06-02-13 15:01:19, Michal Hocko wrote:
> On Wed 06-02-13 02:17:21, azurIt wrote:
> > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I
> > >mentioned in a follow up email. Here is the full patch:
> > 
> > 
> > Here is the log where OOM, again, killed MySQL server [search for 
> > "(mysqld)"]:
> > http://www.watchdog.sk/lkml/oom_mysqld6
> 
> [...]
> WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
> Hardware name: S5000VSA
> gfp_mask:4304 nr_pages:1 oom:0 ret:2
> Pid: 3545, comm: apache2 Tainted: GW3.2.37-grsec #1
> Call Trace:
>  [] warn_slowpath_common+0x7a/0xb0
>  [] warn_slowpath_fmt+0x46/0x50
>  [] ? mem_cgroup_margin+0x73/0xa0
>  [] T.1149+0x2d9/0x610
>  [] ? blk_finish_plug+0x18/0x50
>  [] mem_cgroup_cache_charge+0xc4/0xf0
>  [] add_to_page_cache_locked+0x4f/0x140
>  [] add_to_page_cache_lru+0x22/0x50
>  [] filemap_fault+0x252/0x4f0
>  [] __do_fault+0x78/0x5a0
>  [] handle_pte_fault+0x84/0x940
>  [] ? vma_prio_tree_insert+0x30/0x50
>  [] ? vma_link+0x88/0xe0
>  [] handle_mm_fault+0x138/0x260
>  [] do_page_fault+0x13d/0x460
>  [] ? do_mmap_pgoff+0x3dc/0x430
>  [] page_fault+0x1f/0x30
> ---[ end trace 8817670349022007 ]---
> apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
> apache2 cpuset=uid mems_allowed=0
> Pid: 3545, comm: apache2 Tainted: GW3.2.37-grsec #1
> Call Trace:
>  [] dump_header+0x7e/0x1e0
>  [] ? find_lock_task_mm+0x2f/0x70
>  [] oom_kill_process+0x85/0x2a0
>  [] out_of_memory+0xe5/0x200
>  [] pagefault_out_of_memory+0xbd/0x110
>  [] mm_fault_error+0xb6/0x1a0
>  [] do_page_fault+0x3ee/0x460
>  [] ? do_mmap_pgoff+0x3dc/0x430
>  [] page_fault+0x1f/0x30
> 
> The first trace comes from the debugging WARN and it clearly points to
> a file fault path. __do_fault pre-charges a page in case we need to
> do CoW (copy-on-write) for the returned page. This one falls back to
> memcg OOM and never returns ENOMEM as I have mentioned earlier. 
> However, the fs fault handler (filemap_fault here) can fallback to
> page_cache_read if the readahead (do_sync_mmap_readahead) fails
> to get page to the page cache. And we can see this happening in
> the first trace. page_cache_read then calls add_to_page_cache_lru
> and eventually gets to add_to_page_cache_locked which calls
> mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
> happen. This ENOMEM gets to the fault handler and kaboom.
> 
> So the fix is really much more complex than I thought. Although
> add_to_page_cache_locked sounded like a good place it turned out to be
> not in fact.
> 
> We need something more clever appaerently. One way would be not misusing
> __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32
> bits for those flags in gfp_t so there should be some room there. 
> Or we could do this per task flag, same we do for NO_IO in the current
> -mm tree.
> The later one seems easier wrt. gfp_mask passing horror - e.g.
> __generic_file_aio_write doesn't pass flags and it can be called from
> unlocked contexts as well.

Ouch, PF_ flags space seem to be drained already because
task_struct::flags is just unsigned int so there is just one bit left. I
am not sure this is the best use for it. This will be a real pain!

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set

2013-02-06 Thread Michal Hocko

On Wed 06-02-13 15:22:19, Michal Hocko wrote:
> On Wed 06-02-13 15:01:19, Michal Hocko wrote:
> > On Wed 06-02-13 02:17:21, azurIt wrote:
> > > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I
> > > >mentioned in a follow up email. Here is the full patch:
> > > 
> > > 
> > > Here is the log where OOM, again, killed MySQL server [search for 
> > > "(mysqld)"]:
> > > http://www.watchdog.sk/lkml/oom_mysqld6
> > 
> > [...]
> > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
> > Hardware name: S5000VSA
> > gfp_mask:4304 nr_pages:1 oom:0 ret:2
> > Pid: 3545, comm: apache2 Tainted: GW3.2.37-grsec #1
> > Call Trace:
> >  [] warn_slowpath_common+0x7a/0xb0
> >  [] warn_slowpath_fmt+0x46/0x50
> >  [] ? mem_cgroup_margin+0x73/0xa0
> >  [] T.1149+0x2d9/0x610
> >  [] ? blk_finish_plug+0x18/0x50
> >  [] mem_cgroup_cache_charge+0xc4/0xf0
> >  [] add_to_page_cache_locked+0x4f/0x140
> >  [] add_to_page_cache_lru+0x22/0x50
> >  [] filemap_fault+0x252/0x4f0
> >  [] __do_fault+0x78/0x5a0
> >  [] handle_pte_fault+0x84/0x940
> >  [] ? vma_prio_tree_insert+0x30/0x50
> >  [] ? vma_link+0x88/0xe0
> >  [] handle_mm_fault+0x138/0x260
> >  [] do_page_fault+0x13d/0x460
> >  [] ? do_mmap_pgoff+0x3dc/0x430
> >  [] page_fault+0x1f/0x30
> > ---[ end trace 8817670349022007 ]---
> > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, 
> > oom_score_adj=0
> > apache2 cpuset=uid mems_allowed=0
> > Pid: 3545, comm: apache2 Tainted: GW3.2.37-grsec #1
> > Call Trace:
> >  [] dump_header+0x7e/0x1e0
> >  [] ? find_lock_task_mm+0x2f/0x70
> >  [] oom_kill_process+0x85/0x2a0
> >  [] out_of_memory+0xe5/0x200
> >  [] pagefault_out_of_memory+0xbd/0x110
> >  [] mm_fault_error+0xb6/0x1a0
> >  [] do_page_fault+0x3ee/0x460
> >  [] ? do_mmap_pgoff+0x3dc/0x430
> >  [] page_fault+0x1f/0x30
> > 
> > The first trace comes from the debugging WARN and it clearly points to
> > a file fault path. __do_fault pre-charges a page in case we need to
> > do CoW (copy-on-write) for the returned page. This one falls back to
> > memcg OOM and never returns ENOMEM as I have mentioned earlier. 
> > However, the fs fault handler (filemap_fault here) can fallback to
> > page_cache_read if the readahead (do_sync_mmap_readahead) fails
> > to get page to the page cache. And we can see this happening in
> > the first trace. page_cache_read then calls add_to_page_cache_lru
> > and eventually gets to add_to_page_cache_locked which calls
> > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
> > happen. This ENOMEM gets to the fault handler and kaboom.
> > 
> > So the fix is really much more complex than I thought. Although
> > add_to_page_cache_locked sounded like a good place it turned out to be
> > not in fact.
> > 
> > We need something more clever appaerently. One way would be not misusing
> > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32
> > bits for those flags in gfp_t so there should be some room there. 
> > Or we could do this per task flag, same we do for NO_IO in the current
> > -mm tree.
> > The later one seems easier wrt. gfp_mask passing horror - e.g.
> > __generic_file_aio_write doesn't pass flags and it can be called from
> > unlocked contexts as well.
> 
> Ouch, PF_ flags space seem to be drained already because
> task_struct::flags is just unsigned int so there is just one bit left. I
> am not sure this is the best use for it. This will be a real pain!

OK, so this something that should help you without any risk of false
OOMs. I do not believe that something like that would be accepted
upstream because it is really heavy. We will need to come up with
something more clever for upstream.
I have also added a warning which will trigger when the charge fails. If
you see too many of those messages then there is something bad going on
and the lack of OOM causes userspace to loop without getting any
progress.

So there you go - your personal patch ;) You can drop all other patches.
Please note I have just compile tested it. But it should be pretty
trivial to check it is correct
---
>From 6f155187f77c971b45caf05dbc80ca9c20bc278c Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Wed, 6 Feb 2013 16:45:07 +0100
Subject: [PATCH 1/2] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents other task to
terminate because it is blocked on the very same lock.
This can happen wh

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-07 Thread Michal Hocko

On Thu 07-02-13 20:01:45, KAMEZAWA Hiroyuki wrote:
> (2013/02/06 23:01), Michal Hocko wrote:
> >On Wed 06-02-13 02:17:21, azurIt wrote:
> >>>5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I
> >>>mentioned in a follow up email. Here is the full patch:
> >>
> >>
> >>Here is the log where OOM, again, killed MySQL server [search for 
> >>"(mysqld)"]:
> >>http://www.watchdog.sk/lkml/oom_mysqld6
> >
> >[...]
> >WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
> >Hardware name: S5000VSA
> >gfp_mask:4304 nr_pages:1 oom:0 ret:2
> >Pid: 3545, comm: apache2 Tainted: GW3.2.37-grsec #1
> >Call Trace:
> >  [] warn_slowpath_common+0x7a/0xb0
> >  [] warn_slowpath_fmt+0x46/0x50
> >  [] ? mem_cgroup_margin+0x73/0xa0
> >  [] T.1149+0x2d9/0x610
> >  [] ? blk_finish_plug+0x18/0x50
> >  [] mem_cgroup_cache_charge+0xc4/0xf0
> >  [] add_to_page_cache_locked+0x4f/0x140
> >  [] add_to_page_cache_lru+0x22/0x50
> >  [] filemap_fault+0x252/0x4f0
> >  [] __do_fault+0x78/0x5a0
> >  [] handle_pte_fault+0x84/0x940
> >  [] ? vma_prio_tree_insert+0x30/0x50
> >  [] ? vma_link+0x88/0xe0
> >  [] handle_mm_fault+0x138/0x260
> >  [] do_page_fault+0x13d/0x460
> >  [] ? do_mmap_pgoff+0x3dc/0x430
> >  [] page_fault+0x1f/0x30
> >---[ end trace 8817670349022007 ]---
> >apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
> >apache2 cpuset=uid mems_allowed=0
> >Pid: 3545, comm: apache2 Tainted: GW3.2.37-grsec #1
> >Call Trace:
> >  [] dump_header+0x7e/0x1e0
> >  [] ? find_lock_task_mm+0x2f/0x70
> >  [] oom_kill_process+0x85/0x2a0
> >  [] out_of_memory+0xe5/0x200
> >  [] pagefault_out_of_memory+0xbd/0x110
> >  [] mm_fault_error+0xb6/0x1a0
> >  [] do_page_fault+0x3ee/0x460
> >  [] ? do_mmap_pgoff+0x3dc/0x430
> >  [] page_fault+0x1f/0x30
> >
> >The first trace comes from the debugging WARN and it clearly points to
> >a file fault path. __do_fault pre-charges a page in case we need to
> >do CoW (copy-on-write) for the returned page. This one falls back to
> >memcg OOM and never returns ENOMEM as I have mentioned earlier.
> >However, the fs fault handler (filemap_fault here) can fallback to
> >page_cache_read if the readahead (do_sync_mmap_readahead) fails
> >to get page to the page cache. And we can see this happening in
> >the first trace. page_cache_read then calls add_to_page_cache_lru
> >and eventually gets to add_to_page_cache_locked which calls
> >mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
> >happen. This ENOMEM gets to the fault handler and kaboom.
> >
> 
> Hmm. do we need to increase the "limit" virtually at memcg oom until
> the oom-killed process dies ? It may be doable by increasing stock->cache
> of each cpuI think kernel can offer extra virtual charge up to
> oom-killed process's memory usage.

If we can guarantee that the overflow charges do not exceed the memory
usage of the killed process then this would work. The question is, how
do we find out how much we can overflow. immigrate_on_move will play
some role as well as the amount of the shared memory. I am afraid this
would get too complex. Nevertheless the idea is nice.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set

2013-02-08 Thread Michal Hocko

On Fri 08-02-13 06:03:04, azurIt wrote:
> Michal, thank you very much but it just didn't work and broke
> everything :(

I am sorry to hear that. The patch should help to solve the deadlock you
have seen earlier. It in no way can solve side effects of failing writes
and it also cannot help much if the oom is permanent.

> This happened:
> Problem started to occur really often immediately after booting the
> new kernel, every few minutes for one of my users. But everything
> other seems to work fine so i gave it a try for a day (which was a
> mistake). I grabbed some data for you and go to sleep:
> http://watchdog.sk/lkml/memcg-bug-4.tar.gz

Do you have logs from that time period?

I have only glanced through the stacks and most of the threads are
waiting in the mem_cgroup_handle_oom (mostly from the page fault path
where we do not have other options than waiting) which suggests that
your memory limit is seriously underestimated. If you look at the number
of charging failures (memory.failcnt per-group file) then you will get
9332083 failures in _average_ per group. This is a lot!
Not all those failures end with OOM, of course. But it clearly signals
that the workload need much more memory than the limit allows.

> Few hours later i was woke up from my sweet sweet dreams by alerts
> smses - Apache wasn't working and our system failed to restart
> it. When i observed the situation, two apache processes (of that user
> as above) were still running and it wasn't possible to kill them by
> any way. I grabbed some data for you:
> http://watchdog.sk/lkml/memcg-bug-5.tar.gz

There are only 5 groups in this one and all of them have no memory
charged (so no OOM going on). All tasks are somewhere in the ptrace
code.

grep cache -r .
./1360297489/memory.stat:cache 0
./1360297489/memory.stat:total_cache 65642496
./1360297491/memory.stat:cache 0
./1360297491/memory.stat:total_cache 65642496
./1360297492/memory.stat:cache 0
./1360297492/memory.stat:total_cache 65642496
./1360297490/memory.stat:cache 0
./1360297490/memory.stat:total_cache 65642496
./1360297488/memory.stat:cache 0
./1360297488/memory.stat:total_cache 65642496

which suggests that this is a parent group and the memory is charged in
a child group. I guess that all those are under OOM as the number seems
like they have limit at 62M.

> Then I logged to the console and this was waiting for me:
> http://watchdog.sk/lkml/error.jpg

This is just a warning and it should be harmless. There is just one WARN
in ptrace_check_attach:
WARN_ON_ONCE(task_is_stopped(child))

This has been introduced by
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=321fb561
and the commit description claim this shouldn't happen. I am not
familiar with this code but it sounds like a bug in the tracing code
which is not related to the discussed issue.

> Finally i rebooted into different kernel, wrote this e-mail and go to
> my lovely bed ;)
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set

2013-02-08 Thread Michal Hocko

On Fri 08-02-13 12:02:49, azurIt wrote:
> >
> >Do you have logs from that time period?
> >
> >I have only glanced through the stacks and most of the threads are
> >waiting in the mem_cgroup_handle_oom (mostly from the page fault path
> >where we do not have other options than waiting) which suggests that
> >your memory limit is seriously underestimated. If you look at the number
> >of charging failures (memory.failcnt per-group file) then you will get
> >9332083 failures in _average_ per group. This is a lot!
> >Not all those failures end with OOM, of course. But it clearly signals
> >that the workload need much more memory than the limit allows.
> 
> 
> What type of logs? I have all.

kernel log would be sufficient.

> Memory usage graph:
> http://www.watchdog.sk/lkml/memory2.png
> 
> New kernel was booted about 1:15. Data in memcg-bug-4.tar.gz were taken about 
> 2:35 and data in memcg-bug-5.tar.gz about 5:25. There was always lots of free 
> memory. Higher memory consumption between 3:39 and 5:33 was caused by data 
> backup and was completed few minutes before i restarted the server (this was 
> just a coincidence).
> 
> 
> 
> >There are only 5 groups in this one and all of them have no memory
> >charged (so no OOM going on). All tasks are somewhere in the ptrace
> >code.
> 
> 
> It's all from the same cgroup but from different time.
> 
> 
> 
> >grep cache -r .
> >./1360297489/memory.stat:cache 0
> >./1360297489/memory.stat:total_cache 65642496
> >./1360297491/memory.stat:cache 0
> >./1360297491/memory.stat:total_cache 65642496
> >./1360297492/memory.stat:cache 0
> >./1360297492/memory.stat:total_cache 65642496
> >./1360297490/memory.stat:cache 0
> >./1360297490/memory.stat:total_cache 65642496
> >./1360297488/memory.stat:cache 0
> >./1360297488/memory.stat:total_cache 65642496
> >
> >which suggests that this is a parent group and the memory is charged in
> >a child group. I guess that all those are under OOM as the number seems
> >like they have limit at 62M.
> 
> 
> The cgroup has limit 330M (346030080 bytes).

This limit is for top level groups, right? Those seem to children which
have 62MB charged - is that a limit for those children?

> As i said, these two processes

Which are those two processes?

> were stucked and was impossible to kill them. They were,
> maybe, the processes which i was trying to 'strace' before - 'strace'
> was freezed as always when the cgroup has this problem and i killed it
> (i was just trying if it is the original cgroup problem).

I have no idea what is the strace role here.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set

2013-02-08 Thread Michal Hocko

On Fri 08-02-13 14:56:16, azurIt wrote:
> Data are inside memcg-bug-5.tar.gz in directories bug///

ohh, I didn't get those were timestamp directories. It makes more sense
now.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set

2013-02-08 Thread Michal Hocko

On Fri 08-02-13 14:56:16, azurIt wrote:
> >kernel log would be sufficient.
> 
> 
> Full kernel log from kernel with you newest patch:
> http://watchdog.sk/lkml/kern2.log

OK, so the log says that there is a little slaughter on your yard:
$ grep "Memory cgroup out of memory:" kern2.log | wc -l
220

$ grep "Memory cgroup out of memory:" kern2.log | sed 's@.*Kill process 
\([0-9]*\) .*@\1@' | sort -u | wc -l
220

Which means that the oom killer didn't try to kill any task more than
once which is good because it tells us that the killed task manages to
die before we trigger oom again. So this is definitely not a deadlock.
You are just hitting OOM very often.
$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq 
-c | sort -k1 -n
  1 Task in /1091/uid killed as a result of limit of /1091
  1 Task in /1223/uid killed as a result of limit of /1223
  1 Task in /1229/uid killed as a result of limit of /1229
  1 Task in /1255/uid killed as a result of limit of /1255
  1 Task in /1424/uid killed as a result of limit of /1424
  1 Task in /1470/uid killed as a result of limit of /1470
  1 Task in /1567/uid killed as a result of limit of /1567
  2 Task in /1080/uid killed as a result of limit of /1080
  3 Task in /1381/uid killed as a result of limit of /1381
  4 Task in /1185/uid killed as a result of limit of /1185
  4 Task in /1289/uid killed as a result of limit of /1289
  4 Task in /1709/uid killed as a result of limit of /1709
  5 Task in /1279/uid killed as a result of limit of /1279
  6 Task in /1020/uid killed as a result of limit of /1020
  6 Task in /1527/uid killed as a result of limit of /1527
  9 Task in /1388/uid killed as a result of limit of /1388
 17 Task in /1281/uid killed as a result of limit of /1281
 22 Task in /1599/uid killed as a result of limit of /1599
 30 Task in /1155/uid killed as a result of limit of /1155
 31 Task in /1258/uid killed as a result of limit of /1258
 71 Task in /1293/uid killed as a result of limit of /1293

So the group 1293 suffers the most. I would check how much memory the
worklod in the group really needs because this level of OOM cannot
possible be healthy.

The log also says that the deadlock prevention implemented by the patch
triggered and some writes really failed due to potential OOM:
$ grep "If this message shows up" kern2.log 
Feb  8 01:17:10 server01 kernel: [  431.033593] __mem_cgroup_try_charge: 
task:apache2 pid:6733 got ENOMEM without OOM for memcg:8803807d5600. If 
this message shows up very often for the same task then there is a risk that 
the process is not able to make any progress because of the current limit. Try 
to enlarge the hard limit.
Feb  8 01:22:52 server01 kernel: [  773.556782] __mem_cgroup_try_charge: 
task:apache2 pid:12092 got ENOMEM without OOM for memcg:8803807d5600. If 
this message shows up very often for the same task then there is a risk that 
the process is not able to make any progress because of the current limit. Try 
to enlarge the hard limit.
Feb  8 01:22:52 server01 kernel: [  773.567916] __mem_cgroup_try_charge: 
task:apache2 pid:12093 got ENOMEM without OOM for memcg:8803807d5600. If 
this message shows up very often for the same task then there is a risk that 
the process is not able to make any progress because of the current limit. Try 
to enlarge the hard limit.
Feb  8 01:29:00 server01 kernel: [ 1141.355693] __mem_cgroup_try_charge: 
task:apache2 pid:17734 got ENOMEM without OOM for memcg:88036e956e00. If 
this message shows up very often for the same task then there is a risk that 
the process is not able to make any progress because of the current limit. Try 
to enlarge the hard limit.
Feb  8 03:30:39 server01 kernel: [ 8440.346811] __mem_cgroup_try_charge: 
task:apache2 pid:8687 got ENOMEM without OOM for memcg:8803654d6e00. If 
this message shows up very often for the same task then there is a risk that 
the process is not able to make any progress because of the current limit. Try 
to enlarge the hard limit.

This doesn't look very unhealthy. I have expected that write would fail
more often but it seems that the biggest memory pressure comes from
mmaps and page faults which have no way other than OOM.

So my suggestion would be to reconsider limits for groups to provide
more realistical environment.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-08 Thread Michal Hocko

> *memcg, gfp_t mask,
>   if (need_to_kill) {
>   finish_wait(&memcg_oom_waitq, &owait.wait);
>   mem_cgroup_out_of_memory(memcg, mask, order);
> + mem_cgroup_make_loan(memcg);
>   } else {
>   schedule();
>   finish_wait(&memcg_oom_waitq, &owait.wait);
> @@ -2748,6 +2807,8 @@ static void __mem_cgroup_cancel_charge(struct 
> mem_cgroup *memcg,
>   if (!mem_cgroup_is_root(memcg)) {
>   unsigned long bytes = nr_pages * PAGE_SIZE;
> + bytes = mem_cgroup_may_return_loan(memcg, bytes);
> +
>   res_counter_uncharge(&memcg->res, bytes);
>   if (do_swap_account)
>   res_counter_uncharge(&memcg->memsw, bytes);
> @@ -3989,6 +4050,7 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup 
> *memcg,
>  {
>   struct memcg_batch_info *batch = NULL;
>   bool uncharge_memsw = true;
> + unsigned long val;
>   /* If swapout, usage of swap doesn't decrease */
>   if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> @@ -4029,9 +4091,11 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup 
> *memcg,
>   batch->memsw_nr_pages++;
>   return;
>  direct_uncharge:
> - res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
> + val = nr_pages * PAGE_SIZE;
> + val = mem_cgroup_may_return_loan(memcg, val);
> + res_counter_uncharge(&memcg->res, val);
>   if (uncharge_memsw)
> - res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
> + res_counter_uncharge(&memcg->memsw, val);
>   if (unlikely(batch->memcg != memcg))
>   memcg_oom_recover(memcg);
>  }
> @@ -4182,6 +4246,7 @@ void mem_cgroup_uncharge_start(void)
>  void mem_cgroup_uncharge_end(void)
>  {
>   struct memcg_batch_info *batch = ¤t->memcg_batch;
> + unsigned long val;
>   if (!batch->do_batch)
>   return;
> @@ -4192,16 +4257,16 @@ void mem_cgroup_uncharge_end(void)
>   if (!batch->memcg)
>   return;
> + val = batch->nr_pages * PAGE_SIZE;
> + val = mem_cgroup_may_return_loan(batch->memcg, val);
>   /*
>* This "batch->memcg" is valid without any css_get/put etc...
>* bacause we hide charges behind us.
>*/
>   if (batch->nr_pages)
> - res_counter_uncharge(&batch->memcg->res,
> -  batch->nr_pages * PAGE_SIZE);
> + res_counter_uncharge(&batch->memcg->res, val);
>   if (batch->memsw_nr_pages)
> - res_counter_uncharge(&batch->memcg->memsw,
> -  batch->memsw_nr_pages * PAGE_SIZE);
> + res_counter_uncharge(&batch->memcg->memsw, val);
>   memcg_oom_recover(batch->memcg);
>   /* forget this pointer (for sanity check) */
>   batch->memcg = NULL;
> @@ -6291,6 +6356,8 @@ mem_cgroup_css_alloc(struct cgroup *cont)
>   memcg->move_charge_at_immigrate = 0;
>   mutex_init(&memcg->thresholds_lock);
>   spin_lock_init(&memcg->move_lock);
> + memcg->loan = 0;
> + spin_lock_init(&memcg->loan_lock);
>   return &memcg->css;
> -- 
> 1.7.10.2
> 
> 
> 
> 
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-08 Thread Michal Hocko

On Thu 07-02-13 20:27:00, Greg Thelen wrote:
> On Tue, Feb 05 2013, Michal Hocko wrote:
> 
> > On Tue 05-02-13 10:09:57, Greg Thelen wrote:
> >> On Tue, Feb 05 2013, Michal Hocko wrote:
> >> 
> >> > On Tue 05-02-13 08:48:23, Greg Thelen wrote:
> >> >> On Tue, Feb 05 2013, Michal Hocko wrote:
> >> >> 
> >> >> > On Tue 05-02-13 15:49:47, azurIt wrote:
> >> >> > [...]
> >> >> >> Just to be sure - am i supposed to apply this two patches?
> >> >> >> http://watchdog.sk/lkml/patches/
> >> >> >
> >> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I
> >> >> > mentioned in a follow up email. Here is the full patch:
> >> >> > ---
> >> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
> >> >> > From: Michal Hocko 
> >> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100
> >> >> > Subject: [PATCH] memcg: do not trigger OOM from 
> >> >> > add_to_page_cache_locked
> >> >> >
> >> >> > memcg oom killer might deadlock if the process which falls down to
> >> >> > mem_cgroup_handle_oom holds a lock which prevents other task to
> >> >> > terminate because it is blocked on the very same lock.
> >> >> > This can happen when a write system call needs to allocate a page but
> >> >> > the allocation hits the memcg hard limit and there is nothing to 
> >> >> > reclaim
> >> >> > (e.g. there is no swap or swap limit is hit as well and all cache 
> >> >> > pages
> >> >> > have been reclaimed already) and the process selected by memcg OOM
> >> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
> >> >> >
> >> >> > Process A
> >> >> > [] do_truncate+0x58/0xa0# takes i_mutex
> >> >> > [] do_last+0x250/0xa30
> >> >> > [] path_openat+0xd7/0x440
> >> >> > [] do_filp_open+0x49/0xa0
> >> >> > [] do_sys_open+0x106/0x240
> >> >> > [] sys_open+0x20/0x30
> >> >> > [] system_call_fastpath+0x18/0x1d
> >> >> > [] 0x
> >> >> >
> >> >> > Process B
> >> >> > [] mem_cgroup_handle_oom+0x241/0x3b0
> >> >> > [] T.1146+0x5ab/0x5c0
> >> >> > [] mem_cgroup_cache_charge+0xbe/0xe0
> >> >> > [] add_to_page_cache_locked+0x4c/0x140
> >> >> > [] add_to_page_cache_lru+0x22/0x50
> >> >> > [] grab_cache_page_write_begin+0x8b/0xe0
> >> >> > [] ext3_write_begin+0x88/0x270
> >> >> > [] generic_file_buffered_write+0x116/0x290
> >> >> > [] __generic_file_aio_write+0x27c/0x480
> >> >> > [] generic_file_aio_write+0x76/0xf0   # 
> >> >> > takes ->i_mutex
> >> >> > [] do_sync_write+0xea/0x130
> >> >> > [] vfs_write+0xf3/0x1f0
> >> >> > [] sys_write+0x51/0x90
> >> >> > [] system_call_fastpath+0x18/0x1d
> >> >> > [] 0x
> >> >> 
> >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into
> >> >> __page_cache_alloc() and mem_cgroup_cache_charge().  Which makes me
> >> >> think that this deadlock is also possible in the page allocator even
> >> >> before getting to add_to_page_cache_lru.  no?
> >> >
> >> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR
> >> > and it shouldn't be called from the pageout path so __page_cache_alloc
> >> > should be safe.
> >> 
> >> I wasn't clear, sorry.  My concern is not that pageout() grabs i_mutex.
> >> My concern is that __page_cache_alloc() will invoke the oom killer and
> >> select a victim which wants i_mutex.  This victim will deadlock because
> >> the oom killer caller already holds i_mutex.  
> >
> > That would be true for the memcg oom because that one is blocking but
> > the global oom just puts the allocator into sleep for a while and then
> > the allocator should back off eventually (unless this is NOFAIL
> > allocation). I would need to look closer whether this is really the case
> > - I haven't seen that allocator code path for a while...
> 
> I think the page al

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-08 Thread Michal Hocko

On Fri 08-02-13 17:29:18, Michal Hocko wrote:
[...]
> OK, I have checked the allocator slow path and you are right even
> GFP_KERNEL will not fail. This can lead to similar deadlocks - e.g.
> OOM killed task blocked on down_write(mmap_sem) while the page fault
> handler holding mmap_sem for reading and allocating a new page without
> any progress.

And now that I think about it some more it sounds like it shouldn't be
possible because allocator would fail because it would see
TIF_MEMDIE (OOM killer kills all threads that share the same mm).
But maybe there are other locks that are dangerous, but I think that the
risk is pretty low.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set

2013-02-08 Thread Michal Hocko

On Fri 08-02-13 16:58:05, azurIt wrote:
[...]
> I took the kernel log from yesterday from the same time frame:
> 
> $ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | 
> uniq -c | sort -k1 -n
>   1 Task in /1252/uid killed as a result of limit of /1252
>   1 Task in /1709/uid killed as a result of limit of /1709
>   2 Task in /1185/uid killed as a result of limit of /1185
>   2 Task in /1388/uid killed as a result of limit of /1388
>   2 Task in /1567/uid killed as a result of limit of /1567
>   2 Task in /1650/uid killed as a result of limit of /1650
>   3 Task in /1527/uid killed as a result of limit of /1527
>   5 Task in /1552/uid killed as a result of limit of /1552
>1634 Task in /1258/uid killed as a result of limit of /1258
> 
> As you can see, there were much more OOM in '1258' and no such
> problems like this night (well, there were never such problems before
> :) ).

Well, all the patch does is that it prevents from the deadlock we have
seen earlier. Previously the writer would block on the oom wait queue
while it fails with ENOMEM now. Caller sees this as a short write which
can be retried (it is a question whether userspace can cope with that
properly). All other OOMs are preserved.

I suspect that all the problems you are seeing now are just side effects
of the OOM conditions.

> As i said, cgroup 1258 were freezing every few minutes with your
> latest patch so there must be something wrong (it usually freezes
> about once per day). And it was really freezed (i checked that), the
> sypthoms were:

I assume you have checked that the killed processes eventually die,
right?

>  - cannot strace any of cgroup processes
>  - no new processes were started, still the same processes were 'running'
>  - kernel was unable to resolve this by it's own
>  - all processes togather were taking 100% CPU
>  - the whole memory limit was used
> (see memcg-bug-4.tar.gz for more info)

Well, I do not see anything supsicious during that time period
(timestamps translate between Fri Feb  8 02:34:05 and Fri Feb  8
02:36:48). The kernel log shows a lot of oom during that time. All
killed processes die eventually.

> Unfortunately i forget to check if killing only few of the processes
> will resolve it (i always killed them all yesterday night). Don't
> know if is was in deadlock or not but kernel was definitely unable
> to resolve the problem.

Nothing shows it would be a deadlock so far. It is well possible that
the userspace went mad when seeing a lot of processes dying because it
doesn't expect it.

> And there is still a mystery of two freezed processes which cannot be
> killed.
> 
> By the way, i KNOW that so much OOM is not healthy but the client
> simply don't want to buy more memory. He knows about the problem of
> unsufficient memory limit.

Well, then you would see a permanent flood of OOM killing, I am afraid.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set

2013-02-10 Thread Michal Hocko

>the userspace went mad when seeing a lot of processes dying because it
> >doesn't expect it.
> 
> Lots of processes are dying also now, without your latest patch, and
> no such things are happening. I'm sure there is something more it
> this, maybe it revealed another bug?

So far nothing shows that there would be anything broken wrt. memcg OOM
killer. The ptrace issue sounds strange, all right, but that is another
story and worth a separate investigation. I would be interested whether
you still see anything wrong going on without that in game.

You can get pretty nice overview of what is going on wrt. OOM from the
log.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set

2013-02-11 Thread Michal Hocko

handle_pte_fault+0x264/0x940
[] handle_mm_fault+0x138/0x260
[] do_page_fault+0x13d/0x460
[] page_fault+0x1f/0x30
[] 0x

And it loops in it until the end which is possible as well if the group
is under permanent OOM condition and the task is not selected to be
killed.

Unfortunately I am not able to reproduce this behavior even if I try
to hammer OOM like mad so I am afraid I cannot help you much without
further debugging patches.
I do realize that experimenting in your environment is a problem but I
do not many options left. Please do not use strace and rather collect
/proc/pid/stack instead. It would be also helpful to get group/tasks
file to have a full list of tasks in the group
---
>From 1139745d43cc8c56bc79c219291d1e5281799dd4 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 11 Feb 2013 12:18:36 +0100
Subject: [PATCH] oom: debug skipping killing

---
 mm/oom_kill.c |   10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..3d759f0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -329,6 +329,8 @@ static struct task_struct *select_bad_process(unsigned int 
*ppoints,
if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
if (unlikely(frozen(p)))
thaw_process(p);
+   printk(KERN_WARNING"XXX: pid:%d (flags:%u) is 
TIF_MEMDIE. Waiting for it\n",
+   p->pid, p->flags);
return ERR_PTR(-1UL);
}
if (!p->mm)
@@ -353,8 +355,11 @@ static struct task_struct *select_bad_process(unsigned int 
*ppoints,
 * then wait for it to finish before killing
 * some other task unnecessarily.
 */
-   if (!(p->group_leader->ptrace & PT_TRACE_EXIT))
+   if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) 
{
+   printk(KERN_WARNING"XXX: pid:%d 
(flags:%u) is PF_EXITING. Waiting for it\n",
+   p->pid, p->flags);
return ERR_PTR(-1UL);
+   }
}
}
 
@@ -494,6 +499,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t 
gfp_mask, int order,
 * its children or threads, just set TIF_MEMDIE so it can die quickly
 */
if (p->flags & PF_EXITING) {
+   printk(KERN_WARNING"XXX: pid:%d (flags:%u). Not killing 
PF_EXITING\n", p->pid, p->flags);
set_tsk_thread_flag(p, TIF_MEMDIE);
return 0;
}
@@ -567,6 +573,8 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t 
gfp_mask)
 * its memory.
 */
if (fatal_signal_pending(current)) {
+   printk(KERN_WARNING"XXX: pid:%d (flags:%u) has 
fatal_signal_pending. Waiting for it\n",
+   p->pid, p->flags);
set_thread_flag(TIF_MEMDIE);
return;
}
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators

2013-02-11 Thread Michal Hocko

On Fri 08-02-13 14:33:18, Johannes Weiner wrote:
[...]
> for each in hierarchy:
>   for each node:
> for each zone:
>   for each reclaim priority:
> 
> every time a cgroup is destroyed.  I don't think such a hammer is
> justified in general, let alone for consolidating code a little.
> 
> Can we invalidate the position cache lazily?  Have a global "cgroup
> destruction" counter and store a snapshot of that counter whenever we
> put a cgroup pointer in the position cache.  We only use the cached
> pointer if that counter has not changed in the meantime, so we know
> that the cgroup still exists.

Currently we have:
rcu_read_lock() // keeps cgroup links safe
iter->iter_lock // keeps selection exclusive for a specific iterator
1) global_counter == iter_counter
2) css_tryget(cached_memcg)  // check it is still alive
rcu_read_unlock()

What would protect us from races when css would disappear between 1 and
2?

css is invalidated from worker context scheduled from __css_put and it
is using dentry locking which we surely do not want to pull here. We
could hook into css_offline which is called with cgroup_mutex but we
cannot use this one here because it is no longer exported and Tejun
would kill us for that.
So we can add a new global memcg internal lock to do this atomically.
Ohh, this is getting uglier...

> It is pretty pretty imprecise and we invalidate the whole cache every
> time a cgroup is destroyed, but I think that should be okay. 

I am not sure this is OK because this gives an indirect way of
influencing reclaim in one hierarchy by another one which opens a door
for regressions (or malicious over-reclaim in the extreme case).
So I do not like this very much.

> If not, better ideas are welcome.

Maybe we could keep the counter per memcg but that would mean that we
would need to go up the hierarchy as well. We wouldn't have to go over
node-zone-priority cleanup so it would be much more lightweight.

I am not sure this is necessarily better than explicit cleanup because
it brings yet another kind of generation number to the game but I guess
I can live with it if people really thing the relaxed way is much
better.
What do you think about the patch below (untested yet)?
---
>From 8169aa49649753822661b8fbbfba0852dcfedba6 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 11 Feb 2013 16:13:48 +0100
Subject: [PATCH] memcg: relax memcg iter caching

Now that per-node-zone-priority iterator caches memory cgroups rather
than their css ids we have to be careful and remove them from the
iterator when they are on the way out otherwise they might hang for
unbounded amount of time (until the global/targeted reclaim triggers the
zone under priority to find out the group is dead and let it to find the
final rest).

We can fix this issue by relaxing rules for the last_visited memcg as
well.
Instead of taking reference to css we can just use its pointer and
track the number of removed groups for each memcg. This number would be
stored into iterator everytime when a memcg is cached. If the iter count
doesn't match the curent walker root's one we will start over from the
root again. The group counter is incremented upwards the hierarchy every
time a group is removed.

dead_count_lock makes sure that we do not race with memcg removal.

Spotted-by: Ying Han 
Original-idea-by: Johannes Weiner 
Signed-off-by: Michal Hocko 
---
 mm/memcontrol.c |   68 ++-
 1 file changed, 57 insertions(+), 11 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e9f5c47..65bf2cb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu {
 };

 struct mem_cgroup_reclaim_iter {
-   /* last scanned hierarchy member with elevated css ref count */
+   /*
+* last scanned hierarchy member. Valid only if last_dead_count
+* matches memcg->dead_count of the hierarchy root group.
+*/
struct mem_cgroup *last_visited;
+   unsigned int last_dead_count;
+
/* scan generation, increased every round-trip */
unsigned int generation;
/* lock to protect the position and generation */
@@ -357,6 +362,8 @@ struct mem_cgroup {
struct mem_cgroup_stat_cpu nocpu_base;
spinlock_t pcp_counter_lock;

+   spinlock_t  dead_count_lock;
+   unsigned intdead_count;
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
struct tcp_memcontrol tcp_mem;
 #endif
@@ -1162,15 +1169,24 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
*root,
mz = mem_cgroup_zoneinfo(root, nid, zid);
iter = &mz->reclaim_iter[reclaim->priority];
spin_lock(&iter->iter_lock);
-   last_visited = iter->last_visited

Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators

2013-02-11 Thread Michal Hocko

On Mon 11-02-13 12:56:19, Johannes Weiner wrote:
> On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote:
> > On Fri 08-02-13 14:33:18, Johannes Weiner wrote:
> > [...]
> > > for each in hierarchy:
> > >   for each node:
> > > for each zone:
> > >   for each reclaim priority:
> > > 
> > > every time a cgroup is destroyed.  I don't think such a hammer is
> > > justified in general, let alone for consolidating code a little.
> > > 
> > > Can we invalidate the position cache lazily?  Have a global "cgroup
> > > destruction" counter and store a snapshot of that counter whenever we
> > > put a cgroup pointer in the position cache.  We only use the cached
> > > pointer if that counter has not changed in the meantime, so we know
> > > that the cgroup still exists.
> > 
> > Currently we have:
> > rcu_read_lock() // keeps cgroup links safe
> > iter->iter_lock // keeps selection exclusive for a specific iterator
> > 1) global_counter == iter_counter
> > 2) css_tryget(cached_memcg)  // check it is still alive
> > rcu_read_unlock()
> > 
> > What would protect us from races when css would disappear between 1 and
> > 2?
> 
> rcu

That was my first attempt but then I convinced myself it might not be
sufficient. But now that I think about it more I guess you are right.
 
> > css is invalidated from worker context scheduled from __css_put and it
> > is using dentry locking which we surely do not want to pull here. We
> > could hook into css_offline which is called with cgroup_mutex but we
> > cannot use this one here because it is no longer exported and Tejun
> > would kill us for that.
> > So we can add a new global memcg internal lock to do this atomically.
> > Ohh, this is getting uglier...
> 
> A racing final css_put() means that the tryget fails, but our RCU read
> lock keeps the CSS allocated.  If the dead_count is uptodate, it means
> that the rcu read lock was acquired before the synchronize_rcu()
> before the css is freed.

yes.

> 
> > > It is pretty pretty imprecise and we invalidate the whole cache every
> > > time a cgroup is destroyed, but I think that should be okay. 
> > 
> > I am not sure this is OK because this gives an indirect way of
> > influencing reclaim in one hierarchy by another one which opens a door
> > for regressions (or malicious over-reclaim in the extreme case).
> > So I do not like this very much.
> > 
> > > If not, better ideas are welcome.
> > 
> > Maybe we could keep the counter per memcg but that would mean that we
> > would need to go up the hierarchy as well. We wouldn't have to go over
> > node-zone-priority cleanup so it would be much more lightweight.
> > 
> > I am not sure this is necessarily better than explicit cleanup because
> > it brings yet another kind of generation number to the game but I guess
> > I can live with it if people really thing the relaxed way is much
> > better.
> > What do you think about the patch below (untested yet)?
> 
> Better, but I think you can get rid of both locks:

What is the other lock you have in mind.

> mem_cgroup_iter:
> rcu_read_lock()
> if atomic_read(&root->dead_count) == iter->dead_count:
>   smp_rmb()
>   if tryget(iter->position):
> position = iter->position
> memcg = find_next(postion)
> css_put(position)
> iter->position = memcg
> smp_wmb() /* Write position cache BEFORE marking it uptodate */
> iter->dead_count = atomic_read(&root->dead_count)
> rcu_read_unlock()

Updated patch bellow:
---
>From 756c4f0091d250bc5ff816f8e9d11840e8522b3a Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 11 Feb 2013 20:23:51 +0100
Subject: [PATCH] memcg: relax memcg iter caching

Now that per-node-zone-priority iterator caches memory cgroups rather
than their css ids we have to be careful and remove them from the
iterator when they are on the way out otherwise they might hang for
unbounded amount of time (until the global/targeted reclaim triggers the
zone under priority to find out the group is dead and let it to find the
final rest).

We can fix this issue by relaxing rules for the last_visited memcg as
well.
Instead of taking reference to css before it is stored into
iter->last_visited we can just store its pointer and track the number of
removed groups for each memcg. This number would be stored into iterator
everytime when a memcg is cached. If the iter count doesn't match the
curent walker root's one we will start over from the root again. The
group counter is incremented upwards the hierarchy every ti

Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators

2013-02-11 Thread Michal Hocko

On Mon 11-02-13 14:58:24, Johannes Weiner wrote:
> On Mon, Feb 11, 2013 at 08:29:29PM +0100, Michal Hocko wrote:
> > On Mon 11-02-13 12:56:19, Johannes Weiner wrote:
> > > On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote:
> > > > Maybe we could keep the counter per memcg but that would mean that we
> > > > would need to go up the hierarchy as well. We wouldn't have to go over
> > > > node-zone-priority cleanup so it would be much more lightweight.
> > > > 
> > > > I am not sure this is necessarily better than explicit cleanup because
> > > > it brings yet another kind of generation number to the game but I guess
> > > > I can live with it if people really thing the relaxed way is much
> > > > better.
> > > > What do you think about the patch below (untested yet)?
> > > 
> > > Better, but I think you can get rid of both locks:
> > 
> > What is the other lock you have in mind.
> 
> The iter lock itself.  I mean, multiple reclaimers can still race but
> there won't be any corruption (if you make iter->dead_count a long,
> setting it happens atomically, we nly need the memcg->dead_count to be
> an atomic because of the inc) and the worst that could happen is that
> a reclaim starts at the wrong point in hierarchy, right?

The lack of synchronization basically means that 2 parallel reclaimers
can reclaim every group exactly once (ideally) or up to each group
twice in the worst case.
So the exclusion was quite comfortable.

> But as you said in the changelog that introduced the lock, it's never
> actually been a practical problem.

That is true but those bugs would be subtle though so I wouldn't be
opposed to prevent from them before we get burnt. But if you think that
we should keep the previous semantic I can drop that patch.

> You just need to put the wmb back in place, so that we never see the
> dead_count give the green light while the cached position is stale, or
> we'll tryget random memory.
> 
> > > mem_cgroup_iter:
> > > rcu_read_lock()
> > > if atomic_read(&root->dead_count) == iter->dead_count:
> > >   smp_rmb()
> > >   if tryget(iter->position):
> > > position = iter->position
> > > memcg = find_next(postion)
> > > css_put(position)
> > > iter->position = memcg
> > > smp_wmb() /* Write position cache BEFORE marking it uptodate */
> > > iter->dead_count = atomic_read(&root->dead_count)
> > > rcu_read_unlock()
> > 
> > Updated patch bellow:
> 
> Cool, thanks.  I hope you don't find it too ugly anymore :-)

It's getting trick and you know how people love when you have to play
and rely on atomics with memory barriers...
 
[...]
> > +
> > +   /*
> > +* last_visited might be invalid if some of the group
> > +* downwards was removed. As we do not know which one
> > +* disappeared we have to start all over again from the
> > +* root.
> > +* css ref count then makes sure that css won't
> > +* disappear while we iterate to the next memcg
> > +*/
> > +   last_visited = iter->last_visited;
> > +   dead_count = atomic_read(&root->dead_count);
> > +   smp_rmb();
> 
> Confused about this barrier, see below.
> 
> As per above, if you remove the iter lock, those lines are mixed up.
> You need to read the dead count first because the writer updates the
> dead count after it sets the new position. 

You are right, we need
+   dead_count = atomic_read(&root->dead_count);
+   smp_rmb();
+   last_visited = iter->last_visited;

> That way, if the dead count gives the go-ahead, you KNOW that the
> position cache is valid, because it has been updated first.

OK, you are right. We can live without css_tryget because dead_count is
either OK which means that css would be alive at least this rcu period
(and RCU walk would be safe as well) or it is incremented which means
that we have started css_offline already and then css is dead already.
So css_tryget can be dropped.

> If either the two reads or the two writes get reordered, you risk
> seeing a matching dead count while the position cache is stale.
> 
> > +   if (last_visited &&
> > +   ((dead_count != iter->last_dead_count) 
> > ||
> > +!css_tryget(&last_visited-&g

Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators

2013-02-11 Thread Michal Hocko

On Mon 11-02-13 22:27:56, Michal Hocko wrote:
[...]
> I will get back to this tomorrow.

Maybe not a great idea as it is getting late here and brain turns into
cabbage but there we go:
---
>From f927358fe620837081d7a7ec6bf27af378deb35d Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 11 Feb 2013 23:02:00 +0100
Subject: [PATCH] memcg: relax memcg iter caching

Now that per-node-zone-priority iterator caches memory cgroups rather
than their css ids we have to be careful and remove them from the
iterator when they are on the way out otherwise they might hang for
unbounded amount of time (until the global/targeted reclaim triggers the
zone under priority to find out the group is dead and let it to find the
final rest).

We can fix this issue by relaxing rules for the last_visited memcg as
well.
Instead of taking reference to css before it is stored into
iter->last_visited we can just store its pointer and track the number of
removed groups for each memcg. This number would be stored into iterator
everytime when a memcg is cached. If the iter count doesn't match the
curent walker root's one we will start over from the root again. The
group counter is incremented upwards the hierarchy every time a group is
removed.

Locking rules got a bit complicated. We primarily rely on rcu read
lock which makes sure that once we see an up-to-date dead_count then
iter->last_visited is valid for RCU walk. smp_rmb makes sure that
dead_count is read before last_visited and last_dead_count while smp_wmb
makes sure that last_visited is updated before last_dead_count so the
up-to-date last_dead_count cannot point to an outdated last_visited.
Which also means that css reference counting is no longer needed because
RCU will keep last_visited alive.

Spotted-by: Ying Han 
Original-idea-by: Johannes Weiner 
Signed-off-by: Michal Hocko 
---
 mm/memcontrol.c |   53 -
 1 file changed, 44 insertions(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e9f5c47..42f9d94 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu {
 };

 struct mem_cgroup_reclaim_iter {
-   /* last scanned hierarchy member with elevated css ref count */
+   /*
+* last scanned hierarchy member. Valid only if last_dead_count
+* matches memcg->dead_count of the hierarchy root group.
+*/
struct mem_cgroup *last_visited;
+   unsigned int last_dead_count;
+
/* scan generation, increased every round-trip */
unsigned int generation;
/* lock to protect the position and generation */
@@ -357,6 +362,7 @@ struct mem_cgroup {
struct mem_cgroup_stat_cpu nocpu_base;
spinlock_t pcp_counter_lock;

+   atomic_tdead_count;
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
struct tcp_memcontrol tcp_mem;
 #endif
@@ -1158,19 +1164,30 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
*root,
int nid = zone_to_nid(reclaim->zone);
int zid = zone_idx(reclaim->zone);
struct mem_cgroup_per_zone *mz;
+   unsigned int dead_count;

mz = mem_cgroup_zoneinfo(root, nid, zid);
iter = &mz->reclaim_iter[reclaim->priority];
spin_lock(&iter->iter_lock);
-   last_visited = iter->last_visited;
if (prev && reclaim->generation != iter->generation) {
-   if (last_visited) {
-   css_put(&last_visited->css);
-   iter->last_visited = NULL;
-   }
+   iter->last_visited = NULL;
spin_unlock(&iter->iter_lock);
goto out_unlock;
}
+
+   /*
+* last_visited might be invalid if some of the group
+* downwards was removed. As we do not know which one
+* disappeared we have to start all over again from the
+* root.
+*/
+   dead_count = atomic_read(&root->dead_count);
+   smp_rmb();
+   last_visited = iter->last_visited;
+   if (last_visited &&
+   ((dead_count != 
iter->last_dead_count))) {
+   last_visited = NULL;
+   }
}

/*
@@ -1210,10 +1227,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
*root,
if (css && !memcg)

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 11648 matches

Mail list logo