[PATCH] tools/perf: don't include libio.h in arch's dwarf-regs.c
Including libio.h causes build failures on uClibc systems (which lack libio.h). It appears that libio.h was only included to pull in a definition for NULL, so it has been replaced by stddef.h. On powerpc, libio.h was conditionally included, but could be removed completely as it is unneeded. Also, the include of stdlib.h was changed to stddef.h (as again, only NULL is needed). Build tested on powerpc and x86. Signed-off-by: Cody P Schafer --- tools/perf/arch/powerpc/util/dwarf-regs.c | 5 + tools/perf/arch/s390/util/dwarf-regs.c| 2 +- tools/perf/arch/sh/util/dwarf-regs.c | 2 +- tools/perf/arch/sparc/util/dwarf-regs.c | 2 +- tools/perf/arch/x86/util/dwarf-regs.c | 2 +- 5 files changed, 5 insertions(+), 8 deletions(-) diff --git a/tools/perf/arch/powerpc/util/dwarf-regs.c b/tools/perf/arch/powerpc/util/dwarf-regs.c index 7cdd61d..733151c 100644 --- a/tools/perf/arch/powerpc/util/dwarf-regs.c +++ b/tools/perf/arch/powerpc/util/dwarf-regs.c @@ -9,10 +9,7 @@ * 2 of the License, or (at your option) any later version. */ -#include <stdlib.h> -#ifndef __UCLIBC__ -#include <libio.h> -#endif +#include <stddef.h> #include <dwarf-regs.h> diff --git a/tools/perf/arch/s390/util/dwarf-regs.c b/tools/perf/arch/s390/util/dwarf-regs.c index e19653e..0469df0 100644 --- a/tools/perf/arch/s390/util/dwarf-regs.c +++ b/tools/perf/arch/s390/util/dwarf-regs.c @@ -6,7 +6,7 @@ * */ -#include <libio.h> +#include <stddef.h> #include <dwarf-regs.h> #define NUM_GPRS 16 diff --git a/tools/perf/arch/sh/util/dwarf-regs.c b/tools/perf/arch/sh/util/dwarf-regs.c index a11edb0..0d0897f 100644 --- a/tools/perf/arch/sh/util/dwarf-regs.c +++ b/tools/perf/arch/sh/util/dwarf-regs.c @@ -19,7 +19,7 @@ * */ -#include <libio.h> +#include <stddef.h> #include <dwarf-regs.h> /* diff --git a/tools/perf/arch/sparc/util/dwarf-regs.c b/tools/perf/arch/sparc/util/dwarf-regs.c index 0ab8848..92eda41 100644 --- a/tools/perf/arch/sparc/util/dwarf-regs.c +++ b/tools/perf/arch/sparc/util/dwarf-regs.c @@ -9,7 +9,7 @@ * 2 of the License, or (at your option) any later version. */ -#include <libio.h> +#include <stddef.h> #include <dwarf-regs.h> #define SPARC_MAX_REGS 96 diff --git a/tools/perf/arch/x86/util/dwarf-regs.c b/tools/perf/arch/x86/util/dwarf-regs.c index a794d30..be22dd4 100644 --- a/tools/perf/arch/x86/util/dwarf-regs.c +++ b/tools/perf/arch/x86/util/dwarf-regs.c @@ -20,7 +20,7 @@ * */ -#include <libio.h> +#include <stddef.h> #include <dwarf-regs.h> /* -- 1.8.1.1
Re: [PATCH 6/9] mm/page_alloc: add informative debugging message in page_outside_zone_boundaries()
On 02/01/2013 04:29 PM, Andrew Morton wrote: On Fri, 1 Feb 2013 16:28:48 -0800 Andrew Morton wrote: + if (ret) + pr_debug("page %lu outside zone [ %lu - %lu ]\n", + pfn, start_pfn, start_pfn + sp); + return ret; } As this condition leads to a VM_BUG_ON(), "pr_debug" seems rather wimpy and I doubt if we need to be concerned about flooding the console. I'll switch it to pr_err. otoh, as nobody has ever hit that VM_BUG_ON() (yes?), do we really need the patch? I've hit this bug while developing some code that moves pages between zones. As it helped me debug that issue with my own code, I could see how another developer might be helped by it.
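For context, a minimal sketch of how the whole check reads once the pr_debug -> pr_err swap is applied. This is reconstructed from the fragments quoted above plus the zone-span seqlock accessors in include/linux/memory_hotplug.h, so details may differ slightly from the final commit:

static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
	int ret = 0;
	unsigned seq;
	unsigned long pfn = page_to_pfn(page);
	unsigned long sp, start_pfn;

	do {
		/* retry if memory hotplug resized the zone under us */
		seq = zone_span_seqbegin(zone);
		start_pfn = zone->zone_start_pfn;
		sp = zone->spanned_pages;
		if (pfn < start_pfn || pfn >= start_pfn + sp)
			ret = 1;
	} while (zone_span_seqretry(zone, seq));

	if (ret)
		pr_err("page %lu outside zone [ %lu - %lu ]\n",
			pfn, start_pfn, start_pfn + sp);

	return ret;
}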
Re: [PATCH 1/9] mm: add SECTION_IN_PAGE_FLAGS
On 02/01/2013 04:20 PM, Andrew Morton wrote: On Thu, 17 Jan 2013 14:52:53 -0800 Cody P Schafer wrote: Instead of directly utilizing a combination of config options to determine this, add a macro to specifically address it. ... --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -625,6 +625,10 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) #define NODE_NOT_IN_PAGE_FLAGS #endif +#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) +#define SECTION_IN_PAGE_FLAGS +#endif We could do this in Kconfig itself, in the definition of a new CONFIG_SECTION_IN_PAGE_FLAGS. Yep, I only put it here because it "sounds" similar to NODE_NOT_IN_PAGE_FLAGS, but (of course) NODE_NOT_IN_PAGE_FLAGS isn't defined based on pure dependencies, while this is. I'm not sure that I like that sort of thing a lot though - it's rather a pain to have to switch from .[ch] over to Kconfig to find the definitions of things. I should get off my tail and teach my ctags scripts to handle this.
Re: [PATCH v2 0/9] mm: zone & pgdat accessors plus some cleanup
On 02/01/2013 04:39 PM, Andrew Morton wrote: On Thu, 17 Jan 2013 14:52:52 -0800 Cody P Schafer wrote: Summaries: 1 - avoid repeating checks for section in page flags by adding a define. 2 - add & switch to zone_end_pfn() and zone_spans_pfn() 3 - adds zone_is_initialized() & zone_is_empty() 4 - adds a VM_BUG using zone_is_initialized() in __free_one_page() 5 - add pgdat_end_pfn() and pgdat_is_empty() 6 - add debugging message to VM_BUG check. 7 - add ensure_zone_is_initialized() (for memory_hotplug) 8 - use the above addition in memory_hotplug 9 - use pgdat_end_pfn() Well that's a nice little patchset. Some of the patches were marked From:c...@linux.vnet.ibm.com and others were From:jmes...@gmail.com. This is strange. If you want me to fix that up, please let me know which is preferred. They should all be "From:c...@linux.vnet.ibm.com"; the other address was me messing up gitconfig (which I've since fixed). As a general concern: spanned_pages & start_pfn (in pgdat & zone) are supposed to be locked (via a seqlock) when read (due to changes to them via memory_hotplug), but very few (only 1?) of their users appear to actually lock them. OK, thanks. Perhaps this is something which the memory-hotplug developers could take a look at? Yep. It's not immediately clear that not locking on read would do terrible things, but at the least the documentation needs fixing and explanation as to why the locking is not used some (or all) places.
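For reference, the seqlock in question is zone->span_seqlock, with accessors in include/linux/memory_hotplug.h. A rough sketch of the intended pairing, assuming CONFIG_MEMORY_HOTPLUG (the writer side is what memory hotplug does when resizing a zone; the reader side is what most readers of these fields currently skip; new_* names are illustrative):

/* writer (memory hotplug resizing a zone): */
zone_span_writelock(zone);
zone->zone_start_pfn = new_start_pfn;
zone->spanned_pages = new_spanned_pages;
zone_span_writeunlock(zone);

/* reader: */
unsigned seq;
unsigned long start_pfn, spanned;
do {
	seq = zone_span_seqbegin(zone);
	start_pfn = zone->zone_start_pfn;
	spanned = zone->spanned_pages;
} while (zone_span_seqretry(zone, seq));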
[PATCH 1/3] mm/page_alloc: factor out setting of pcp->high and pcp->batch.
Creates pageset_set_batch() for use in setup_pageset(). pageset_set_batch() imitates the functionality of setup_pagelist_highmark(), but uses the boot time (percpu_pagelist_fraction == 0) calculations for determining ->high based on ->batch. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8fcced7..5877cf0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4004,6 +4004,14 @@ static int __meminit zone_batchsize(struct zone *zone) #endif } +/* a companion to setup_pagelist_highmark() */ +static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) +{ + struct per_cpu_pages *pcp = &p->pcp; + pcp->high = 6 * batch; + pcp->batch = max(1UL, 1 * batch); +} + static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) { struct per_cpu_pages *pcp; @@ -4013,8 +4021,7 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) pcp = &p->pcp; pcp->count = 0; - pcp->high = 6 * batch; - pcp->batch = max(1UL, 1 * batch); + pageset_set_batch(p, batch); for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) INIT_LIST_HEAD(&pcp->lists[migratetype]); } @@ -4023,7 +4030,6 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist * to the value high for the pageset p. */ - static void setup_pagelist_highmark(struct per_cpu_pageset *p, unsigned long high) { -- 1.8.2
[PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()
No off-cpu users of the percpu pagesets exist. zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a percpu pageset based on a zone's ->managed_pages. We don't need to drain the entire percpu pageset just to modify these fields. Avoid calling setup_pageset() (and the draining required to call it) and instead just set the fields' values. This does change the behavior of zone_pcp_update() as the percpu pagesets will not be drained when zone_pcp_update() is called (they will end up being shrunk, not completely drained, later when a 0-order page is freed in free_hot_cold_page()). Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 30 ++ 1 file changed, 10 insertions(+), 20 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5877cf0..48f2faa 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5987,32 +5987,22 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages) #endif #ifdef CONFIG_MEMORY_HOTPLUG -static int __meminit __zone_pcp_update(void *data) +static void __meminit __zone_pcp_update(void *data) { struct zone *zone = data; - int cpu; - unsigned long batch = zone_batchsize(zone), flags; - - for_each_possible_cpu(cpu) { - struct per_cpu_pageset *pset; - struct per_cpu_pages *pcp; - - pset = per_cpu_ptr(zone->pageset, cpu); - pcp = &pset->pcp; - - local_irq_save(flags); - if (pcp->count > 0) - free_pcppages_bulk(zone, pcp->count, pcp); - drain_zonestat(zone, pset); - setup_pageset(pset, batch); - local_irq_restore(flags); - } - return 0; + unsigned long batch = zone_batchsize(zone); + struct per_cpu_pageset *pset = + per_cpu_ptr(zone->pageset, smp_processor_id()); + pageset_set_batch(pset, batch); } +/* + * The zone indicated has a new number of managed_pages; batch sizes and percpu + * page high values need to be recalculated. + */ void __meminit zone_pcp_update(struct zone *zone) { - stop_machine(__zone_pcp_update, zone, NULL); + on_each_cpu(__zone_pcp_update, zone, true); } #endif -- 1.8.2
[PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.
In free_hot_cold_page(), we rely on pcp->batch remaining stable. Updating it without being on the cpu owning the percpu pageset potentially destroys this stability. Change the for_each_possible_cpu() loop to use on_each_cpu() to fix. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 48f2faa..507db31 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5475,30 +5475,31 @@ int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write, return 0; } +static void _zone_set_pageset_highmark(void *data) +{ + struct zone *zone = data; + unsigned long high; + high = zone->managed_pages / percpu_pagelist_fraction; + setup_pagelist_highmark( + per_cpu_ptr(zone->pageset, smp_processor_id()), high); +} + /* * percpu_pagelist_fraction - changes the pcp->high for each zone on each * cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist * can have before it gets flushed back to buddy allocator. */ - int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { struct zone *zone; - unsigned int cpu; int ret; ret = proc_dointvec_minmax(table, write, buffer, length, ppos); if (!write || (ret < 0)) return ret; - for_each_populated_zone(zone) { - for_each_possible_cpu(cpu) { - unsigned long high; - high = zone->managed_pages / percpu_pagelist_fraction; - setup_pagelist_highmark( - per_cpu_ptr(zone->pageset, cpu), high); - } - } + for_each_populated_zone(zone) + on_each_cpu(_zone_set_pageset_highmark, zone, true); return 0; } -- 1.8.2
[PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch
In one case while modifying the ->high and ->batch fields of per cpu pagesets we're needlessly using stop_machine() (patches 1 & 2), and in another we don't have any synchronization at all (patch 3). This patchset fixes both of them. Note that it results in a change to the behavior of zone_pcp_update(), which is used by memory_hotplug. I _think_ that I've discerned (and preserved) the essential behavior (changing ->high and ->batch), and only eliminated unneeded actions (draining the per cpu pages), but this may not be the case. -- mm/page_alloc.c | 63 +++-- 1 file changed, 30 insertions(+), 33 deletions(-)
[RFC] DNUMA: Runtime NUMA memory layout reconfiguration
These patches allow the NUMA memory layout (meaning the mapping of a page to a node) to be changed at runtime in place (without hotplugging). = Why/when is this useful? = In virtual machines (VMs) running on NUMA systems both [a] if/when the hypervisor decides to move their backing memory around (compacting, prioritizing another VMs desired layout, etc) and [b] in general for migration of VMs. The hardware is _already_ changing the NUMA layout underneath us. We have powerpc64 systems with firmware that currently move the backing memory around, and have the ability to notify Linux of new NUMA info. = Code & testing = web: https://github.com/jmesmon/linux/tree/dnuma/v26 git: https://github.com/jmesmon/linux.git dnuma/v26 commit range: 7e4f3230c9161706ebe9d37d774398082dc352de^..01e16461cf4a914feb1a34ed8dd7b28f3e842645 Some patches are marked "XXX: ...", they are only for testing or temporary documentation purposes. A debugfs interface allows the NUMA memory layout to be changed. Basically, you don't need to have weird systems to test this; in fact, I've done all my testing so far in plain old qemu-i386. A script which stripes the memory between nodes or pushes all memory to a (potentially new) node is available here: https://raw.github.com/jmesmon/trifles/master/bin/dnuma-test Related LSF/MM Topic proposal: http://permalink.gmane.org/gmane.linux.kernel.mm/95342 = How are you managing to do this? = Reconfiguration of page->node mappings is done at the page allocator level by both pulling out free pages (when a new memory layout is committed) & redirecting pages on free to their new node. Because we can't change page_node(A) while A is allocated, an rbtree holding the mapping from pfn ranges to node ids ('struct memlayout') is introduced to track the pfn->node mapping for yet-to-be-transplanted pages (a minimal sketch of this lookup follows at the end of this message). A lookup in this rbtree occurs on any page allocator path that decides which zone to free a page to. To avoid horrible performance due to rbtree lookups all the time, the rbtree is only consulted when the page is marked with a new pageflag (LookupNode). = Current Limitations = For the reconfiguration to be effective (and not make the allocator make poorer choices), updating the cpu->node mappings is also needed. This patchset does _not_ handle this. Also missing is a way to update topology (node distances), which is less fatal. These patches only work on SPARSEMEM and the node id _must_ fit in the pageflags (can't be pushed out to the section). This generally means that 32-bit platforms are out (unless you hack MAX_PHYS{ADDR,MEM}_BITS like I do for testing). This code does the reconfiguration without hotplugging memory at all (1 errant page doesn't keep us from fixing the rest of them). But it still depends on MEMORY_HOTPLUG for functions that online nodes & adjust zone/pgdat size. Things that need doing (but aren't quite bugs): - While the interface is meant to be driven via a hypervisor/firmware, that portion is not yet included. - notifier for kernel users of memory that need/want their allocations on a particular node (NODE_DATA(), for instance). - notifier for userspace. - a way to allocate things from the appropriate node prior to the page allocator being fully updated (could just be "allocate it wrong now & reallocate later"). - Make memlayout faster (potentially via per-node allocation, different data structure, or more/smarter caching). - (potentially) propagation of updated layout knowledge into kmem_caches (SL*B).
Known Bugs: - Transplant of free pages is _very_ slow due to excessive use of stop_machine() via zone_pcp_update() & build_zonelists(). On my i5 laptop, it takes ~9 minutes to stripe the layout in blocks of 256 pfns to 3 nodes on a 128MB 8 cpu x86_32 VM booted with 2 nodes. - memory leak when SLUB is used (struct kmem_cache_nodes are leaked), SLAB appears fine. - Locking of managed_pages/present_pages needs adjustment, or they need to be updated outside of the free page path. - Exported numa/memory info in sysfs isn't updated (`numactl --show` segfaults, `numactl --hardware` shows new nodes as nearly empty). - Uses pageflag setters without "owning" pages, could cause loss of pageflag updates when combined with non-atomic pageflag users in mm/*. - some strange sleeps while atomic (for me they occur when memory is moved out of all the boot nodes)
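To make the memlayout structure described above concrete, here is a minimal sketch of the pfn-range -> node lookup it implies. The identifiers (struct memlayout_range, memlayout_pfn_to_nid) are illustrative stand-ins, not the actual names in the dnuma tree:

#include <linux/rbtree.h>
#include <linux/numa.h>		/* NUMA_NO_NODE */

struct memlayout_range {
	struct rb_node node;
	unsigned long pfn_start;
	unsigned long pfn_end;	/* inclusive */
	int nid;
};

/* ranges are non-overlapping, keyed by pfn_start */
static int memlayout_pfn_to_nid(struct rb_root *root, unsigned long pfn)
{
	struct rb_node *rb = root->rb_node;

	while (rb) {
		struct memlayout_range *r =
			rb_entry(rb, struct memlayout_range, node);

		if (pfn < r->pfn_start)
			rb = rb->rb_left;
		else if (pfn > r->pfn_end)
			rb = rb->rb_right;
		else
			return r->nid;
	}
	return NUMA_NO_NODE;
}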
Re: [PATCHv5 4/8] zswap: add to mm/
On 02/18/2013 11:24 AM, Seth Jennings wrote: On 02/15/2013 10:04 PM, Ric Mason wrote: On 02/14/2013 02:38 AM, Seth Jennings wrote: +/* invalidates all pages for the given swap type */ +static void zswap_frontswap_invalidate_area(unsigned type) +{ +struct zswap_tree *tree = zswap_trees[type]; +struct rb_node *node, *next; +struct zswap_entry *entry; + +if (!tree) +return; + +/* walk the tree and free everything */ +spin_lock(&tree->lock); +node = rb_first(&tree->rbroot); +while (node) { +entry = rb_entry(node, struct zswap_entry, rbnode); +zs_free(tree->pool, entry->handle); +next = rb_next(node); +zswap_entry_cache_free(entry); +node = next; +} +tree->rbroot = RB_ROOT; Why don't we need rb_erase() for every node? We are freeing the entire tree here. try_to_unuse() in the swapoff syscall should have already emptied the tree, but this is here for completeness. rb_erase() will do things like rebalancing the tree; something that just wastes time since we are in the process of freeing the whole tree. We are holding the tree lock here so we are sure that no one else is accessing the tree while it is in this transient broken state. If we have a sub-tree like:

    ...
    /
   A
  / \
 B   C

B == rb_first(tree), A == rb_next(B), C == rb_next(A). The current code frees A (via zswap_entry_cache_free()) prior to examining C, and thus rb_next(C) results in a use-after-free of A. You can solve this by doing a post-order traversal of the tree, either a) in the destructive manner used in a number of filesystems, see fs/ubifs/orphan.c ubifs_add_orphan(), for example. b) or by doing something similar to this commit: https://github.com/jmesmon/linux/commit/d9e43aaf9e8a447d6802531d95a1767532339fad , which I've been using for some yet-to-be-merged code.
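A minimal sketch of option (a), the destructive walk, adapted to the zswap types quoted above (the function name is a placeholder; locking and the stats update are omitted):

static void zswap_tree_free_all(struct zswap_tree *tree)
{
	struct rb_node *node = tree->rbroot.rb_node;

	while (node) {
		if (node->rb_left)
			node = node->rb_left;
		else if (node->rb_right)
			node = node->rb_right;
		else {
			/* leaf: detach from parent, free, walk back up */
			struct rb_node *parent = rb_parent(node);
			struct zswap_entry *entry =
				rb_entry(node, struct zswap_entry, rbnode);

			if (parent) {
				if (parent->rb_left == node)
					parent->rb_left = NULL;
				else
					parent->rb_right = NULL;
			}
			zs_free(tree->pool, entry->handle);
			zswap_entry_cache_free(entry);
			node = parent;
		}
	}
	tree->rbroot = RB_ROOT;
}

Since a node is only freed after both of its children are gone, no freed node is ever dereferenced again.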
Re: [PATCHv6 1/8] zsmalloc: add to mm/
On Wed, Feb 20, 2013 at 04:04:41PM -0600, Seth Jennings wrote: > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c > +#define MAX(a, b) ((a) >= (b) ? (a) : (b)) > +/* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */ > +#define ZS_MIN_ALLOC_SIZE \ > + MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS)) Could you use the max(a,b) defined in include/linux/kernel.h?
[RESEND PATCH] tools/perf: fix build on non-glibc systems due to libio.h absence
Including libio.h causes build failures on uClibc systems (which lack libio.h). It appears that libio.h was only included to pull in a definition for NULL, so it has been replaced by stddef.h. On powerpc, libio.h was conditionally included, but could be removed completely as it is unneeded. Also, the include of stdlib.h was changed to stddef.h (as again, only NULL is needed). Signed-off-by: Cody P Schafer --- tools/perf/arch/arm/util/dwarf-regs.c | 5 + tools/perf/arch/powerpc/util/dwarf-regs.c | 5 + tools/perf/arch/s390/util/dwarf-regs.c| 2 +- tools/perf/arch/sh/util/dwarf-regs.c | 2 +- tools/perf/arch/sparc/util/dwarf-regs.c | 2 +- tools/perf/arch/x86/util/dwarf-regs.c | 2 +- 6 files changed, 6 insertions(+), 12 deletions(-) diff --git a/tools/perf/arch/arm/util/dwarf-regs.c b/tools/perf/arch/arm/util/dwarf-regs.c index e8d5c55..33ec5b3 100644 --- a/tools/perf/arch/arm/util/dwarf-regs.c +++ b/tools/perf/arch/arm/util/dwarf-regs.c @@ -8,10 +8,7 @@ * published by the Free Software Foundation. */ -#include <stdlib.h> -#ifndef __UCLIBC__ -#include <libio.h> -#endif +#include <stddef.h> #include <dwarf-regs.h> struct pt_regs_dwarfnum { diff --git a/tools/perf/arch/powerpc/util/dwarf-regs.c b/tools/perf/arch/powerpc/util/dwarf-regs.c index 7cdd61d..733151c 100644 --- a/tools/perf/arch/powerpc/util/dwarf-regs.c +++ b/tools/perf/arch/powerpc/util/dwarf-regs.c @@ -9,10 +9,7 @@ * 2 of the License, or (at your option) any later version. */ -#include <stdlib.h> -#ifndef __UCLIBC__ -#include <libio.h> -#endif +#include <stddef.h> #include <dwarf-regs.h> diff --git a/tools/perf/arch/s390/util/dwarf-regs.c b/tools/perf/arch/s390/util/dwarf-regs.c index e19653e..0469df0 100644 --- a/tools/perf/arch/s390/util/dwarf-regs.c +++ b/tools/perf/arch/s390/util/dwarf-regs.c @@ -6,7 +6,7 @@ * */ -#include <libio.h> +#include <stddef.h> #include <dwarf-regs.h> #define NUM_GPRS 16 diff --git a/tools/perf/arch/sh/util/dwarf-regs.c b/tools/perf/arch/sh/util/dwarf-regs.c index a11edb0..0d0897f 100644 --- a/tools/perf/arch/sh/util/dwarf-regs.c +++ b/tools/perf/arch/sh/util/dwarf-regs.c @@ -19,7 +19,7 @@ * */ -#include <libio.h> +#include <stddef.h> #include <dwarf-regs.h> /* diff --git a/tools/perf/arch/sparc/util/dwarf-regs.c b/tools/perf/arch/sparc/util/dwarf-regs.c index 0ab8848..92eda41 100644 --- a/tools/perf/arch/sparc/util/dwarf-regs.c +++ b/tools/perf/arch/sparc/util/dwarf-regs.c @@ -9,7 +9,7 @@ * 2 of the License, or (at your option) any later version. */ -#include <libio.h> +#include <stddef.h> #include <dwarf-regs.h> #define SPARC_MAX_REGS 96 diff --git a/tools/perf/arch/x86/util/dwarf-regs.c b/tools/perf/arch/x86/util/dwarf-regs.c index a794d30..be22dd4 100644 --- a/tools/perf/arch/x86/util/dwarf-regs.c +++ b/tools/perf/arch/x86/util/dwarf-regs.c @@ -20,7 +20,7 @@ * */ -#include <libio.h> +#include <stddef.h> #include <dwarf-regs.h> /* -- 1.8.1.5
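As a sanity check on the reasoning (only NULL is needed, and stddef.h is a freestanding header that every C implementation, including uClibc toolchains, must provide), something of this shape builds with no libc-specific headers at all. The table is a made-up stand-in for the per-arch register tables:

#include <stddef.h>	/* NULL -- guaranteed even in freestanding C */

static const char * const regstr_tbl[] = {
	"%r0", "%r1", NULL,	/* NULL used as a sentinel purely for illustration */
};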
Re: [PATCH] mm/mempolicy: use NUMA_NO_NODE
@@ -1802,11 +1802,11 @@ static inline unsigned interleave_nid(struct mempolicy *pol, /* * Return the bit number of a random bit set in the nodemask. - * (returns -1 if nodemask is empty) + * (returns NUMA_NO_NOD if nodemask is empty) s/NUMA_NO_NOD/NUMA_NO_NODE/ */ int node_random(const nodemask_t *maskp) { - int w, bit = -1; + int w, bit = NUMA_NO_NODE; w = nodes_weight(*maskp); if (w)
Re: [RFC][PATCH] mm: percpu pages: up batch size to fix arithmetic?? errror
On 09/11/2013 03:08 PM, Dave Hansen wrote: I really don't know where the: batch /= 4; /* We effectively *= 4 below */ ... batch = rounddown_pow_of_two(batch + batch/2) - 1; came from. The round down code at *MOST* does a *= 1.5, but *averages* out to be just under 1. On a system with 128GB in a zone, this means that we've got (you can see in /proc/zoneinfo for yourself): high: 186 (744kB) batch: 31 (124kB) That 124kB is almost precisely 1/4 of the "1/2 of a meg" that we were shooting for. We're under-sizing the batches by about 4x. This patch kills the /=4. --- diff -puN mm/page_alloc.c~debug-pcp-sizes-1 mm/page_alloc.c --- linux.git/mm/page_alloc.c~debug-pcp-sizes-1 2013-09-11 14:41:08.532445664 -0700 +++ linux.git-davehans/mm/page_alloc.c 2013-09-11 15:03:47.403912683 -0700 @@ -4103,7 +4103,6 @@ static int __meminit zone_batchsize(stru batch = zone->managed_pages / 1024; if (batch * PAGE_SIZE > 512 * 1024) batch = (512 * 1024) / PAGE_SIZE; - batch /= 4; /* We effectively *= 4 below */ if (batch < 1) batch = 1; _ Looking back at the first git commit (way before my time), it appears that the percpu pagesets initially had a ->high and ->low (now removed), set to batch*6 and batch*2 respectively. I assume the idea was to keep the number of pages in the percpu pagesets around batch*4, hence the comment. So we have this variable called "batch", and the code is trying to store the _average_ number of pcp pages we want into it (not the batchsize), and then we divide our "average" goal by 4 to get a batchsize. All the comments refer to the size of the pcp pagesets, not to the pcp pageset batchsize. Looking further, in current code we don't refill the pcp pagesets unless they are completely empty (->low was removed a while ago), and then we only add ->batch pages. Has anyone looked at what type of average pcp sizing the current code results in?
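To check the arithmetic against Dave's numbers above (assuming 4KB pages):

  managed_pages = 128GB / 4KB                  = 33,554,432
  batch = managed_pages / 1024                 = 32,768
  32,768 * 4KB > 512KB, so batch = 512KB / 4KB = 128
  batch /= 4                                   -> 32
  rounddown_pow_of_two(32 + 32/2) - 1 = 32 - 1 = 31    (matches "batch: 31")
  high = 6 * batch = 6 * 31                    = 186   (matches "high: 186")

With the /= 4 removed, batch would instead land at rounddown_pow_of_two(128 + 64) - 1 = 127, i.e. almost exactly the 512KB target.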
Re: [RFC][PATCH] mm: percpu pages: up batch size to fix arithmetic?? errror
On 09/11/2013 04:08 PM, Cody P Schafer wrote: On 09/11/2013 03:08 PM, Dave Hansen wrote: I really don't know where the: batch /= 4; /* We effectively *= 4 below */ ... batch = rounddown_pow_of_two(batch + batch/2) - 1; came from. The round down code at *MOST* does a *= 1.5, but *averages* out to be just under 1. On a system with 128GB in a zone, this means that we've got (you can see in /proc/zoneinfo for yourself): high: 186 (744kB) batch: 31 (124kB) That 124kB is almost precisely 1/4 of the "1/2 of a meg" that we were shooting for. We're under-sizing the batches by about 4x. This patch kills the /=4. --- diff -puN mm/page_alloc.c~debug-pcp-sizes-1 mm/page_alloc.c --- linux.git/mm/page_alloc.c~debug-pcp-sizes-1 2013-09-11 14:41:08.532445664 -0700 +++ linux.git-davehans/mm/page_alloc.c 2013-09-11 15:03:47.403912683 -0700 @@ -4103,7 +4103,6 @@ static int __meminit zone_batchsize(stru batch = zone->managed_pages / 1024; if (batch * PAGE_SIZE > 512 * 1024) batch = (512 * 1024) / PAGE_SIZE; - batch /= 4; /* We effectively *= 4 below */ if (batch < 1) batch = 1; _ Looking back at the first git commit (way before my time), it appears that the percpu pagesets initially had a ->high and ->low (now removed), set to batch*6 and batch*2 respectively. I assume the idea was to keep the number of pages in the percpu pagesets around batch*4, hence the comment. So we have this variable called "batch", and the code is trying to store the _average_ number of pcp pages we want into it (not the batchsize), and then we divide our "average" goal by 4 to get a batchsize. All the comments refer to the size of the pcp pagesets, not to the pcp pageset batchsize. Looking further, in current code we don't refill the pcp pagesets unless they are completely empty (->low was removed a while ago), and then we only add ->batch pages. Has anyone looked at what type of average pcp sizing the current code results in? Also, we may want to consider shrinking pcp->high down from 6*pcp->batch given that the original "6*" choice was based upon ->batch actually being 1/4th of the average pageset size, where now it appears closer to being the average.
Re: [PATCHv4 02/10] mm: convert mm->nr_ptes to atomic_t
On 09/27/2013 06:16 AM, Kirill A. Shutemov wrote: With split page table lock for PMD level we can't hold mm->page_table_lock while updating nr_ptes. Let's convert it to atomic_t to avoid races. --- diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 84e0c56e1e..99f19e850d 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -339,6 +339,7 @@ struct mm_struct { pgd_t * pgd; atomic_t mm_users; /* How many users with user space? */ atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ + atomic_t nr_ptes; /* Page table pages */ int map_count; /* number of VMAs */ spinlock_t page_table_lock; /* Protects page tables and some counters */ @@ -360,7 +361,6 @@ struct mm_struct { unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE */ unsigned long stack_vm; /* VM_GROWSUP/DOWN */ unsigned long def_flags; - unsigned long nr_ptes; /* Page table pages */ unsigned long start_code, end_code, start_data, end_data; unsigned long start_brk, brk, start_stack; unsigned long arg_start, arg_end, env_start, env_end; Will 32 bits always be enough here? Should atomic_long_t be used instead?
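A back-of-the-envelope bound (assuming x86-64 with 4KB pages, i.e. 512 PTEs per page-table page) suggests why 32 bits is probably still enough in practice:

  overflowing a 32-bit count needs ~2^31 page-table pages
    -> 2^31 * 4KB = 8TB of RAM consumed by page tables alone
    -> 2^31 * 2MB = 4PB of address space mapped by a single mm,
       well past the 48-bit (256TB) virtual address space

so the question is mainly one of future-proofing and unusual configurations, not a bound that can be hit today.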
[PATCH] rcu: comment: correct 'optimized' to 'optimize'.
Small grammar fix in the rcutree comment regarding the 'rcu_scheduler_active' var. --- kernel/rcutree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcutree.c b/kernel/rcutree.c index e441b77..bfb8972 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -105,7 +105,7 @@ int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */ * The rcu_scheduler_active variable transitions from zero to one just * before the first task is spawned. So when this variable is zero, RCU * can assume that there is but one task, allowing RCU to (for example) - * optimized synchronize_sched() to a simple barrier(). When this variable + * optimize synchronize_sched() to a simple barrier(). When this variable * is one, RCU must actually do all the hard work required to detect real * grace periods. This variable is also used to suppress boot-time false * positives from lockdep-RCU error checking. -- 1.8.0.3
[PATCH] powerpc/mm: eliminate unneeded for_each_memblock
The only persistent change made by this loop is calling memblock_set_node() once for each memblock, which is not useful (and has no effect) as memblock_set_node() is not called with any memblock-specific parameters. Substitute a single memblock_set_node() call. --- arch/powerpc/mm/mem.c | 9 ++--- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index 0dba506..50370bb 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -195,13 +195,8 @@ void __init do_init_bootmem(void) min_low_pfn = MEMORY_START >> PAGE_SHIFT; boot_mapsize = init_bootmem_node(NODE_DATA(0), start >> PAGE_SHIFT, min_low_pfn, max_low_pfn); - /* Add active regions with valid PFNs */ - for_each_memblock(memory, reg) { - unsigned long start_pfn, end_pfn; - start_pfn = memblock_region_memory_base_pfn(reg); - end_pfn = memblock_region_memory_end_pfn(reg); - memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0); - } + /* Place all memblock_regions in the same node and merge contiguous memblock_regions */ + memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0); /* Add all physical memory to the bootmem map, mark each area * present. -- 1.8.0.3
Re: [PATCH] mm/page_alloc: don't re-init pageset in zone_pcp_update()
On 06/12/2013 02:20 PM, Andrew Morton wrote: On Tue, 11 Jun 2013 15:12:59 -0700 Cody P Schafer wrote: Factor pageset_set_high_and_batch() (which contains all needed logic to set a pageset's ->high and ->batch irrespective of system state) out of zone_pageset_init(), which avoids us calling pageset_init(), and unsafely blowing away a pageset at runtime (leaked pages and potentially some funky allocations would be the result) when memory hotplug is triggered. This changelog is pretty screwed up :( It tells us what the patch does but not why it does it. It says why it does it, though perhaps a bit hidden: > avoids us calling pageset_init(), and unsafely blowing away a pageset Signed-off-by: Cody P Schafer --- Unless memory hotplug is being triggered on boot, this should *not* be the cause of Valdis Kletnieks' reported bug in -next: "next-20130607 BUG: Bad page state in process systemd pfn:127643" And this addendum appears to hint at the info we need. Note the *not*. I included this note only because I expected there would be a question of whether Valdis's reported bug was caused by this. It _isn't_. The bug this fix fixes is only triggered by memory_hotplug, and Valdis's bug occurred on boot. Please, send a new changelog? That should include a description of the user-visible effects of the bug which is being fixed, a description of why it occurs and a description of how it was fixed. It would also be helpful if you can identify which kernel version(s) need the fix. It's just a -mm issue. It was introduced by my patchset starting with "mm/page_alloc: factor out setting of pcp->high and pcp->batch", where the actual place the bug snuck in was "mm/page_alloc: in zone_pcp_update(), uze zone_pageset_init()". Also, a Reported-by: Valdis would be appropriate. I'm fine with adding it (I did take a look at my page_alloc.c changes because he reported that bug), but as mentioned before, this fixes a different bug. Anyhow, a reorganized (and clearer) changelog with the same content follows: --- mm/page_alloc: don't re-init pageset in zone_pcp_update() When memory hotplug is triggered, we call pageset_init() on a per-cpu-pageset which both contains pages and is in use, causing both the leakage of those pages and (potentially) bad behaviour if a page is allocated from the pageset while it is being cleared. Avoid this by factoring pageset_set_high_and_batch() (which contains all needed logic to set a pageset's ->high and ->batch irrespective of system state), and using that instead of zone_pageset_init() in zone_pcp_update(). Signed-off-by: Cody P Schafer
Re: [PATCH] mm/page_alloc: don't re-init pageset in zone_pcp_update()
Anyhow, a reorganized (and clearer) changelog with the same content follows: --- I made a few wording tweaks: --- mm/page_alloc: don't re-init pageset in zone_pcp_update() When memory hotplug is triggered, we call pageset_init() on per-cpu-pagesets which both contain pages and are in use, causing both the leakage of those pages and (potentially) bad behaviour if a page is allocated from a pageset while it is being cleared. Avoid this by factoring out pageset_set_high_and_batch() (which contains all needed logic to set a pageset's ->high and ->batch irrespective of system state) from zone_pageset_init() and using the new pageset_set_high_and_batch() instead of zone_pageset_init() in zone_pcp_update(). Signed-off-by: Cody P Schafer
Re: [PATCH updated] drivers/base: Use attribute groups to create sysfs memory files
On 05/28/2013 10:17 PM, Nathan Fontenot wrote: Update the sysfs memory code to create/delete files at the time of device and subsystem registration. The current code creates files in the root memory directory explicitly through the use of init_* routines. The files for each memory block are created and deleted explicitly using the mem_[create|delete]_simple_file macros. This patch creates attribute groups for the memory root files and files in each memory block directory so that they are created and deleted implicitly at subsys and device register and unregister time. This did necessitate moving the register_memory() routine and updating it to set the dev.groups field. Signed-off-by: Nathan Fontenot Updated to apply cleanly to rc2. Please cc me on responses/comments. --- drivers/base/memory.c | 143 +- 1 file changed, 62 insertions(+), 81 deletions(-) Index: linux/drivers/base/memory.c === --- linux.orig/drivers/base/memory.c 2013-05-28 22:53:58.0 -0500 +++ linux/drivers/base/memory.c 2013-05-28 22:56:49.0 -0500 These changes look good, but this email doesn't play nice with `git am`. ex: "fatal: corrupt patch at line 80" There is nothing particularly bad about line 80. Please fix and resend (git format-patch generally gets this right, maybe use that?)
[PATCH] mm: sparse: use __aligned() instead of manual padding in mem_section
Instead of leaving a trap for the next person who comes along and wants to add something to mem_section, add an __aligned() and remove the manual padding added for MEMCG. Signed-off-by: Cody P Schafer --- include/linux/mmzone.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) --- Also, does anyone know what causes this alignment to be required here? I found this was breaking things in a patchset I'm working on (WARNs in sysfs code about duplicate filenames when initing mem_sections). Adding some documentation for the reason would be appreciated. diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 131989a..a8e8056 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1125,9 +1125,8 @@ struct mem_section { * section. (see memcontrol.h/page_cgroup.h about this.) */ struct page_cgroup *page_cgroup; - unsigned long pad; #endif -}; +} __aligned(2 * sizeof(unsigned long)); #ifdef CONFIG_SPARSEMEM_EXTREME #define SECTIONS_PER_ROOT (PAGE_SIZE / sizeof (struct mem_section)) -- 1.8.3
Re: [PATCH] mm: sparse: use __aligned() instead of manual padding in mem_section
On 05/29/2013 05:54 PM, Jiang Liu wrote: On Thu 30 May 2013 07:14:39 AM CST, Cody P Schafer wrote: Also, does anyone know what causes this alignment to be required here? I found this was breaking things in a patchset I'm working on (WARNs in sysfs code about duplicate filenames when initing mem_sections). Adding some documentation for the reason would be appreciated. Hi Cody, I think the alignment requirement is caused by the way the mem_section array is organized. Basically it requires that PAGE_SIZE be evenly divisible by sizeof(struct mem_section). So your change seems risky too, because the structure should be aligned to a power of two rather than to 2 * sizeof(long). Well, if that's the case then this patch is wrong, and manual padding may be the only way to go. :(
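A worked example of the constraint, using the SECTIONS_PER_ROOT definition quoted earlier and assuming PAGE_SIZE = 4096:

  sizeof(struct mem_section) = 32 -> SECTIONS_PER_ROOT = 128 (a power of two)
    SECTION_ROOT_MASK = 127 = 0b1111111, so (sec & SECTION_ROOT_MASK)
    == (sec % SECTIONS_PER_ROOT) for every section number: correct.

  sizeof(struct mem_section) = 40 -> SECTIONS_PER_ROOT = 4096/40 = 102
    SECTION_ROOT_MASK = 101 = 0b1100101, which is not a modulo mask;
    e.g. sec = 27 gives 27 % 102 = 27 but 27 & 101 = 1, indexing the
    wrong mem_section within a root -- consistent with the
    duplicate-filename WARNs seen from sysfs.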
[PATCH] sparsemem: BUILD_BUG_ON when sizeof mem_section is non-power-of-2
Instead of leaving a hidden trap for the next person who comes along and wants to add something to mem_section, add a big fat warning about it needing to be a power-of-2, and insert a BUILD_BUG_ON() in sparse_init() to catch mistakes. Right now non-power-of-2 mem_sections cause a number of WARNs at boot (which don't clearly point to the size of mem_section as an issue), but the system limps on (temporarily, at least). This is based upon Dave Hansen's earlier RFC where he ran into the same issue: "sparsemem: fix boot when SECTIONS_PER_ROOT is not power-of-2" http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/03077.html Signed-off-by: Cody P Schafer --- Dave: Consider it resurrected. --- include/linux/mmzone.h | 4 mm/sparse.c| 3 +++ 2 files changed, 7 insertions(+) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 131989a..88e23f3 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1127,6 +1127,10 @@ struct mem_section { struct page_cgroup *page_cgroup; unsigned long pad; #endif + /* +* WARNING: mem_section must be a power-of-2 in size for the +* calculation and use of SECTION_ROOT_MASK to make sense. +*/ }; #ifdef CONFIG_SPARSEMEM_EXTREME diff --git a/mm/sparse.c b/mm/sparse.c index 1c91f0d3..3194ec4 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -481,6 +481,9 @@ void __init sparse_init(void) struct page **map_map; #endif + /* see include/linux/mmzone.h 'struct mem_section' definition */ + BUILD_BUG_ON(!is_power_of_2(sizeof(struct mem_section))); + /* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */ set_pageblock_order(); -- 1.8.3
Re: [PATCH v2] drivers/base: Use attribute groups to create sysfs memory files
On 05/31/2013 01:31 PM, Nathan Fontenot wrote: Update the sysfs memory code to create/delete files at the time of device and subsystem registration. The current code creates files in the root memory directory explicitly through the use of init_* routines. The files for each memory block are created and deleted explicitly using the mem_[create|delete]_simple_file macros. This patch creates attribute groups for the memory root files and files in each memory block directory so that they are created and deleted implicitly at subsys and device register and unregister time. This did necessitate moving the register_memory() routine and updating it to set the dev.groups field. Signed-off-by: Nathan Fontenot Please cc me on responses/comments. v2: refreshed the patch, previous version was corrupted. There is no difference otherwise between this patch and the previous one sent out. Still looks broken. Tabs have been converted into spaces, and the top of the patch is whitespace padded to 80 characters.
[PATCH] mm/page_alloc: don't re-init pageset in zone_pcp_update()
Factor pageset_set_high_and_batch() (which contains all needed logic to set a pageset's ->high and ->batch irrespective of system state) out of zone_pageset_init(), which avoids us calling pageset_init(), and unsafely blowing away a pageset at runtime (leaked pages and potentially some funky allocations would be the result) when memory hotplug is triggered. Signed-off-by: Cody P Schafer --- Unless memory hotplug is being triggered on boot, this should *not* be the cause of Valdis Kletnieks' reported bug in -next: "next-20130607 BUG: Bad page state in process systemd pfn:127643" --- mm/page_alloc.c | 17 - 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 18102e1..f62c7ac 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4111,11 +4111,9 @@ static void pageset_set_high(struct per_cpu_pageset *p, pageset_update(&p->pcp, high, batch); } -static void __meminit zone_pageset_init(struct zone *zone, int cpu) +static void __meminit pageset_set_high_and_batch(struct zone *zone, + struct per_cpu_pageset *pcp) { - struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu); - - pageset_init(pcp); if (percpu_pagelist_fraction) pageset_set_high(pcp, (zone->managed_pages / @@ -4124,6 +4122,14 @@ static void __meminit zone_pageset_init(struct zone *zone, int cpu) pageset_set_batch(pcp, zone_batchsize(zone)); } +static void __meminit zone_pageset_init(struct zone *zone, int cpu) +{ + struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu); + + pageset_init(pcp); + pageset_set_high_and_batch(zone, pcp); +} + static void __meminit setup_zone_pageset(struct zone *zone) { int cpu; @@ -6173,7 +6179,8 @@ void __meminit zone_pcp_update(struct zone *zone) unsigned cpu; mutex_lock(&pcp_batch_high_lock); for_each_possible_cpu(cpu) - zone_pageset_init(zone, cpu); + pageset_set_high_and_batch(zone, + per_cpu_ptr(zone->pageset, cpu)); mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.3
Re: [PATCH v4 net-next] net: poll/select low latency socket support
On 06/24/2013 12:28 AM, Eliezer Tamir wrote: select/poll busy-poll support. ... diff --git a/fs/select.c b/fs/select.c index 8c1c96c..79b876e 100644 --- a/fs/select.c +++ b/fs/select.c @@ -400,6 +402,8 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time) poll_table *wait; int retval, i, timed_out = 0; unsigned long slack = 0; + unsigned int ll_flag = POLL_LL; + u64 ll_time = ll_end_time(); rcu_read_lock(); retval = max_select_fd(n, fds); @@ -750,6 +768,8 @@ static int do_poll(unsigned int nfds, struct poll_list *list, ktime_t expire, *to = NULL; int timed_out = 0, count = 0; unsigned long slack = 0; + unsigned int ll_flag = POLL_LL; + u64 ll_time = ll_end_time(); /* Optimise the no-wait case */ if (end_time && !end_time->tv_sec && !end_time->tv_nsec) { diff --git a/include/net/ll_poll.h b/include/net/ll_poll.h index fcc7c36..5bf2b3a 100644 --- a/include/net/ll_poll.h +++ b/include/net/ll_poll.h @@ -38,17 +39,18 @@ extern unsigned int sysctl_net_ll_poll __read_mostly; /* we can use sched_clock() because we don't care much about precision * we only care that the average is bounded + * we don't mind a ~2.5% imprecision so <<10 instead of *1000 + * sk->sk_ll_usec is a u_int so this can't overflow */ -static inline u64 ll_end_time(struct sock *sk) +static inline u64 ll_sk_end_time(struct sock *sk) { - u64 end_time = ACCESS_ONCE(sk->sk_ll_usec); - - /* we don't mind a ~2.5% imprecision -* sk->sk_ll_usec is a u_int so this can't overflow -*/ - end_time = (end_time << 10) + sched_clock(); + return ((u64)ACCESS_ONCE(sk->sk_ll_usec) << 10) + sched_clock(); +} - return end_time; +/* in poll/select we use the global sysctl_net_ll_poll value */ +static inline u64 ll_end_time(void) +{ + return ((u64)ACCESS_ONCE(sysctl_net_ll_poll) << 10) + sched_clock(); } static inline bool sk_valid_ll(struct sock *sk) @@ -62,10 +64,13 @@ static inline bool can_poll_ll(u64 end_time) return !time_after64(sched_clock(), end_time); } +/* when used in sock_poll() nonblock is known at compile time to be true + * so the loop and end_time will be optimized out + */ static inline bool sk_poll_ll(struct sock *sk, int nonblock) { + u64 end_time = nonblock ? 0 : ll_sk_end_time(sk); const struct net_device_ops *ops; - u64 end_time = ll_end_time(sk); struct napi_struct *napi; int rc = false; I'm seeing warnings about using smp_processor_id() while preemptable (log included below) due to this patch. I expect the use of ll_end_time() -> sched_clock() here is triggering this. Apologies if this has already been noted. -- # [3.114452] BUG: using smp_processor_id() in preemptible [] code: sh/62 [3.117970] caller is native_sched_clock+0x20/0x80 [3.120303] CPU: 0 PID: 62 Comm: sh Not tainted 3.10.0-rc6-dnuma-01032-g2d48d67 #21 [3.123710] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [3.128616] 880002b6baf0 813c07d0 880002b6bb08 [3.135055] 811ff835 0004d076eeed 880002b6bb20 81009ac0 [3.137359] 880002b6bb30 81009b29 880002b6bf40 [3.138954] Call Trace: [3.139466] [] dump_stack+0x19/0x1b [3.140559] [] debug_smp_processor_id+0xd5/0xf0 [3.141831] [] native_sched_clock+0x20/0x80 [3.143031] [] sched_clock+0x9/0x10 [3.144127] [] do_sys_poll+0x1f6/0x500 [3.145239] [] ? sched_clock+0x9/0x10 [3.146335] [] ? native_sched_clock+0x20/0x80 [3.147557] [] ? sched_clock_local+0x1d/0x90 [3.148816] [] ? native_sched_clock+0x20/0x80 [3.150007] [] ? sched_clock+0x9/0x10 [3.151090] [] ? sched_clock_local+0x1d/0x90 [3.152419] [] ? native_sched_clock+0x20/0x80 [3.153638] [] ? native_sched_clock+0x20/0x80 [3.154865] [] ? 
sched_clock+0x9/0x10 [3.155961] [] ? sched_clock_local+0x1d/0x90 [3.157230] [] ? sched_clock_cpu+0xa8/0x100 [3.158433] [] ? SyS_getdents64+0x110/0x110 [3.159628] [] ? native_sched_clock+0x20/0x80 [3.160916] [] ? sched_clock+0x9/0x10 [3.162003] [] ? sched_clock_local+0x1d/0x90 [3.163207] [] ? sched_clock_cpu+0xa8/0x100 [3.164427] [] ? get_lock_stats+0x19/0x60 [3.165580] [] ? put_lock_stats.isra.28+0xe/0x40 [3.166856] [] ? __mutex_unlock_slowpath+0x105/0x1a0 [3.168412] [] ? trace_hardirqs_on_caller+0x105/0x1d0 [3.169944] [] ? trace_hardirqs_on+0xd/0x10 [3.171155] [] ? mutex_unlock+0x9/0x10 [3.172355] [] ? tty_ioctl+0xa53/0xd40 [3.173483] [] ? lock_release_non_nested+0x308/0x350 [3.174848] [] ? __lock_acquire+0x3d6/0xb70 [3.176087] [] ? trace_hardirqs_on_caller+0x105/0x1d0 [3.177466] [] ? do_vfs_ioctl+0x305/0x510 [3.178629] [] ? sysret_check+0x22/0x5d [3.179764] [] ? trace_hardirqs_on_caller+0x105/0x1d0 [
Re: [PATCH v4 net-next] net: poll/select low latency socket support
On 06/27/2013 05:25 PM, Cody P Schafer wrote: On 06/24/2013 12:28 AM, Eliezer Tamir wrote: select/poll busy-poll support. ... I'm seeing warnings about using smp_processor_id() while preemptable (log included below) due to this patch. I expect the use of ll_end_time() -> sched_clock() here is triggering this. Apologies if this has already been noted. To be clear, given how sched_clock() is used here the BUG appears invalid, but we need a way to mark this as OK for the smp_processor_id() checks so we don't get BUG spam. -- # [3.114452] BUG: using smp_processor_id() in preemptible [] code: sh/62 ...
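One possible way to get that (an assumption on my part, not something settled in this thread) is local_clock(), which wraps the sched_clock machinery with its own preemption handling and is safe to call from preemptible context. The poll/select helper would then look roughly like:

/* sketch only -- include/net/ll_poll.h */
static inline u64 ll_end_time(void)
{
	/* local_clock() disables preemption internally around the
	 * per-cpu clock read, so no smp_processor_id() warning */
	return ((u64)ACCESS_ONCE(sysctl_net_ll_poll) << 10) + local_clock();
}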
[PATCH] trivial: mmzone: correct "pags" to "pages" in comment.
Signed-off-by: Cody P Schafer --- include/linux/mmzone.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index c74092e..5c76737 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -450,7 +450,7 @@ struct zone { * * present_pages is physical pages existing within the zone, which * is calculated as: -* present_pages = spanned_pages - absent_pages(pags in holes); +* present_pages = spanned_pages - absent_pages(pages in holes); * * managed_pages is present pages managed by the buddy system, which * is calculated as (reserved_pages includes pages allocated by the -- 1.8.2
[PATCH 1/3] page_alloc: make setup_nr_node_ids() usable for arch init code
powerpc and x86 were opencoding copies of setup_nr_node_ids(), which page_alloc provides but makes static. Make it available to the archs in linux/mm.h. Signed-off-by: Cody P Schafer --- include/linux/mm.h | 6 ++ mm/page_alloc.c| 6 +- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 7acc9dc..3405405 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1755,5 +1755,11 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; } static inline bool page_is_guard(struct page *page) { return false; } #endif /* CONFIG_DEBUG_PAGEALLOC */ +#if MAX_NUMNODES > 1 +void __init setup_nr_node_ids(void); +#else +static inline void setup_nr_node_ids(void) {} +#endif + #endif /* __KERNEL__ */ #endif /* _LINUX_MM_H */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8fcced7..96909bb 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4710,7 +4710,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size, /* * Figure out the number of possible node ids. */ -static void __init setup_nr_node_ids(void) +void __init setup_nr_node_ids(void) { unsigned int node; unsigned int highest = 0; @@ -4719,10 +4719,6 @@ static void __init setup_nr_node_ids(void) highest = node; nr_node_ids = highest + 1; } -#else -static inline void setup_nr_node_ids(void) -{ -} #endif /** -- 1.8.2
[PATCH 0/3] mm: avoid duplication of setup_nr_node_ids()
In arch/powerpc, arch/x86, and mm/page_alloc, code to set up nr_node_ids based on node_possible_map is duplicated. This patchset switches those copies to calling the function provided by page_alloc.
[PATCH 2/3] x86/mm/numa: use setup_nr_node_ids() instead of opencoding.
Signed-off-by: Cody P Schafer --- arch/x86/mm/numa.c | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 72fe01e..a71c4e2 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -114,14 +114,11 @@ void numa_clear_node(int cpu) */ void __init setup_node_to_cpumask_map(void) { - unsigned int node, num = 0; + unsigned int node; /* setup nr_node_ids if not done yet */ - if (nr_node_ids == MAX_NUMNODES) { - for_each_node_mask(node, node_possible_map) - num = node; - nr_node_ids = num + 1; - } + if (nr_node_ids == MAX_NUMNODES) + setup_nr_node_ids(); /* allocate the map */ for (node = 0; node < nr_node_ids; node++) -- 1.8.2
[PATCH 3/3] powerpc/mm/numa: use setup_nr_node_ids() instead of opencoding.
Signed-off-by: Cody P Schafer --- arch/powerpc/mm/numa.c | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index bba87ca..7574ae3 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -62,14 +62,11 @@ static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS]; */ static void __init setup_node_to_cpumask_map(void) { - unsigned int node, num = 0; + unsigned int node; /* setup nr_node_ids if not done yet */ - if (nr_node_ids == MAX_NUMNODES) { - for_each_node_mask(node, node_possible_map) - num = node; - nr_node_ids = num + 1; - } + if (nr_node_ids == MAX_NUMNODES) + setup_nr_node_ids(); /* allocate the map */ for (node = 0; node < nr_node_ids; node++) -- 1.8.2
Re: [RFC][PATCH] drivers: base: dynamic memory block creation
On 08/14/2013 02:14 PM, Seth Jennings wrote: >An existing tool would not work >with this patch (plus boot option) since it would not know how to >show/hide things. It lets _part_ of those existing tools get reused >since they only have to be taught how to show/hide things. > >I'd find this really intriguing if you found a way to keep even the old >tools working. Instead of having an explicit show/hide, why couldn't >you just create the entries on open(), for instance? Nathan and I talked about this and I'm not sure if sysfs would support such a thing, i.e. memory block creation when someone tried to cd into the memory block device config. I wouldn't know where to start on that. Also, I'd expect userspace tools might use readdir() to find out what memory blocks a system has (unless they just stat("memory0"), stat("memory1")...). I don't think filesystem tricks (at least within sysfs) are going to let this magically be solved without breaking the userspace API. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
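For context, this is roughly what a readdir()-based consumer looks like -- a plain userspace sketch, not taken from any particular tool. A memory block that only came into existence on open() would never show up in this listing:

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        DIR *d = opendir("/sys/devices/system/memory");
        struct dirent *de;

        if (!d) {
            perror("opendir");
            return 1;
        }
        /* enumerate memory block directories: memory0, memory1, ... */
        while ((de = readdir(d)) != NULL)
            if (strncmp(de->d_name, "memory", 6) == 0)
                printf("%s\n", de->d_name);
        closedir(d);
        return 0;
    }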
[PATCH 3/5] rbtree_test: add test for postorder iteration
Just check that we examine all nodes in the tree for the postorder iteration. Signed-off-by: Cody P Schafer --- lib/rbtree_test.c | 12 1 file changed, 12 insertions(+) diff --git a/lib/rbtree_test.c b/lib/rbtree_test.c index 122f02f..31dd4cc 100644 --- a/lib/rbtree_test.c +++ b/lib/rbtree_test.c @@ -114,6 +114,16 @@ static int black_path_count(struct rb_node *rb) return count; } +static void check_postorder(int nr_nodes) +{ + struct rb_node *rb; + int count = 0; + for (rb = rb_first_postorder(&root); rb; rb = rb_next_postorder(rb)) + count++; + + WARN_ON_ONCE(count != nr_nodes); +} + static void check(int nr_nodes) { struct rb_node *rb; @@ -136,6 +146,8 @@ static void check(int nr_nodes) WARN_ON_ONCE(count != nr_nodes); WARN_ON_ONCE(count < (1 << black_path_count(rb_last(&root))) - 1); + + check_postorder(nr_nodes); } static void check_augmented(int nr_nodes) -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 5/5] mm/zswap: use postorder iteration when destroying rbtree
Signed-off-by: Cody P Schafer --- mm/zswap.c | 15 ++- 1 file changed, 2 insertions(+), 13 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index deda2b6..98d99c4 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -791,25 +791,14 @@ static void zswap_frontswap_invalidate_area(unsigned type) { struct zswap_tree *tree = zswap_trees[type]; struct rb_node *node; - struct zswap_entry *entry; + struct zswap_entry *entry, *n; if (!tree) return; /* walk the tree and free everything */ spin_lock(&tree->lock); - /* -* TODO: Even though this code should not be executed because -* the try_to_unuse() in swapoff should have emptied the tree, -* it is very wasteful to rebalance the tree after every -* removal when we are freeing the whole tree. -* -* If post-order traversal code is ever added to the rbtree -* implementation, it should be used here. -*/ - while ((node = rb_first(&tree->rbroot))) { - entry = rb_entry(node, struct zswap_entry, rbnode); - rb_erase(&entry->rbnode, &tree->rbroot); + rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode) { zbud_free(tree->pool, entry->handle); zswap_entry_cache_free(entry); atomic_dec(&zswap_stored_pages); -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/5] rbtree: add rbtree_postorder_for_each_entry_safe() helper
Because deletion (of the entire tree) is a relatively common use of the rbtree_postorder iteration, and because doing it safely means fiddling with temporary storage, provide a helper to simplify postorder rbtree iteration. Signed-off-by: Cody P Schafer --- include/linux/rbtree.h | 17 + 1 file changed, 17 insertions(+) diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h index 2879e96..64ab98b 100644 --- a/include/linux/rbtree.h +++ b/include/linux/rbtree.h @@ -85,4 +85,21 @@ static inline void rb_link_node(struct rb_node * node, struct rb_node * parent, *rb_link = node; } +/** + * rbtree_postorder_for_each_entry_safe - iterate over rb_root in post order of + * given type safe against removal of rb_node entry + * + * @pos: the 'type *' to use as a loop cursor. + * @n: another 'type *' to use as temporary storage + * @root: 'rb_root *' of the rbtree. + * @field: the name of the rb_node field within 'type'. + */ +#define rbtree_postorder_for_each_entry_safe(pos, n, root, field) \ + for (pos = rb_entry(rb_first_postorder(root), typeof(*pos), field),\ + n = rb_entry(rb_next_postorder(&pos->field), \ + typeof(*pos), field); \ +&pos->field; \ +pos = n, \ + n = rb_entry(rb_next_postorder(&pos->field), typeof(*pos), field)) + #endif /* _LINUX_RBTREE_H */ -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/5] rbtree: allow tests to run as builtin
No reason to require rbtree test code to be a module; allow it to be builtin (streamlines my development process) Signed-off-by: Cody P Schafer --- lib/Kconfig.debug | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 1501aa5..606e3c8 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1442,7 +1442,7 @@ config BACKTRACE_SELF_TEST config RBTREE_TEST tristate "Red-Black tree test" - depends on m && DEBUG_KERNEL + depends on DEBUG_KERNEL help A benchmark measuring the performance of the rbtree library. Also includes rbtree invariant checks. -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/5] Add rbtree postorder iteration functions, runtime tests, and update zswap to use.
Postorder iteration yields all of a node's children prior to yielding the node itself, and this particular implementation also avoids examining the leaf links in a node after that node has been yielded. In what I expect will be it's most common usage, postorder iteration allows the deletion of every node in an rbtree without modifying the rbtree nodes (no _requirement_ that they be nulled) while avoiding referencing child nodes after they have been "deleted" (most commonly, freed). I have only updated zswap to use this functionality at this point, but numerous bits of code (most notably in the filesystem drivers) use a hand rolled postorder iteration that NULLs child links as it traverses the tree. Each of those instances could be replaced with this common implementation. Cody P Schafer (5): rbtree: add postorder iteration functions. rbtree: add rbtree_postorder_for_each_entry_safe() helper. rbtree_test: add test for postorder iteration. rbtree: allow tests to run as builtin mm/zswap: use postorder iteration when destroying rbtree include/linux/rbtree.h | 21 + lib/Kconfig.debug | 2 +- lib/rbtree.c | 40 lib/rbtree_test.c | 12 mm/zswap.c | 15 ++- 5 files changed, 76 insertions(+), 14 deletions(-) -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/5] rbtree: add postorder iteration functions
Add postorder iteration functions for rbtree. These are useful for safely freeing an entire rbtree without modifying the tree at all. Signed-off-by: Cody P Schafer --- include/linux/rbtree.h | 4 lib/rbtree.c | 40 2 files changed, 44 insertions(+) diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h index 0022c1b..2879e96 100644 --- a/include/linux/rbtree.h +++ b/include/linux/rbtree.h @@ -68,6 +68,10 @@ extern struct rb_node *rb_prev(const struct rb_node *); extern struct rb_node *rb_first(const struct rb_root *); extern struct rb_node *rb_last(const struct rb_root *); +/* Postorder iteration - always visit the parent after it's children */ +extern struct rb_node *rb_first_postorder(const struct rb_root *); +extern struct rb_node *rb_next_postorder(const struct rb_node *); + /* Fast replacement of a single node without remove/rebalance/add/rebalance */ extern void rb_replace_node(struct rb_node *victim, struct rb_node *new, struct rb_root *root); diff --git a/lib/rbtree.c b/lib/rbtree.c index c0e31fe..65f4eff 100644 --- a/lib/rbtree.c +++ b/lib/rbtree.c @@ -518,3 +518,43 @@ void rb_replace_node(struct rb_node *victim, struct rb_node *new, *new = *victim; } EXPORT_SYMBOL(rb_replace_node); + +static struct rb_node *rb_left_deepest_node(const struct rb_node *node) +{ + for (;;) { + if (node->rb_left) + node = node->rb_left; + else if (node->rb_right) + node = node->rb_right; + else + return (struct rb_node *)node; + } +} + +struct rb_node *rb_next_postorder(const struct rb_node *node) +{ + const struct rb_node *parent; + if (!node) + return NULL; + parent = rb_parent(node); + + /* If we're sitting on node, we've already seen our children */ + if (parent && node == parent->rb_left && parent->rb_right) { + /* If we are the parent's left node, go to the parent's right +* node then all the way down to the left */ + return rb_left_deepest_node(parent->rb_right); + } else + /* Otherwise we are the parent's right node, and the parent +* should be next */ + return (struct rb_node *)parent; +} +EXPORT_SYMBOL(rb_next_postorder); + +struct rb_node *rb_first_postorder(const struct rb_root *root) +{ + if (!root->rb_node) + return NULL; + + return rb_left_deepest_node(root->rb_node); +} +EXPORT_SYMBOL(rb_first_postorder); -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] rbtree: add postorder iteration functions
On 07/29/2013 08:01 AM, Seth Jennings wrote: On Fri, Jul 26, 2013 at 02:13:39PM -0700, Cody P Schafer wrote: diff --git a/lib/rbtree.c b/lib/rbtree.c index c0e31fe..65f4eff 100644 --- a/lib/rbtree.c +++ b/lib/rbtree.c @@ -518,3 +518,43 @@ void rb_replace_node(struct rb_node *victim, struct rb_node *new, *new = *victim; } EXPORT_SYMBOL(rb_replace_node); + +static struct rb_node *rb_left_deepest_node(const struct rb_node *node) +{ + for (;;) { + if (node->rb_left) + node = node->rb_left; Assigning to an argument passed as const seems weird to me. I would think it shouldn't compile but it does. I guess my understanding of const is incomplete. Ya, that is due to const's binding: const struct rb_node *node1; // the thing pointed to is const const struct rb_node node2; // node is const struct rb_node *const node3; // node is const const struct rb_node *const node4; // both node and the thing // pointed to are const And so it ends up being perfectly legal (I use the first case listed here). + else if (node->rb_right) + node = node->rb_right; + else + return (struct rb_node *)node; + } +} + +struct rb_node *rb_next_postorder(const struct rb_node *node) +{ + const struct rb_node *parent; + if (!node) + return NULL; + parent = rb_parent(node); Again here. Seth -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
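The binding rules above can be checked with a few lines of standalone C -- a reduced sketch using a stub rb_node rather than the kernel definition:

    struct rb_node {
        struct rb_node *rb_left, *rb_right;
    };

    int main(void)
    {
        struct rb_node a = { 0, 0 }, b = { 0, 0 };
        const struct rb_node *node1 = &a;  /* pointee is const */
        struct rb_node *const node3 = &a;  /* pointer is const */

        node1 = &b;  /* legal: the pointer itself is writable */
        /* node1->rb_left = &b;  error: assignment to const pointee */
        /* node3 = &b;           error: assignment to const pointer */
        (void)node3;
        return 0;
    }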
Re: [PATCH 2/5] rbtree: add rbtree_postorder_for_each_entry_safe() helper
On 07/29/2013 08:06 AM, Seth Jennings wrote: On Fri, Jul 26, 2013 at 02:13:40PM -0700, Cody P Schafer wrote: Because deletion (of the entire tree) is a relatively common use of the rbtree_postorder iteration, and because doing it safely means fiddling with temporary storage, provide a helper to simplify postorder rbtree iteration. Signed-off-by: Cody P Schafer --- include/linux/rbtree.h | 17 + 1 file changed, 17 insertions(+) diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h index 2879e96..64ab98b 100644 --- a/include/linux/rbtree.h +++ b/include/linux/rbtree.h @@ -85,4 +85,21 @@ static inline void rb_link_node(struct rb_node * node, struct rb_node * parent, *rb_link = node; } +/** + * rbtree_postorder_for_each_entry_safe - iterate over rb_root in post order of + * given type safe against removal of rb_node entry + * + * @pos: the 'type *' to use as a loop cursor. + * @n: another 'type *' to use as temporary storage + * @root: 'rb_root *' of the rbtree. + * @field: the name of the rb_node field within 'type'. + */ +#define rbtree_postorder_for_each_entry_safe(pos, n, root, field) \ + for (pos = rb_entry(rb_first_postorder(root), typeof(*pos), field),\ + n = rb_entry(rb_next_postorder(&pos->field), \ + typeof(*pos), field); \ +&pos->field; \ +pos = n, \ + n = rb_entry(rb_next_postorder(&pos->field), typeof(*pos), field)) One too many spaces. Also mix of tabs and spaces is weird, but checkpatch doesn't complain so... Seth The extra space is to set off ';' vs ',' in the macro. And I did that instead of a tab to avoid wrapping. I've adjusted them (in the next version) to use the same style as list.h's list_for_each*() macros. Which results in more wrapping :( . + #endif/* _LINUX_RBTREE_H */ -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 1/5] rbtree: add postorder iteration functions
Add postorder iteration functions for rbtree. These are useful for safely freeing an entire rbtree without modifying the tree at all. Signed-off-by: Cody P Schafer Reviewed-by: Seth Jennings --- include/linux/rbtree.h | 4 lib/rbtree.c | 40 2 files changed, 44 insertions(+) diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h index 0022c1b..c467151 100644 --- a/include/linux/rbtree.h +++ b/include/linux/rbtree.h @@ -68,6 +68,10 @@ extern struct rb_node *rb_prev(const struct rb_node *); extern struct rb_node *rb_first(const struct rb_root *); extern struct rb_node *rb_last(const struct rb_root *); +/* Postorder iteration - always visit the parent after its children */ +extern struct rb_node *rb_first_postorder(const struct rb_root *); +extern struct rb_node *rb_next_postorder(const struct rb_node *); + /* Fast replacement of a single node without remove/rebalance/add/rebalance */ extern void rb_replace_node(struct rb_node *victim, struct rb_node *new, struct rb_root *root); diff --git a/lib/rbtree.c b/lib/rbtree.c index c0e31fe..65f4eff 100644 --- a/lib/rbtree.c +++ b/lib/rbtree.c @@ -518,3 +518,43 @@ void rb_replace_node(struct rb_node *victim, struct rb_node *new, *new = *victim; } EXPORT_SYMBOL(rb_replace_node); + +static struct rb_node *rb_left_deepest_node(const struct rb_node *node) +{ + for (;;) { + if (node->rb_left) + node = node->rb_left; + else if (node->rb_right) + node = node->rb_right; + else + return (struct rb_node *)node; + } +} + +struct rb_node *rb_next_postorder(const struct rb_node *node) +{ + const struct rb_node *parent; + if (!node) + return NULL; + parent = rb_parent(node); + + /* If we're sitting on node, we've already seen our children */ + if (parent && node == parent->rb_left && parent->rb_right) { + /* If we are the parent's left node, go to the parent's right +* node then all the way down to the left */ + return rb_left_deepest_node(parent->rb_right); + } else + /* Otherwise we are the parent's right node, and the parent +* should be next */ + return (struct rb_node *)parent; +} +EXPORT_SYMBOL(rb_next_postorder); + +struct rb_node *rb_first_postorder(const struct rb_root *root) +{ + if (!root->rb_node) + return NULL; + + return rb_left_deepest_node(root->rb_node); +} +EXPORT_SYMBOL(rb_first_postorder); -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
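For reference, tearing down an entire tree with just these two functions takes roughly the following shape -- a kernel-context sketch with a hypothetical entry type. The key point is fetching the successor before freeing the current node; postorder guarantees the successor is never inside the freed node's subtree:

    struct foo {            /* hypothetical entry type */
        struct rb_node node;
    };

    static void free_tree(struct rb_root *root)
    {
        struct rb_node *rb = rb_first_postorder(root);

        while (rb) {
            /* grab the successor first, then the node can be freed */
            struct rb_node *next = rb_next_postorder(rb);

            kfree(rb_entry(rb, struct foo, node));
            rb = next;
        }
        *root = RB_ROOT;
    }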
[PATCH v2 5/5] mm/zswap: use postorder iteration when destroying rbtree
Signed-off-by: Cody P Schafer Reviewed-by: Seth Jennings --- mm/zswap.c | 16 ++-- 1 file changed, 2 insertions(+), 14 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index deda2b6..5c853b2 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -790,26 +790,14 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset) static void zswap_frontswap_invalidate_area(unsigned type) { struct zswap_tree *tree = zswap_trees[type]; - struct rb_node *node; - struct zswap_entry *entry; + struct zswap_entry *entry, *n; if (!tree) return; /* walk the tree and free everything */ spin_lock(&tree->lock); - /* -* TODO: Even though this code should not be executed because -* the try_to_unuse() in swapoff should have emptied the tree, -* it is very wasteful to rebalance the tree after every -* removal when we are freeing the whole tree. -* -* If post-order traversal code is ever added to the rbtree -* implementation, it should be used here. -*/ - while ((node = rb_first(&tree->rbroot))) { - entry = rb_entry(node, struct zswap_entry, rbnode); - rb_erase(&entry->rbnode, &tree->rbroot); + rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode) { zbud_free(tree->pool, entry->handle); zswap_entry_cache_free(entry); atomic_dec(&zswap_stored_pages); -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 0/5] Add rbtree postorder iteration functions, runtime tests, and update zswap to use
Postorder iteration yields all of a node's children prior to yielding the node itself, and this particular implementation also avoids examining the leaf links in a node after that node has been yielded. In what I expect will be its most common usage, postorder iteration allows the deletion of every node in an rbtree without modifying the rbtree nodes (no _requirement_ that they be nulled) while avoiding referencing child nodes after they have been "deleted" (most commonly, freed). I have only updated zswap to use this functionality at this point, but numerous bits of code (most notably in the filesystem drivers) use a hand rolled postorder iteration that NULLs child links as it traverses the tree. Each of those instances could be replaced with this common implementation. 1 & 2 add rbtree postorder iteration functions. 3 adds testing of the iteration to the rbtree runtime tests 4 allows building the rbtree runtime tests as builtins 5 updates zswap. -- since v1: - spacing - s/it's/its/ - remove now unused var in zswap code. - Reviewed-by: Seth Jennings Cody P Schafer (5): rbtree: add postorder iteration functions rbtree: add rbtree_postorder_for_each_entry_safe() helper rbtree_test: add test for postorder iteration rbtree: allow tests to run as builtin mm/zswap: use postorder iteration when destroying rbtree include/linux/rbtree.h | 22 ++ lib/Kconfig.debug | 2 +- lib/rbtree.c | 40 lib/rbtree_test.c | 12 mm/zswap.c | 16 ++-- 5 files changed, 77 insertions(+), 15 deletions(-) -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 3/5] rbtree_test: add test for postorder iteration
Just check that we examine all nodes in the tree for the postorder iteration. Signed-off-by: Cody P Schafer Reviewed-by: Seth Jennings --- lib/rbtree_test.c | 12 1 file changed, 12 insertions(+) diff --git a/lib/rbtree_test.c b/lib/rbtree_test.c index 122f02f..31dd4cc 100644 --- a/lib/rbtree_test.c +++ b/lib/rbtree_test.c @@ -114,6 +114,16 @@ static int black_path_count(struct rb_node *rb) return count; } +static void check_postorder(int nr_nodes) +{ + struct rb_node *rb; + int count = 0; + for (rb = rb_first_postorder(&root); rb; rb = rb_next_postorder(rb)) + count++; + + WARN_ON_ONCE(count != nr_nodes); +} + static void check(int nr_nodes) { struct rb_node *rb; @@ -136,6 +146,8 @@ static void check(int nr_nodes) WARN_ON_ONCE(count != nr_nodes); WARN_ON_ONCE(count < (1 << black_path_count(rb_last(&root))) - 1); + + check_postorder(nr_nodes); } static void check_augmented(int nr_nodes) -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 2/5] rbtree: add rbtree_postorder_for_each_entry_safe() helper
Because deletion (of the entire tree) is a relatively common use of the rbtree_postorder iteration, and because doing it safely means fiddling with temporary storage, provide a helper to simplify postorder rbtree iteration. Signed-off-by: Cody P Schafer Reviewed-by: Seth Jennings --- include/linux/rbtree.h | 18 ++ 1 file changed, 18 insertions(+) diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h index c467151..aa870a4 100644 --- a/include/linux/rbtree.h +++ b/include/linux/rbtree.h @@ -85,4 +85,22 @@ static inline void rb_link_node(struct rb_node * node, struct rb_node * parent, *rb_link = node; } +/** + * rbtree_postorder_for_each_entry_safe - iterate over rb_root in post order of + * given type safe against removal of rb_node entry + * + * @pos: the 'type *' to use as a loop cursor. + * @n: another 'type *' to use as temporary storage + * @root: 'rb_root *' of the rbtree. + * @field: the name of the rb_node field within 'type'. + */ +#define rbtree_postorder_for_each_entry_safe(pos, n, root, field) \ + for (pos = rb_entry(rb_first_postorder(root), typeof(*pos), field),\ + n = rb_entry(rb_next_postorder(&pos->field), \ + typeof(*pos), field); \ +&pos->field; \ +pos = n, \ + n = rb_entry(rb_next_postorder(&pos->field), \ + typeof(*pos), field)) + #endif /* _LINUX_RBTREE_H */ -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
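A minimal caller of the helper, again with a hypothetical entry type (the zswap conversion in patch 5 is the real in-tree user):

    struct foo {            /* hypothetical entry type */
        struct rb_node rbnode;
    };

    static void free_all(struct rb_root *root)
    {
        struct foo *pos, *n;

        /* 'n' holds the postorder successor, so freeing 'pos' is safe */
        rbtree_postorder_for_each_entry_safe(pos, n, root, rbnode)
            kfree(pos);

        *root = RB_ROOT;
    }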
[PATCH v2 4/5] rbtree: allow tests to run as builtin
No reason to require rbtree test code to be a module; allow it to be builtin (streamlines my development process) Signed-off-by: Cody P Schafer Reviewed-by: Seth Jennings --- lib/Kconfig.debug | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 1501aa5..606e3c8 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1442,7 +1442,7 @@ config BACKTRACE_SELF_TEST config RBTREE_TEST tristate "Red-Black tree test" - depends on m && DEBUG_KERNEL + depends on DEBUG_KERNEL help A benchmark measuring the performance of the rbtree library. Also includes rbtree invariant checks. -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm/hotplug: fix a drain pcp bug when offline pages
On 08/01/2013 02:18 AM, Xishi Qiu wrote: __offline_pages() start_isolate_page_range() set_migratetype_isolate() set_pageblock_migratetype() -> this pageblock will be marked as MIGRATE_ISOLATE move_freepages_block() -> pages in PageBuddy will be moved into MIGRATE_ISOLATE list drain_all_pages() -> drain PCP free_pcppages_bulk() mt = get_freepage_migratetype(page); -> PCP's migratetype is not MIGRATE_ISOLATE __free_one_page(page, zone, 0, mt); -> so PCP will not be freed into the MIGRATE_ISOLATE list In this case, the PCP may be allocated again, because they are not in PageBuddy's MIGRATE_ISOLATE list. This will cause offline_pages to fail. Signed-off-by: Xishi Qiu --- mm/page_alloc.c | 10 ++ mm/page_isolation.c | 15 ++- 2 files changed, 20 insertions(+), 5 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b100255..d873471 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -965,11 +965,13 @@ int move_freepages(struct zone *zone, } order = page_order(page); - list_move(&page->lru, - &zone->free_area[order].free_list[migratetype]); - set_freepage_migratetype(page, migratetype); + if (get_freepage_migratetype(page) != migratetype) { + list_move(&page->lru, + &zone->free_area[order].free_list[migratetype]); + set_freepage_migratetype(page, migratetype); + pages_moved += 1 << order; + } page += 1 << order; - pages_moved += 1 << order; So this looks like it changes the return value of move_freepages() from "the pages now belonging to the passed migrate type" to "the pages actually moved". The user of move_freepages_block()'s return value (and thus the return value of move_freepages()) in mm/page_alloc.c expects that it is the original meaning. The users in page_isolation.c expect it is the new meaning. Those need to be reconciled. } return pages_moved; diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 383bdbb..ba1afc9 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -65,8 +65,21 @@ out: } spin_unlock_irqrestore(&zone->lock, flags); - if (!ret) + + if (!ret) { drain_all_pages(); + /* +* When drain_all_pages() frees cached pages into the buddy +* system, it uses the stale migratetype cached in the +* page->index field, so try to move free pages to ISOLATE +* list again. +*/ + spin_lock_irqsave(&zone->lock, flags); + nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE); + __mod_zone_freepage_state(zone, -nr_pages, migratetype); + spin_unlock_irqrestore(&zone->lock, flags); + } + Could we teach drain_all_pages() to use the right migrate type instead (or add something similar that does)? (pages could be reallocated between the drain_all_pages() and move_freepages_block()). return ret; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
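The window mentioned at the end -- pages becoming reallocatable between the two calls -- as an interleaving sketch (hypothetical timing, not a captured trace):

    /*
     *  offline path                        allocator on another CPU
     *  ------------                        ------------------------
     *  set_pageblock_migratetype(ISOLATE)
     *  move_freepages_block()
     *  drain_all_pages()
     *                                      pcp pages freed with the stale
     *                                      cached migratetype land on a
     *                                      non-ISOLATE free list, and
     *                                      rmqueue() can hand them out
     *  move_freepages_block()  <- the second pass comes too late for
     *                             pages that were already reallocated
     */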
Re: [PATCH v2] mm: make lru_add_drain_all() selective
On 08/06/2013 01:22 PM, Chris Metcalf wrote: [...] /** + * schedule_on_each_cpu - execute a function synchronously on each online CPU + * @func: the function to call + * + * schedule_on_each_cpu() executes @func on each online CPU using the + * system workqueue and blocks until all CPUs have completed. + * schedule_on_each_cpu() is very slow. + * + * RETURNS: + * 0 on success, -errno on failure. + */ +int schedule_on_each_cpu(work_func_t func) +{ + get_online_cpus(); + schedule_on_cpu_mask(func, cpu_online_mask); Looks like the return value gets lost here. + put_online_cpus(); + return 0; +} + +/** * flush_scheduled_work - ensure that any scheduled work has run to completion. * * Forces execution of the kernel-global workqueue and blocks until its [...] -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
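A sketch of the fix being suggested, assuming schedule_on_cpu_mask() returns 0 or -errno as in the quoted patch:

    int schedule_on_each_cpu(work_func_t func)
    {
        int ret;

        get_online_cpus();
        ret = schedule_on_cpu_mask(func, cpu_online_mask);
        put_online_cpus();
        return ret;    /* propagate the error instead of discarding it */
    }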
Re: [PATCH 13/21] x86, acpi: Try to find SRAT in firmware earlier.
On 07/19/2013 12:59 AM, Tang Chen wrote: This patch introduce early_acpi_firmware_srat() to find the phys addr of SRAT provided by firmware. And call it in reserve_hotpluggable_memory(). Since we have initialized acpi_gbl_root_table_list earlier, and store all the tables' phys addrs and signatures in it, it is easy to find the SRAT. Signed-off-by: Tang Chen --- drivers/acpi/acpica/tbxface.c | 34 ++ drivers/acpi/osl.c| 24 include/acpi/acpixf.h |4 include/linux/acpi.h |4 mm/memory_hotplug.c | 10 +++--- 5 files changed, 73 insertions(+), 3 deletions(-) diff --git a/drivers/acpi/acpica/tbxface.c b/drivers/acpi/acpica/tbxface.c index ad11162..95f8d1b 100644 --- a/drivers/acpi/acpica/tbxface.c +++ b/drivers/acpi/acpica/tbxface.c @@ -181,6 +181,40 @@ acpi_status acpi_reallocate_root_table(void) return_ACPI_STATUS(status); } +/* + * acpi_get_table_desc - Get the acpi table descriptor of a specific table. + * @signature: The signature of the table to be found. + * @out_desc: The out returned descriptor. The "@out_desc:" line looks funky. Also, I believe changes to this file need to go in via acpica & probably conform to their commenting standards? + * + * This function iterates acpi_gbl_root_table_list and find the specified + * table's descriptor. + * + * NOTE: The caller has the responsibility to allocate memory for @out_desc. + * + * Return AE_OK on success, AE_NOT_FOUND if the table is not found. + */ +acpi_status acpi_get_table_desc(char *signature, + struct acpi_table_desc *out_desc) +{ + int pos; + + for (pos = 0; +pos < acpi_gbl_root_table_list.current_table_count; +pos++) { + if (!ACPI_COMPARE_NAME + (&(acpi_gbl_root_table_list.tables[pos].signature), + signature)) + continue; + + memcpy(out_desc, &acpi_gbl_root_table_list.tables[pos], + sizeof(struct acpi_table_desc)); + + return_ACPI_STATUS(AE_OK); + } + + return_ACPI_STATUS(AE_NOT_FOUND); +} + /*** * * FUNCTION:acpi_get_table_header diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c index fa6b973..a2e4596 100644 --- a/drivers/acpi/osl.c +++ b/drivers/acpi/osl.c @@ -53,6 +53,7 @@ #include #include #include +#include #define _COMPONENTACPI_OS_SERVICES ACPI_MODULE_NAME("osl"); @@ -750,6 +751,29 @@ void __init acpi_initrd_override(void *data, size_t size) } #endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */ +#ifdef CONFIG_ACPI_NUMA +#include +#include + +/* + * early_acpi_firmware_srat - Get the phys addr of SRAT provide by firmware. s/provide/provided/ + * + * This function iterate acpi_gbl_root_table_list, find SRAT and return the Perhaps: "Iterate over acpi_gbl_root_table_list to find SRAT then return its phys addr" Though I wonder if this comment is even needed, as the iteration is done in acpi_get_table_desc() (added above). + * phys addr of SRAT. + * + * Return the phys addr of SRAT, or 0 on error. + */ +phys_addr_t __init early_acpi_firmware_srat() +{ + struct acpi_table_desc table_desc; + + if (acpi_get_table_desc(ACPI_SIG_SRAT, &table_desc)) + return 0; + + return table_desc.address; +} +#endif /* CONFIG_ACPI_NUMA */ + static void acpi_table_taint(struct acpi_table_header *table) { pr_warn(PREFIX [...] 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 066873e..15b11d3 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -106,10 +106,14 @@ void __init reserve_hotpluggable_memory(void) { phys_addr_t srat_paddr; - /* Try to find if SRAT is overrided */ + /* Try to find out if SRAT is overrided */ srat_paddr = early_acpi_override_srat(); - if (!srat_paddr) - return; + if (!srat_paddr) { + /* Try to find SRAT from firmware if it wasn't overrided */ s/overrided/overridden/ + srat_paddr = early_acpi_firmware_srat(); + if (!srat_paddr) + return; + } /* Will reserve hotpluggable memory here */ } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 00/11] mm: fixup changers of per cpu pageset's ->high and ->batch
"Problems" with the current code: 1. there is a lack of synchronization in setting ->high and ->batch in percpu_pagelist_fraction_sysctl_handler() 2. stop_machine() in zone_pcp_update() is unnecissary. 3. zone_pcp_update() does not consider the case where percpu_pagelist_fraction is non-zero To fix: 1. add memory barriers, a safe ->batch value, an update side mutex when updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users that expect a stable value. 2. avoid draining pages in zone_pcp_update(), rely upon the memory barriers added to fix #1 3. factor out quite a few functions, and then call the appropriate one. Note that it results in a change to the behavior of zone_pcp_update(), which is used by memory_hotplug. I'm rather certain that I've diserned (and preserved) the essential behavior (changing ->high and ->batch), and only eliminated unneeded actions (draining the per cpu pages), but this may not be the case. Further note that the draining of pages that previously took place in zone_pcp_update() occured after repeated draining when attempting to offline a page, and after the offline has "succeeded". It appears that the draining was added to zone_pcp_update() to avoid refactoring setup_pageset() into 2 funtions. -- Since v2: https://lkml.org/lkml/2013/4/9/718 - note ACCESS_ONCE() in fix #1 above. - consolidate ->batch & ->high update protocol into a single funtion (Gilad). - add missing ACCESS_ONCE() on ->batch Since v1: https://lkml.org/lkml/2013/4/5/444 - instead of using on_each_cpu(), use memory barriers (Gilad) and an update side mutex. - add "Problem" #3 above, and fix. - rename function to match naming style of similar function - move unrelated comment -- Cody P Schafer (11): mm/page_alloc: factor out setting of pcp->high and pcp->batch. mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high mm/page_alloc: insert memory barriers to allow async update of pcp batch and high mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine() mm/page_alloc: when handling percpu_pagelist_fraction, don't unneedly recalulate high mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch() mm/page_alloc: relocate comment to be directly above code it refers to. mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset() mm/page_alloc: in zone_pcp_update(), uze zone_pageset_init() mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch() mm/page_alloc.c | 151 +--- 1 file changed, 90 insertions(+), 61 deletions(-) -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 01/11] mm/page_alloc: factor out setting of pcp->high and pcp->batch.
Creates pageset_set_batch() for use in setup_pageset(). pageset_set_batch() imitates the functionality of setup_pagelist_highmark(), but uses the boot time (percpu_pagelist_fraction == 0) calculations for determining ->high based on ->batch. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 98cbdf6..9e556e6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4030,6 +4030,14 @@ static int __meminit zone_batchsize(struct zone *zone) #endif } +/* a companion to setup_pagelist_highmark() */ +static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) +{ + struct per_cpu_pages *pcp = &p->pcp; + pcp->high = 6 * batch; + pcp->batch = max(1UL, 1 * batch); +} + static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) { struct per_cpu_pages *pcp; @@ -4039,8 +4047,7 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) pcp = &p->pcp; pcp->count = 0; - pcp->high = 6 * batch; - pcp->batch = max(1UL, 1 * batch); + pageset_set_batch(p, batch); for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) INIT_LIST_HEAD(&pcp->lists[migratetype]); } @@ -4049,7 +4056,6 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist * to the value high for the pageset p. */ - static void setup_pagelist_highmark(struct per_cpu_pageset *p, unsigned long high) { -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 02/11] mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high
Because we are going to rely upon a careful transition between old and new ->high and ->batch values using memory barriers and will remove stop_machine(), we need to prevent multiple updaters from interweaving their memory writes. Add a simple mutex to protect both update loops. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 8 1 file changed, 8 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9e556e6..cea883d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -65,6 +65,9 @@ #include #include "internal.h" +/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ +static DEFINE_MUTEX(pcp_batch_high_lock); + #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID DEFINE_PER_CPU(int, numa_node); EXPORT_PER_CPU_SYMBOL(numa_node); @@ -,6 +5558,8 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, ret = proc_dointvec_minmax(table, write, buffer, length, ppos); if (!write || (ret < 0)) return ret; + + mutex_lock(&pcp_batch_high_lock); for_each_populated_zone(zone) { for_each_possible_cpu(cpu) { unsigned long high; @@ -5563,6 +5568,7 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, per_cpu_ptr(zone->pageset, cpu), high); } } + mutex_unlock(&pcp_batch_high_lock); return 0; } @@ -6076,7 +6082,9 @@ static int __meminit __zone_pcp_update(void *data) void __meminit zone_pcp_update(struct zone *zone) { + mutex_lock(&pcp_batch_high_lock); stop_machine(__zone_pcp_update, zone, NULL); + mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 05/11] mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine()
zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a percpu pageset based on a zone's ->managed_pages. We don't need to drain the entire percpu pageset just to modify these fields. This lets us avoid calling setup_pageset() (and the draining required to call it) and instead allows simply setting the fields' values (with some attention paid to memory barriers to prevent the relationship between ->batch and ->high from being thrown off). This does change the behavior of zone_pcp_update() as the percpu pagesets will not be drained when zone_pcp_update() is called (they will end up being shrunk, not completely drained, later when a 0-order page is freed in free_hot_cold_page()). Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 33 + 1 file changed, 9 insertions(+), 24 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 71d843d..5a00195 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6083,33 +6083,18 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages) #endif #ifdef CONFIG_MEMORY_HOTPLUG -static int __meminit __zone_pcp_update(void *data) -{ - struct zone *zone = data; - int cpu; - unsigned long batch = zone_batchsize(zone), flags; - - for_each_possible_cpu(cpu) { - struct per_cpu_pageset *pset; - struct per_cpu_pages *pcp; - - pset = per_cpu_ptr(zone->pageset, cpu); - pcp = &pset->pcp; - - local_irq_save(flags); - if (pcp->count > 0) - free_pcppages_bulk(zone, pcp->count, pcp); - drain_zonestat(zone, pset); - setup_pageset(pset, batch); - local_irq_restore(flags); - } - return 0; -} - +/* + * The zone indicated has a new number of managed_pages; batch sizes and percpu + * page high values need to be recalculated. + */ void __meminit zone_pcp_update(struct zone *zone) { + unsigned cpu; + unsigned long batch; mutex_lock(&pcp_batch_high_lock); + batch = zone_batchsize(zone); + for_each_possible_cpu(cpu) + pageset_set_batch(per_cpu_ptr(zone->pageset, cpu), batch); mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 07/11] mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch()
Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3583281..696ce96 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4069,7 +4069,7 @@ static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch)); } -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) +static void pageset_init(struct per_cpu_pageset *p) { struct per_cpu_pages *pcp; int migratetype; @@ -4078,11 +4078,16 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) pcp = &p->pcp; pcp->count = 0; - pageset_set_batch(p, batch); for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) INIT_LIST_HEAD(&pcp->lists[migratetype]); } +static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) +{ + pageset_init(p); + pageset_set_batch(p, batch); +} + /* * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist * to the value high for the pageset p. -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 09/11] mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset()
Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 27 +++ 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 53c62c5..0c3cdbb6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4102,22 +4102,25 @@ static void setup_pagelist_highmark(struct per_cpu_pageset *p, pageset_update(&p->pcp, high, batch); } +static void __meminit zone_pageset_init(struct zone *zone, int cpu) +{ + struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu); + + pageset_init(pcp); + if (percpu_pagelist_fraction) + setup_pagelist_highmark(pcp, + (zone->managed_pages / + percpu_pagelist_fraction)); + else + pageset_set_batch(pcp, zone_batchsize(zone)); +} + static void __meminit setup_zone_pageset(struct zone *zone) { int cpu; - zone->pageset = alloc_percpu(struct per_cpu_pageset); - - for_each_possible_cpu(cpu) { - struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu); - - setup_pageset(pcp, zone_batchsize(zone)); - - if (percpu_pagelist_fraction) - setup_pagelist_highmark(pcp, - (zone->managed_pages / - percpu_pagelist_fraction)); - } + for_each_possible_cpu(cpu) + zone_pageset_init(zone, cpu); } /* -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 10/11] mm/page_alloc: in zone_pcp_update(), use zone_pageset_init()
Previously, zone_pcp_update() called pageset_set_batch() directly, essentially assuming that percpu_pagelist_fraction == 0. Correct this by calling zone_pageset_init(), which chooses the appropriate ->batch and ->high calculations. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0c3cdbb6..251fb5f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6096,11 +6096,9 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages) void __meminit zone_pcp_update(struct zone *zone) { unsigned cpu; - unsigned long batch; mutex_lock(&pcp_batch_high_lock); - batch = zone_batchsize(zone); for_each_possible_cpu(cpu) - pageset_set_batch(per_cpu_ptr(zone->pageset, cpu), batch); + zone_pageset_init(zone, cpu); mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 08/11] mm/page_alloc: relocate comment to be directly above code it refers to.
Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 696ce96..53c62c5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3709,12 +3709,12 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone) mminit_verify_zonelist(); cpuset_init_current_mems_allowed(); } else { - /* we have to stop all cpus to guarantee there is no user - of zonelist */ #ifdef CONFIG_MEMORY_HOTPLUG if (zone) setup_zone_pageset(zone); #endif + /* we have to stop all cpus to guarantee there is no user + of zonelist */ stop_machine(__build_all_zonelists, pgdat, NULL); /* cpuset refresh routine should be here */ } -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 11/11] mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch()
Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 251fb5f..b335c98 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4063,7 +4063,7 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high, pcp->batch = batch; } -/* a companion to setup_pagelist_highmark() */ +/* a companion to pageset_set_high() */ static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) { pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch)); @@ -4089,10 +4089,10 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) } /* - * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist + * pageset_set_high() sets the high water mark for hot per_cpu_pagelist * to the value high for the pageset p. */ -static void setup_pagelist_highmark(struct per_cpu_pageset *p, +static void pageset_set_high(struct per_cpu_pageset *p, unsigned long high) { unsigned long batch = max(1UL, high / 4); @@ -4108,7 +4108,7 @@ static void __meminit zone_pageset_init(struct zone *zone, int cpu) pageset_init(pcp); if (percpu_pagelist_fraction) - setup_pagelist_highmark(pcp, + pageset_set_high(pcp, (zone->managed_pages / percpu_pagelist_fraction)); else @@ -5597,8 +5597,8 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, unsigned long high; high = zone->managed_pages / percpu_pagelist_fraction; for_each_possible_cpu(cpu) - setup_pagelist_highmark( - per_cpu_ptr(zone->pageset, cpu), high); + pageset_set_high(per_cpu_ptr(zone->pageset, cpu), +high); } mutex_unlock(&pcp_batch_high_lock); return 0; -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 06/11] mm/page_alloc: when handling percpu_pagelist_fraction, don't needlessly recalculate high
Simply moves calculation of the new 'high' value outside the for_each_possible_cpu() loop, as it does not depend on the cpu. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5a00195..3583281 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5573,7 +5573,6 @@ int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write, * cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist * can have before it gets flushed back to buddy allocator. */ - int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { @@ -5587,12 +5586,11 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, mutex_lock(&pcp_batch_high_lock); for_each_populated_zone(zone) { - for_each_possible_cpu(cpu) { - unsigned long high; - high = zone->managed_pages / percpu_pagelist_fraction; + unsigned long high; + high = zone->managed_pages / percpu_pagelist_fraction; + for_each_possible_cpu(cpu) setup_pagelist_highmark( - per_cpu_ptr(zone->pageset, cpu), high); - } + per_cpu_ptr(zone->pageset, cpu), high); } mutex_unlock(&pcp_batch_high_lock); return 0; -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 03/11] mm/page_alloc: insert memory barriers to allow async update of pcp batch and high
Introduce pageset_update() to perform a safe transition from one set of pcp->{batch,high} to a new set using memory barriers. This ensures that batch is always set to a safe value (1) prior to updating high, and ensures that high is fully updated before setting the real value of batch. It avoids ->batch ever rising above ->high. Suggested by Gilad Ben-Yossef in these threads: https://lkml.org/lkml/2013/4/9/23 https://lkml.org/lkml/2013/4/10/49 Also reproduces his proposed comment. Reviewed-by: Gilad Ben-Yossef Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 41 - 1 file changed, 32 insertions(+), 9 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index cea883d..7e45b91 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4033,12 +4033,37 @@ static int __meminit zone_batchsize(struct zone *zone) #endif } +/* + * pcp->high and pcp->batch values are related and dependent on one another: + * ->batch must never be higher than ->high. + * The following function updates them in a safe manner without read side + * locking. + * + * Any new users of pcp->batch and pcp->high should ensure they can cope with + * those fields changing asynchronously (according to the above rule). + * + * mutex_is_locked(&pcp_batch_high_lock) required when calling this function + * outside of boot time (or some other assurance that no concurrent updaters + * exist). + */ +static void pageset_update(struct per_cpu_pages *pcp, unsigned long high, + unsigned long batch) +{ + /* start with a fail safe value for batch */ + pcp->batch = 1; + smp_wmb(); + + /* Update high, then batch, in order */ + pcp->high = high; + smp_wmb(); + + pcp->batch = batch; +} + /* a companion to setup_pagelist_highmark() */ static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) { - struct per_cpu_pages *pcp = &p->pcp; - pcp->high = 6 * batch; - pcp->batch = max(1UL, 1 * batch); + pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch)); } static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) @@ -4062,13 +4087,11 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) static void setup_pagelist_highmark(struct per_cpu_pageset *p, unsigned long high) { - struct per_cpu_pages *pcp; + unsigned long batch = max(1UL, high / 4); + if ((high / 4) > (PAGE_SHIFT * 8)) + batch = PAGE_SHIFT * 8; - pcp = &p->pcp; - pcp->high = high; - pcp->batch = max(1UL, high/4); - if ((high/4) > (PAGE_SHIFT * 8)) - pcp->batch = PAGE_SHIFT * 8; + pageset_update(&p->pcp, high, batch); } static void __meminit setup_zone_pageset(struct zone *zone) -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 04/11] mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE
pcp->batch could change at any point, avoid relying on it being a stable value. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7e45b91..71d843d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1182,10 +1182,12 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) { unsigned long flags; int to_drain; + unsigned long batch; local_irq_save(flags); - if (pcp->count >= pcp->batch) - to_drain = pcp->batch; + batch = ACCESS_ONCE(pcp->batch); + if (pcp->count >= batch) + to_drain = batch; else to_drain = pcp->count; if (to_drain > 0) { @@ -1353,8 +1355,9 @@ void free_hot_cold_page(struct page *page, int cold) list_add(&page->lru, &pcp->lists[migratetype]); pcp->count++; if (pcp->count >= pcp->high) { - free_pcppages_bulk(zone, pcp->batch, pcp); - pcp->count -= pcp->batch; + unsigned long batch = ACCESS_ONCE(pcp->batch); + free_pcppages_bulk(zone, batch, pcp); + pcp->count -= batch; } out: -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
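To illustrate why the snapshot matters: without ACCESS_ONCE() the compiler is free to reload pcp->batch at each use, so a concurrent pageset_update() can make the two reads disagree (fragments for illustration only):

    /* racy: two independent loads of pcp->batch */
    if (pcp->count >= pcp->batch)
        free_pcppages_bulk(zone, pcp->batch, pcp);

    /* fixed: one snapshot, used consistently for test and free */
    unsigned long batch = ACCESS_ONCE(pcp->batch);
    if (pcp->count >= batch)
        free_pcppages_bulk(zone, batch, pcp);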
Re: [PATCH RESEND v3 00/11] mm: fixup changers of per cpu pageset's ->high and ->batch
On 05/13/2013 12:20 PM, Pekka Enberg wrote: Hi Cody, On Mon, May 13, 2013 at 10:08 PM, Cody P Schafer wrote: "Problems" with the current code: 1. there is a lack of synchronization in setting ->high and ->batch in percpu_pagelist_fraction_sysctl_handler() 2. stop_machine() in zone_pcp_update() is unnecessary. 3. zone_pcp_update() does not consider the case where percpu_pagelist_fraction is non-zero Maybe it's just me but I find the above problem description confusing. How does the problem manifest itself? 1. I've not reproduced this causing issues. 2. Calling zone_pcp_update() is slow. 3. Not reproduced either, but would cause percpu_pagelist_fraction (set via sysctl) to be ignored after a call to zone_pcp_update() (for example, after a memory hotplug). How did you find out about it? I'm writing some code that resizes zones and thus uses zone_pcp_update(), and fixing broken things along the way. Why do we need to fix all three problems in the same patch set? They all affect the same bit of code and fixing all of them means restructuring both of the sites where ->high and ->batch are set. Additionally, splitting it out (if possible) would make it less clear what the overall goal is, and would mean a few inter-patchset dependencies, which are undesirable. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 0/4] misc patches related to resizing nodes & zones
The first two are comment fixes; the last two add pgdat_resize_lock()/unlock() usage per the existing documentation. -- Since v2 (http://comments.gmane.org/gmane.linux.kernel.mm/99316): - add ack on patch 1 from rientjes. - quote documentation in patch 3 & 4. -- Since v1 (http://thread.gmane.org/gmane.linux.kernel.mm/99297): - drop making lock_memory_hotplug() required (old patch #1) - fix __offline_pages() in the same manner as online_pages() (rientjes) - make comment regarding pgdat_resize_lock()/unlock() usage more clear (rientjes) Cody P Schafer (4): mm: fix comment referring to non-existent size_seqlock, change to span_seqlock mmzone: note that node_size_lock should be manipulated via pgdat_resize_lock() memory_hotplug: use pgdat_resize_lock() in online_pages() memory_hotplug: use pgdat_resize_lock() in __offline_pages() include/linux/mmzone.h | 5 - mm/memory_hotplug.c| 9 + 2 files changed, 13 insertions(+), 1 deletion(-) -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 1/4] mm: fix comment referring to non-existent size_seqlock, change to span_seqlock
Signed-off-by: Cody P Schafer Acked-by: David Rientjes --- include/linux/mmzone.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 5c76737..fc859a0c 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -716,7 +716,7 @@ typedef struct pglist_data { * or node_spanned_pages stay constant. Holding this will also * guarantee that any pfn_valid() stays that way. * -* Nests above zone->lock and zone->size_seqlock. +* Nests above zone->lock and zone->span_seqlock */ spinlock_t node_size_lock; #endif -- 1.8.2.2
[PATCH v3 2/4] mmzone: note that node_size_lock should be manipulated via pgdat_resize_lock()
Signed-off-by: Cody P Schafer --- include/linux/mmzone.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fc859a0c..41557be 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -716,6 +716,9 @@ typedef struct pglist_data { * or node_spanned_pages stay constant. Holding this will also * guarantee that any pfn_valid() stays that way. * +* pgdat_resize_lock() and pgdat_resize_unlock() are provided to +* manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG. +* * Nests above zone->lock and zone->span_seqlock */ spinlock_t node_size_lock; -- 1.8.2.2
[PATCH v3 3/4] memory_hotplug: use pgdat_resize_lock() in online_pages()
mmzone.h documents node_size_lock (which pgdat_resize_lock() locks) as follows: * Must be held any time you expect node_start_pfn, node_present_pages * or node_spanned_pages stay constant. [...] So actually hold it when we update node_present_pages in online_pages(). Signed-off-by: Cody P Schafer --- mm/memory_hotplug.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index a221fac..0bdca10 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -915,6 +915,7 @@ static void node_states_set_node(int node, struct memory_notify *arg) int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type) { + unsigned long flags; unsigned long onlined_pages = 0; struct zone *zone; int need_zonelists_rebuild = 0; @@ -993,7 +994,11 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ zone->managed_pages += onlined_pages; zone->present_pages += onlined_pages; + + pgdat_resize_lock(zone->zone_pgdat, &flags); zone->zone_pgdat->node_present_pages += onlined_pages; + pgdat_resize_unlock(zone->zone_pgdat, &flags); + if (onlined_pages) { node_states_set_node(zone_to_nid(zone), &arg); if (need_zonelists_rebuild) -- 1.8.2.2
[PATCH v3 4/4] memory_hotplug: use pgdat_resize_lock() in __offline_pages()
mmzone.h documents node_size_lock (which pgdat_resize_lock() locks) as follows: * Must be held any time you expect node_start_pfn, node_present_pages * or node_spanned_pages stay constant. [...] So actually hold it when we update node_present_pages in __offline_pages(). Signed-off-by: Cody P Schafer --- mm/memory_hotplug.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 0bdca10..b59a695 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1582,7 +1582,11 @@ repeat: /* removal success */ zone->managed_pages -= offlined_pages; zone->present_pages -= offlined_pages; + + pgdat_resize_lock(zone->zone_pgdat, &flags); zone->zone_pgdat->node_present_pages -= offlined_pages; + pgdat_resize_unlock(zone->zone_pgdat, &flags); + totalram_pages -= offlined_pages; init_per_zone_wmark_min(); -- 1.8.2.2
[PATCH] checkpatch: avoid warning on page flag accessors
Using Page*() triggers a camelcase warning, but shouldn't. Introduced by be987d9f80354e2e919926349282facd74992f90, which added the other Page flag users. Pipe ('|') at the end of a grouping doesn't cause the grouping to match an empty string, use ? after the grouping instead to get the desired behavior. Signed-off-by: Cody P Schafer --- scripts/checkpatch.pl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index b954de5..ee24026 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -2939,7 +2939,7 @@ sub process { my $var = $1; if ($var !~ /$Constant/ && $var =~ /[A-Z]\w*[a-z]|[a-z]\w*[A-Z]/ && - $var !~ /"^(?:Clear|Set|TestClear|TestSet|)Page[A-Z]/ && + $var !~ /"^(?:Clear|Set|TestClear|TestSet)?Page[A-Z]/ && !defined $camelcase{$var}) { $camelcase{$var} = 1; WARN("CAMELCASE", -- 1.8.2.2
Re: [PATCH] checkpatch: avoid warning on page flag accessors
On 05/14/2013 09:45 AM, Cody P Schafer wrote: Using Page*() triggers a camelcase warning, but shouldn't. Introduced by be987d9f80354e2e919926349282facd74992f90, which added the other Page flag users. Pipe ('|') at the end of a grouping doesn't cause the grouping to match an empty string, use ? after the grouping instead to get the desired behavior. Signed-off-by: Cody P Schafer --- scripts/checkpatch.pl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index b954de5..ee24026 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -2939,7 +2939,7 @@ sub process { my $var = $1; if ($var !~ /$Constant/ && $var =~ /[A-Z]\w*[a-z]|[a-z]\w*[A-Z]/ && - $var !~ /"^(?:Clear|Set|TestClear|TestSet|)Page[A-Z]/ && + $var !~ /"^(?:Clear|Set|TestClear|TestSet)?Page[A-Z]/ && !defined $camelcase{$var}) { $camelcase{$var} = 1; WARN("CAMELCASE", Nix this, appears to still be broken WARNING: Avoid CamelCase: #51: FILE: mm/page_alloc.c:449: + VM_BUG_ON(!PageBuddy(page)); total: 0 errors, 1 warnings, 46 lines checked /home/cody/mm/0008-page_alloc-add-return_pages_to_zone-when-DYNAMIC_NUM.patch has style problems, please review.
[PATCH v2] checkpatch: avoid warning on page flag accessors
Using Page*() triggers a camelcase warning, but shouldn't. be987d9f80354e2e919926349282facd74992f90 added a spurious '"' (double quote) breaking the regex. Signed-off-by: Cody P Schafer --- scripts/checkpatch.pl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index b954de5..ee24026 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -2939,7 +2939,7 @@ sub process { my $var = $1; if ($var !~ /$Constant/ && $var =~ /[A-Z]\w*[a-z]|[a-z]\w*[A-Z]/ && - $var !~ /"^(?:Clear|Set|TestClear|TestSet|)Page[A-Z]/ && + $var !~ /"^(?:Clear|Set|TestClear|TestSet)?Page[A-Z]/ && !defined $camelcase{$var}) { $camelcase{$var} = 1; WARN("CAMELCASE", -- 1.8.2.2
[PATCH v3] checkpatch: avoid warning on page flag accessors
Using Page*() triggers a camelcase warning, but shouldn't. be987d9f80354e2e919926349282facd74992f90 added a spurious '"' (double quote) breaking the regex. Signed-off-by: Cody P Schafer --- scripts/checkpatch.pl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- Apologies for the noise, I updated my commit message but not the patch in the previous version. diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index b954de5..ee24026 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -2939,7 +2939,7 @@ sub process { my $var = $1; if ($var !~ /$Constant/ && $var =~ /[A-Z]\w*[a-z]|[a-z]\w*[A-Z]/ && - $var !~ /"^(?:Clear|Set|TestClear|TestSet|)Page[A-Z]/ && + $var !~ /"^(?:Clear|Set|TestClear|TestSet)?Page[A-Z]/ && !defined $camelcase{$var}) { $camelcase{$var} = 1; WARN("CAMELCASE", -- 1.8.2.2
Re: [PATCH v2] checkpatch: avoid warning on page flag accessors
On 05/14/2013 10:13 AM, Joe Perches wrote: I think you need to delete the ", leave the | and remove the ? Dunno where that " came from. Ya, commit goof on my part; see v3.
[PATCH v4] checkpatch: avoid warning on page flag accessors
Using Page*() triggers a camelcase warning, but shouldn't. be987d9f80354e2e919926349282facd74992f90 added a spurious '"' (double quote) breaking the regex. Signed-off-by: Cody P Schafer --- scripts/checkpatch.pl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- And using sha1s in format-patch means I have to actually copy the new ones. (Again, sorry). diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index b954de5..d2fe01d 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -2939,7 +2939,7 @@ sub process { my $var = $1; if ($var !~ /$Constant/ && $var =~ /[A-Z]\w*[a-z]|[a-z]\w*[A-Z]/ && - $var !~ /"^(?:Clear|Set|TestClear|TestSet|)Page[A-Z]/ && + $var !~ /^(?:Clear|Set|TestClear|TestSet|)Page[A-Z]/ && !defined $camelcase{$var}) { $camelcase{$var} = 1; WARN("CAMELCASE", -- 1.8.2.2
Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch
On 04/07/2013 08:23 AM, KOSAKI Motohiro wrote: > (4/5/13 4:33 PM), Cody P Schafer wrote: >> In one case while modifying the ->high and ->batch fields of per cpu pagesets >> we're needlessly using stop_machine() (patches 1 & 2), and in another we >> don't have any >> synchronization at all (patch 3). >> >> This patchset fixes both of them. >> >> Note that it results in a change to the behavior of zone_pcp_update(), which >> is >> used by memory_hotplug. I _think_ that I've discerned (and preserved) the >> essential behavior (changing ->high and ->batch), and only eliminated >> unneeded >> actions (draining the per cpu pages), but this may not be the case. > > at least, memory hotplug needs to drain. Could you explain why the drain is required here? From what I can tell, after the stop_machine() completes, the per cpu page sets could be repopulated at any point, making the combination of draining and modifying ->batch & ->high unneeded.
Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.
On 04/08/2013 05:20 AM, Gilad Ben-Yossef wrote: On Fri, Apr 5, 2013 at 11:33 PM, Cody P Schafer wrote: In free_hot_cold_page(), we rely on pcp->batch remaining stable. Updating it without being on the cpu owning the percpu pageset potentially destroys this stability. Change for_each_cpu() to on_each_cpu() to fix. Are you referring to this? - This was the case I noticed. if (pcp->count >= pcp->high) { free_pcppages_bulk(zone, pcp->batch, pcp); pcp->count -= pcp->batch; } I'm probably missing the obvious but won't it be simpler to do this in free_hot_cold_page() - if (pcp->count >= pcp->high) { unsigned int batch = ACCESS_ONCE(pcp->batch); free_pcppages_bulk(zone, batch, pcp); pcp->count -= batch; } Potentially, yes. Note that this was simply the one case I noticed, rather than certainly the only case. I also wonder whether there could be unexpected interactions between ->high and ->batch not changing together atomically. For example, could adjusting this knob cause ->batch to rise enough that it is greater than the previous ->high? If the code above then runs with the previous ->high, ->count wouldn't be correct (checking this inside free_pcppages_bulk() might help on this one issue). Now the batch value used is stable and you don't have to IPI every CPU in the system just to change a config knob... Is this really considered an issue? I wouldn't have expected someone to adjust the config knob often enough (or even more than once) to cause problems. Of course as an "it'd be nice" thing, I completely agree.
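If the defensive check mooted above were wanted, it might look roughly like the following (toy types and names, not the real free_pcppages_bulk() signature, which takes a zone and frees from the pcp's per-migratetype lists): clamp the request to what is actually present and report back how many pages were really freed.

struct toy_pcp {
	int count;	/* pages currently on the per-cpu lists */
};

/*
 * Defensive bulk free: a ->batch that has transiently outgrown the
 * old ->high must not be allowed to drive ->count negative.
 */
static int toy_free_pcppages_bulk(struct toy_pcp *pcp, int to_free)
{
	if (to_free > pcp->count)
		to_free = pcp->count;
	pcp->count -= to_free;
	/* ... here, return 'to_free' pages to the buddy allocator ... */
	return to_free;	/* callers adjust their bookkeeping by this */
}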
Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()
On 04/07/2013 08:39 AM, KOSAKI Motohiro wrote: > (4/5/13 4:33 PM), Cody P Schafer wrote: >> No off-cpu users of the percpu pagesets exist. >> >> zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a >> percpu pageset based on a zone's ->managed_pages. We don't need to drain >> the entire percpu pageset just to modify these fields. Avoid calling >> setup_pageset() (and the draining required to call it) and instead just >> set the fields' values. >> >> This does change the behavior of zone_pcp_update() as the percpu >> pagesets will not be drained when zone_pcp_update() is called (they will >> end up being shrunk, not completely drained, later when a 0-order page >> is freed in free_hot_cold_page()). >> >> Signed-off-by: Cody P Schafer > > NAK. > > 1) zone_pcp_update() is only used from memory hotplug and it requires page > drain. I'm looking at this code because I'm currently working on a patchset which adds another interface which modifies zone sizes, so "only used from memory hotplug" is a temporary thing (unless I discover that zone_pcp_update() is not intended to do what I want it to do). > 2) stop_machine() is used for avoiding race. just removing it is insane. What race? Is there a cross cpu access to ->high & ->batch that makes using on_each_cpu() instead of stop_machine() inappropriate? It is absolutely not just being removed.
Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.
On 04/06/2013 06:56 PM, Simon Jeons wrote: Hi Cody, On 04/06/2013 04:33 AM, Cody P Schafer wrote: In free_hot_cold_page(), we rely on pcp->batch remaining stable. Updating it without being on the cpu owning the percpu pageset potentially destroys this stability. If a cpu is off, can its pcp pageset be used in free_hot_cold_page()? I don't expect it to be, as we use this_cpu_ptr() to access the pcp pageset. Unless there is some crazy mode where we can override the cpu a task believes it is running on...
Re: [PATCH 1/3] mm/page_alloc: factor out setting of pcp->high and pcp->batch.
On 04/06/2013 06:37 PM, Simon Jeons wrote: Hi Cody, On 04/06/2013 04:33 AM, Cody P Schafer wrote: Creates pageset_set_batch() for use in setup_pageset(). pageset_set_batch() imitates the functionality of setup_pagelist_highmark(), but uses the boot time (percpu_pagelist_fraction == 0) calculations for determining ->high. Why do pcp->high and pcp->batch need adjusting while the system is running? What's the requirement? There is currently a sysctl (which I patch later in this series) which allows adjusting the ->high mark (and, indirectly, ->batch). Additionally, memory hotplug changes ->high and ->batch due to the zone size changing (essentially, zone->managed_pages and zone->present_pages have changed), meaning that zone_batchsize(), which is used at boot to set ->batch and (indirectly) ->high, has a different output. Note that in addition to the 2 users of this functionality mentioned here, I'm currently working on another resizer of zones (runtime NUMA reconfiguration).
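To make the sysctl arithmetic concrete, here is a rough user-space rendering of the calculation (illustrative only; it mirrors setup_pagelist_highmark() from the patches in this series and assumes PAGE_SHIFT == 12, i.e. 4K pages): for a zone with 262144 managed pages and percpu_pagelist_fraction = 8, each cpu's pageset ends up with high = 32768 and batch clamped to 96.

#define TOY_PAGE_SHIFT 12	/* assumption: 4K pages */

/* mirrors the ->high/->batch math of setup_pagelist_highmark() */
static void toy_set_highmark(unsigned long managed_pages,
			     unsigned long fraction,
			     unsigned long *high, unsigned long *batch)
{
	*high = managed_pages / fraction;	/* e.g. 262144 / 8 = 32768 */
	*batch = *high / 4;
	if (*batch > TOY_PAGE_SHIFT * 8)
		*batch = TOY_PAGE_SHIFT * 8;	/* clamp: 12 * 8 = 96 */
	if (*batch < 1)
		*batch = 1;			/* never zero */
}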
Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch
On 04/06/2013 06:32 PM, Simon Jeons wrote: Hi Cody, On 04/06/2013 04:33 AM, Cody P Schafer wrote: In one case while modifying the ->high and ->batch fields of per cpu pagesets we're needlessly using stop_machine() (patches 1 & 2), and in another we don't have any synchronization at all (patch 3). Do you mean stop_machine() is used for synchronization between each online cpu? I mean that it looks like synchronization between cpus is unneeded because of how per cpu pagesets are used, so stop_machine() (which does provide synchronization between all cpus) is unnecessarily "strong".
Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()
On 04/08/2013 12:26 PM, KOSAKI Motohiro wrote: (4/8/13 1:32 PM), Cody P Schafer wrote: On 04/07/2013 08:39 AM, KOSAKI Motohiro wrote: (4/5/13 4:33 PM), Cody P Schafer wrote: No off-cpu users of the percpu pagesets exist. zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a percpu pageset based on a zone's ->managed_pages. We don't need to drain the entire percpu pageset just to modify these fields. Avoid calling setup_pageset() (and the draining required to call it) and instead just set the fields' values. This does change the behavior of zone_pcp_update() as the percpu pagesets will not be drained when zone_pcp_update() is called (they will end up being shrunk, not completely drained, later when a 0-order page is freed in free_hot_cold_page()). Signed-off-by: Cody P Schafer NAK. 1) zone_pcp_update() is only used from memory hotplug and it requires page drain. I'm looking at this code because I'm currently working on a patchset which adds another interface which modifies zone sizes, so "only used from memory hotplug" is a temporary thing (unless I discover that zone_pcp_update() is not intended to do what I want it to do). maybe yes, maybe no. I don't know whether it's temporary or not. However the fact is, you must not break anywhere. You need to look at all callers, always. Right, which is why I want to understand memory hotplug's actual requirements. 2) stop_machine() is used for avoiding race. just removing it is insane. What race? Is there a cross cpu access to ->high & ->batch that makes using on_each_cpu() instead of stop_machine() inappropriate? It is absolutely not just being removed. OK, I missed that. However, your code is still wrong: you can't call free_pcppages_bulk() from interrupt context, and then you can't use on_each_cpu() anyway. Given drain_pages()'s implementation, I find that hard to believe (it uses on_each_cpu_mask() and eventually calls free_pcppages_bulk()). Can you provide a reference backing up your statement? If this turns out to be an issue, schedule_on_each_cpu() could be an alternative.
Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.
On 04/08/2013 10:28 AM, Cody P Schafer wrote: On 04/08/2013 05:20 AM, Gilad Ben-Yossef wrote: On Fri, Apr 5, 2013 at 11:33 PM, Cody P Schafer wrote: In free_hot_cold_page(), we rely on pcp->batch remaining stable. Updating it without being on the cpu owning the percpu pageset potentially destroys this stability. Change for_each_cpu() to on_each_cpu() to fix. Are you referring to this? - This was the case I noticed. if (pcp->count >= pcp->high) { free_pcppages_bulk(zone, pcp->batch, pcp); pcp->count -= pcp->batch; } I'm probably missing the obvious but won't it be simpler to do this in free_hot_cold_page() - if (pcp->count >= pcp->high) { unsigned int batch = ACCESS_ONCE(pcp->batch); free_pcppages_bulk(zone, batch, pcp); pcp->count -= batch; } Potentially, yes. Note that this was simply the one case I noticed, rather than certainly the only case. I also wonder whether there could be unexpected interactions between ->high and ->batch not changing together atomically. For example, could adjusting this knob cause ->batch to rise enough that it is greater than the previous ->high? If the code above then runs with the previous ->high, ->count wouldn't be correct (checking this inside free_pcppages_bulk() might help on this one issue). Now the batch value used is stable and you don't have to IPI every CPU in the system just to change a config knob... Is this really considered an issue? I wouldn't have expected someone to adjust the config knob often enough (or even more than once) to cause problems. Of course as an "it'd be nice" thing, I completely agree. Would using schedule_on_each_cpu() instead of on_each_cpu() be an improvement, in your opinion?
Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch
On 04/08/2013 12:08 PM, KOSAKI Motohiro wrote: > (4/8/13 1:16 PM), Cody P Schafer wrote: >> On 04/07/2013 08:23 AM, KOSAKI Motohiro wrote: >>> (4/5/13 4:33 PM), Cody P Schafer wrote: >>>> In one case while modifying the ->high and ->batch fields of per cpu >>>> pagesets >>>> we're needlessly using stop_machine() (patches 1 & 2), and in another we >>>> don't have any >>>> synchronization at all (patch 3). >>>> >>>> This patchset fixes both of them. >>>> >>>> Note that it results in a change to the behavior of zone_pcp_update(), >>>> which is >>>> used by memory_hotplug. I _think_ that I've discerned (and preserved) the >>>> essential behavior (changing ->high and ->batch), and only eliminated >>>> unneeded >>>> actions (draining the per cpu pages), but this may not be the case. >>> >>> at least, memory hotplug needs to drain. >> >> Could you explain why the drain is required here? From what I can tell, >> after the stop_machine() completes, the per cpu page sets could be >> repopulated at any point, making the combination of draining and >> modifying ->batch & ->high unneeded. > > Then, memory hotplug tries to drain again and again. Moreover, hotplug prevents > repopulation > by using MIGRATE_ISOLATE. > The pcp page count would never reach 0, and that prevents memory hot remove. zone_pcp_update() is not part of the drain retry loop; in the offline pages case, it is only called following success in "removal" (online pages also calls zone_pcp_update(); I don't see how it could benefit in any way from draining the per cpu pagesets). So I still don't see where the need for draining pages combined with modifying ->high and ->batch came from.
Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()
On 04/08/2013 03:18 PM, KOSAKI Motohiro wrote: (4/8/13 3:49 PM), Cody P Schafer wrote: > If this turns out to be an issue, schedule_on_each_cpu() could be an alternative. no way. schedule_on_each_cpu() is more problematic and it should be removed in the future. schedule_on_each_cpu() can only be used when the caller task doesn't hold any lock; otherwise it may deadlock. I wasn't aware of that. Just to be clear, the deadlock you're referring to isn't the same one referred to in commit b71ab8c2025caef8db719aa41af0ed735dc543cd Author: Tejun Heo Date: Tue Jun 29 10:07:14 2010 +0200 workqueue: increase max_active of keventd and kill current_is_keventd() and commit 65a64464349883891e21e74af16c05d6e1eeb4e9 Author: Andi Kleen Date: Wed Oct 14 06:22:47 2009 +0200 HWPOISON: Allow schedule_on_each_cpu() from keventd If you're referencing some other deadlock, could you please provide a link to the relevant discussion? (I'd really like to add a note to schedule_on_each_cpu()'s doc comment about it so others can avoid that pitfall).
Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.
On 04/08/2013 11:06 PM, Gilad Ben-Yossef wrote: On Tue, Apr 9, 2013 at 9:03 AM, Gilad Ben-Yossef wrote: I also wonder whether there could be unexpected interactions between ->high and ->batch not changing together atomically. For example, could adjusting this knob cause ->batch to rise enough that it is greater than the previous ->high? If the code above then runs with the previous ->high, ->count wouldn't be correct (checking this inside free_pcppages_bulk() might help on this one issue). You are right, but that can be treated in setup_pagelist_highmark(), e.g.: static void setup_pagelist_highmark(struct per_cpu_pageset *p, unsigned long high) { struct per_cpu_pages *pcp; unsigned int batch; pcp = &p->pcp; /* We're about to mess with PCP in a non-atomic fashion. Put an intermediate safe value of batch and make sure it is visible before any other change */ pcp->batch = 1UL; smp_mb(); pcp->high = high; and I think I missed another needed barrier here: smp_mb(); batch = max(1UL, high/4); if ((high/4) > (PAGE_SHIFT * 8)) batch = PAGE_SHIFT * 8; pcp->batch = batch; } Yep, that appears to work, provided no additional users of ->batch and ->high show up. It seems we'll also need some locking to prevent concurrent updaters, but that is relatively lightweight. I'll roll up a new patchset that uses this methodology.
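Collecting the protocol into one helper, as the v3 series below ends up doing ("consolidate ->batch & ->high update protocol into a single function"), might look roughly like this user-space sketch (toy barrier and type; the kernel side would use smp_wmb() on struct per_cpu_pages, with the update-side mutex held):

struct toy_pcp {
	unsigned long count, high, batch;
};

/* toy write barrier; the kernel uses smp_wmb() */
#define toy_wmb() __sync_synchronize()

/*
 * Safe ->high/->batch transition: batch is pinned at 1 while high
 * moves, so a reader can never pair a large batch with a stale high.
 * Concurrent updaters are assumed to be serialized by a mutex.
 */
static void toy_pageset_update(struct toy_pcp *pcp,
			       unsigned long high, unsigned long batch)
{
	pcp->batch = 1;		/* safe intermediate value */
	toy_wmb();

	pcp->high = high;	/* new high becomes visible while batch == 1 */
	toy_wmb();

	pcp->batch = batch;	/* expose the real batch value last */
}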
[PATCH v2 00/10] mm: fixup changers of per cpu pageset's ->high and ->batch
"Problems" with the current code: 1. there is a lack of synchronization in setting ->high and ->batch in percpu_pagelist_fraction_sysctl_handler() 2. stop_machine() in zone_pcp_update() is unnecissary. 3. zone_pcp_update() does not consider the case where percpu_pagelist_fraction is non-zero To fix: 1. add memory barriers, a safe ->batch value, and an update side mutex when updating ->high and ->batch 2. avoid draining pages in zone_pcp_update(), rely upon the memory barriers added to fix #1 3. factor out quite a few functions, and then call the appropriate one. Note that it results in a change to the behavior of zone_pcp_update(), which is used by memory_hotplug. I'm rather certain that I've diserned (and preserved) the essential behavior (changing ->high and ->batch), and only eliminated unneeded actions (draining the per cpu pages), but this may not be the case. Further note that the draining of pages that previously took place in zone_pcp_update() occured after repeated draining when attempting to offline a page, and after the offline has "succeeded". It appears that the draining was added to zone_pcp_update() to avoid refactoring setup_pageset() into 2 funtions. -- Changes since v1: - instead of using on_each_cpu(), use memory barriers (Gilad) and an update side mutex. - add "Problem" #3 above, and fix. - rename function to match naming style of similar function - move unrelated comment Cody P Schafer (10): mm/page_alloc: factor out setting of pcp->high and pcp->batch. mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high mm/page_alloc: insert memory barriers to allow async update of pcp batch and high mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine() mm/page_alloc: when handling percpu_pagelist_fraction, don't unneedly recalulate high mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch() mm/page_alloc: relocate comment to be directly above code it refers to. mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset() mm/page_alloc: in zone_pcp_update(), uze zone_pageset_init() mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch() mm/page_alloc.c | 124 +--- 1 file changed, 73 insertions(+), 51 deletions(-) -- 1.8.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 01/10] mm/page_alloc: factor out setting of pcp->high and pcp->batch.
Creates pageset_set_batch() for use in setup_pageset(). pageset_set_batch() imitates the functionality of setup_pagelist_highmark(), but uses the boot time (percpu_pagelist_fraction == 0) calculations for determining ->high based on ->batch. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8fcced7..5877cf0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4004,6 +4004,14 @@ static int __meminit zone_batchsize(struct zone *zone) #endif } +/* a companion to setup_pagelist_highmark() */ +static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) +{ + struct per_cpu_pages *pcp = &p->pcp; + pcp->high = 6 * batch; + pcp->batch = max(1UL, 1 * batch); +} + static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) { struct per_cpu_pages *pcp; @@ -4013,8 +4021,7 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) pcp = &p->pcp; pcp->count = 0; - pcp->high = 6 * batch; - pcp->batch = max(1UL, 1 * batch); + pageset_set_batch(p, batch); for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) INIT_LIST_HEAD(&pcp->lists[migratetype]); } @@ -4023,7 +4030,6 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist * to the value high for the pageset p. */ - static void setup_pagelist_highmark(struct per_cpu_pageset *p, unsigned long high) { -- 1.8.2
[PATCH v2 07/10] mm/page_alloc: relocate comment to be directly above code it refers to.
Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a2f2207..6e52e67 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3680,12 +3680,12 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone) mminit_verify_zonelist(); cpuset_init_current_mems_allowed(); } else { - /* we have to stop all cpus to guarantee there is no user - of zonelist */ #ifdef CONFIG_MEMORY_HOTPLUG if (zone) setup_zone_pageset(zone); #endif + /* we have to stop all cpus to guarantee there is no user + of zonelist */ stop_machine(__build_all_zonelists, pgdat, NULL); /* cpuset refresh routine should be here */ } -- 1.8.2
[PATCH v2 10/10] mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch()
Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 334387e..66b8bc2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4018,7 +4018,7 @@ static void pageset_update_prep(struct per_cpu_pages *pcp) smp_wmb(); } -/* a companion to setup_pagelist_highmark() */ +/* a companion to pageset_set_high() */ static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) { struct per_cpu_pages *pcp = &p->pcp; @@ -4050,10 +4050,10 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) } /* - * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist + * pageset_set_high() sets the high water mark for hot per_cpu_pagelist * to the value high for the pageset p. */ -static void setup_pagelist_highmark(struct per_cpu_pageset *p, +static void pageset_set_high(struct per_cpu_pageset *p, unsigned long high) { struct per_cpu_pages *pcp; @@ -4075,7 +4075,7 @@ static void __meminit zone_pageset_init(struct zone *zone, int cpu) pageset_init(pcp); if (percpu_pagelist_fraction) - setup_pagelist_highmark(pcp, + pageset_set_high(pcp, (zone->managed_pages / percpu_pagelist_fraction)); else @@ -5526,8 +5526,8 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, unsigned long high; high = zone->managed_pages / percpu_pagelist_fraction; for_each_possible_cpu(cpu) - setup_pagelist_highmark( - per_cpu_ptr(zone->pageset, cpu), high); + pageset_set_high(per_cpu_ptr(zone->pageset, cpu), +high); } mutex_unlock(&pcp_batch_high_lock); return 0; -- 1.8.2
[PATCH v2 08/10] mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset()
Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 27 +++++++++++++++------------ 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6e52e67..c663e62 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4069,22 +4069,25 @@ static void setup_pagelist_highmark(struct per_cpu_pageset *p, pcp->batch = PAGE_SHIFT * 8; } +static void __meminit zone_pageset_init(struct zone *zone, int cpu) +{ + struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu); + + pageset_init(pcp); + if (percpu_pagelist_fraction) + setup_pagelist_highmark(pcp, + (zone->managed_pages / + percpu_pagelist_fraction)); + else + pageset_set_batch(pcp, zone_batchsize(zone)); +} + static void __meminit setup_zone_pageset(struct zone *zone) { int cpu; - zone->pageset = alloc_percpu(struct per_cpu_pageset); - - for_each_possible_cpu(cpu) { - struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu); - - setup_pageset(pcp, zone_batchsize(zone)); - - if (percpu_pagelist_fraction) - setup_pagelist_highmark(pcp, - (zone->managed_pages / - percpu_pagelist_fraction)); - } + for_each_possible_cpu(cpu) + zone_pageset_init(zone, cpu); } /* -- 1.8.2
[PATCH v2 06/10] mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch()
Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 50a277a..a2f2207 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4030,7 +4030,7 @@ static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) pcp->batch = max(1UL, 1 * batch); } -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) +static void pageset_init(struct per_cpu_pageset *p) { struct per_cpu_pages *pcp; int migratetype; @@ -4039,11 +4039,16 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) pcp = &p->pcp; pcp->count = 0; - pageset_set_batch(p, batch); for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) INIT_LIST_HEAD(&pcp->lists[migratetype]); } +static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) +{ + pageset_init(p); + pageset_set_batch(p, batch); +} + /* * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist * to the value high for the pageset p. -- 1.8.2
[PATCH v2 09/10] mm/page_alloc: in zone_pcp_update(), use zone_pageset_init()
Previously, zone_pcp_update() called pageset_set_batch() directly, essentially assuming that percpu_pagelist_fraction == 0. Correct this by calling zone_pageset_init(), which chooses the appropriate ->batch and ->high calculations. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c663e62..334387e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6025,11 +6025,9 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages) void __meminit zone_pcp_update(struct zone *zone) { unsigned cpu; - unsigned long batch; mutex_lock(&pcp_batch_high_lock); - batch = zone_batchsize(zone); for_each_possible_cpu(cpu) - pageset_set_batch(per_cpu_ptr(zone->pageset, cpu), batch); + zone_pageset_init(zone, cpu); mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.2
[PATCH v2 05/10] mm/page_alloc: when handling percpu_pagelist_fraction, don't needlessly recalculate high
Simply moves calculation of the new 'high' value outside the for_each_possible_cpu() loop, as it does not depend on the cpu. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4a03c56..50a277a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5502,7 +5502,6 @@ int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write, * cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist * can have before it gets flushed back to buddy allocator. */ - int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { @@ -5516,12 +5515,11 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, mutex_lock(&pcp_batch_high_lock); for_each_populated_zone(zone) { - for_each_possible_cpu(cpu) { - unsigned long high; - high = zone->managed_pages / percpu_pagelist_fraction; + unsigned long high; + high = zone->managed_pages / percpu_pagelist_fraction; + for_each_possible_cpu(cpu) setup_pagelist_highmark( - per_cpu_ptr(zone->pageset, cpu), high); - } + per_cpu_ptr(zone->pageset, cpu), high); } mutex_unlock(&pcp_batch_high_lock); return 0; -- 1.8.2
[PATCH v2 02/10] mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high
Because we are going to rely upon a careful transition between old and new ->high and ->batch values using memory barriers and will remove stop_machine(), we need to prevent multiple updaters from interweaving their memory writes. Add a simple mutex to protect both update loops. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5877cf0..d259599 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -64,6 +64,9 @@ #include #include "internal.h" +/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ +static DEFINE_MUTEX(pcp_batch_high_lock); + #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID DEFINE_PER_CPU(int, numa_node); EXPORT_PER_CPU_SYMBOL(numa_node); @@ -5491,6 +5494,8 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, ret = proc_dointvec_minmax(table, write, buffer, length, ppos); if (!write || (ret < 0)) return ret; + + mutex_lock(&pcp_batch_high_lock); for_each_populated_zone(zone) { for_each_possible_cpu(cpu) { unsigned long high; @@ -5499,6 +5504,7 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, per_cpu_ptr(zone->pageset, cpu), high); } } + mutex_unlock(&pcp_batch_high_lock); return 0; } @@ -6012,7 +6018,9 @@ static int __meminit __zone_pcp_update(void *data) void __meminit zone_pcp_update(struct zone *zone) { + mutex_lock(&pcp_batch_high_lock); stop_machine(__zone_pcp_update, zone, NULL); + mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.2
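As an aside on what the mutex buys: without it, two updaters could interleave their stores and leave a (high, batch) pair that neither of them wrote. A user-space sketch of the usage pattern (a pthread mutex standing in for pcp_batch_high_lock; readers never take the lock, they are covered by the barrier protocol of the next patch):

#include <pthread.h>

struct toy_pcp {
	unsigned long high, batch;
};

/* stand-in for pcp_batch_high_lock: serializes updaters only */
static pthread_mutex_t toy_batch_high_lock = PTHREAD_MUTEX_INITIALIZER;

static void toy_update_high_batch(struct toy_pcp *pcp,
				  unsigned long high, unsigned long batch)
{
	pthread_mutex_lock(&toy_batch_high_lock);
	/*
	 * Both stores happen under the lock, so concurrent updaters
	 * cannot interweave and publish a mixed (high, batch) pair.
	 */
	pcp->high = high;
	pcp->batch = batch;
	pthread_mutex_unlock(&toy_batch_high_lock);
}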
[PATCH v2 03/10] mm/page_alloc: insert memory barriers to allow async update of pcp batch and high
In pageset_set_batch() and setup_pagelist_highmark(), ensure that batch is always set to a safe value (1) prior to updating high, and ensure that high is fully updated before setting the real value of batch. Suggested by Gilad Ben-Yossef in this thread: https://lkml.org/lkml/2013/4/9/23 Also reproduces his proposed comment. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d259599..a07bd4c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4007,11 +4007,26 @@ static int __meminit zone_batchsize(struct zone *zone) #endif } +static void pageset_update_prep(struct per_cpu_pages *pcp) +{ + /* +* We're about to mess with PCP in a non-atomic fashion. Put an +* intermediate safe value of batch and make sure it is visible before +* any other change +*/ + pcp->batch = 1; + smp_wmb(); +} + /* a companion to setup_pagelist_highmark() */ static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) { struct per_cpu_pages *pcp = &p->pcp; + pageset_update_prep(pcp); + pcp->high = 6 * batch; + smp_wmb(); + pcp->batch = max(1UL, 1 * batch); } @@ -4039,7 +4054,11 @@ static void setup_pagelist_highmark(struct per_cpu_pageset *p, struct per_cpu_pages *pcp; pcp = &p->pcp; + pageset_update_prep(pcp); + pcp->high = high; + smp_wmb(); + pcp->batch = max(1UL, high/4); if ((high/4) > (PAGE_SHIFT * 8)) pcp->batch = PAGE_SHIFT * 8; -- 1.8.2
[PATCH v2 04/10] mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine()
zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a percpu pageset based on a zone's ->managed_pages. We don't need to drain the entire percpu pageset just to modify these fields. This lets us avoid calling setup_pageset() (and the draining required to call it) and instead allows simply setting the fields' values (with some attention paid to memory barriers to prevent the relationship between ->batch and ->high from being thrown off). This does change the behavior of zone_pcp_update() as the percpu pagesets will not be drained when zone_pcp_update() is called (they will end up being shrunk, not completely drained, later when a 0-order page is freed in free_hot_cold_page()). Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 33 +++++++++------------------------ 1 file changed, 9 insertions(+), 24 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a07bd4c..4a03c56 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6012,33 +6012,18 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages) #endif #ifdef CONFIG_MEMORY_HOTPLUG -static int __meminit __zone_pcp_update(void *data) -{ - struct zone *zone = data; - int cpu; - unsigned long batch = zone_batchsize(zone), flags; - - for_each_possible_cpu(cpu) { - struct per_cpu_pageset *pset; - struct per_cpu_pages *pcp; - - pset = per_cpu_ptr(zone->pageset, cpu); - pcp = &pset->pcp; - - local_irq_save(flags); - if (pcp->count > 0) - free_pcppages_bulk(zone, pcp->count, pcp); - drain_zonestat(zone, pset); - setup_pageset(pset, batch); - local_irq_restore(flags); - } - return 0; -} - +/* + * The zone indicated has a new number of managed_pages; batch sizes and percpu + * page high values need to be recalculated. + */ void __meminit zone_pcp_update(struct zone *zone) { + unsigned cpu; + unsigned long batch; mutex_lock(&pcp_batch_high_lock); - stop_machine(__zone_pcp_update, zone, NULL); + batch = zone_batchsize(zone); + for_each_possible_cpu(cpu) + pageset_set_batch(per_cpu_ptr(zone->pageset, cpu), batch); mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.2
[PATCH v3 02/11] mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high
Because we are going to rely upon a careful transition between old and new ->high and ->batch values using memory barriers and will remove stop_machine(), we need to prevent multiple updaters from interweaving their memory writes. Add a simple mutex to protect both update loops. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5877cf0..d259599 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -64,6 +64,9 @@ #include #include "internal.h" +/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ +static DEFINE_MUTEX(pcp_batch_high_lock); + #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID DEFINE_PER_CPU(int, numa_node); EXPORT_PER_CPU_SYMBOL(numa_node); @@ -5491,6 +5494,8 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, ret = proc_dointvec_minmax(table, write, buffer, length, ppos); if (!write || (ret < 0)) return ret; + + mutex_lock(&pcp_batch_high_lock); for_each_populated_zone(zone) { for_each_possible_cpu(cpu) { unsigned long high; @@ -5499,6 +5504,7 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, per_cpu_ptr(zone->pageset, cpu), high); } } + mutex_unlock(&pcp_batch_high_lock); return 0; } @@ -6012,7 +6018,9 @@ static int __meminit __zone_pcp_update(void *data) void __meminit zone_pcp_update(struct zone *zone) { + mutex_lock(&pcp_batch_high_lock); stop_machine(__zone_pcp_update, zone, NULL); + mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.2.1
[PATCH v3 00/11] mm: fixup changers of per cpu pageset's ->high and ->batch
"Problems" with the current code: 1. there is a lack of synchronization in setting ->high and ->batch in percpu_pagelist_fraction_sysctl_handler() 2. stop_machine() in zone_pcp_update() is unnecissary. 3. zone_pcp_update() does not consider the case where percpu_pagelist_fraction is non-zero To fix: 1. add memory barriers, a safe ->batch value, an update side mutex when updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users that expect a stable value. 2. avoid draining pages in zone_pcp_update(), rely upon the memory barriers added to fix #1 3. factor out quite a few functions, and then call the appropriate one. Note that it results in a change to the behavior of zone_pcp_update(), which is used by memory_hotplug. I'm rather certain that I've diserned (and preserved) the essential behavior (changing ->high and ->batch), and only eliminated unneeded actions (draining the per cpu pages), but this may not be the case. Further note that the draining of pages that previously took place in zone_pcp_update() occured after repeated draining when attempting to offline a page, and after the offline has "succeeded". It appears that the draining was added to zone_pcp_update() to avoid refactoring setup_pageset() into 2 funtions. -- Since v2: https://lkml.org/lkml/2013/4/9/718 - note ACCESS_ONCE() in fix #1 above. - consolidate ->batch & ->high update protocol into a single funtion (Gilad). - add missing ACCESS_ONCE() on ->batch Since v1: https://lkml.org/lkml/2013/4/5/444 - instead of using on_each_cpu(), use memory barriers (Gilad) and an update side mutex. - add "Problem" #3 above, and fix. - rename function to match naming style of similar function - move unrelated comment -- Cody P Schafer (11): mm/page_alloc: factor out setting of pcp->high and pcp->batch. mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high mm/page_alloc: insert memory barriers to allow async update of pcp batch and high mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine() mm/page_alloc: when handling percpu_pagelist_fraction, don't unneedly recalulate high mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch() mm/page_alloc: relocate comment to be directly above code it refers to. mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset() mm/page_alloc: in zone_pcp_update(), uze zone_pageset_init() mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch() mm/page_alloc.c | 151 +--- 1 file changed, 90 insertions(+), 61 deletions(-) -- 1.8.2.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 10/11] mm/page_alloc: in zone_pcp_update(), use zone_pageset_init()
Previously, zone_pcp_update() called pageset_set_batch() directly, essentially assuming that percpu_pagelist_fraction == 0. Correct this by calling zone_pageset_init(), which chooses the appropriate ->batch and ->high calculations. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 749b6e1..5ee5ce9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6032,11 +6032,9 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages) void __meminit zone_pcp_update(struct zone *zone) { unsigned cpu; - unsigned long batch; mutex_lock(&pcp_batch_high_lock); - batch = zone_batchsize(zone); for_each_possible_cpu(cpu) - pageset_set_batch(per_cpu_ptr(zone->pageset, cpu), batch); + zone_pageset_init(zone, cpu); mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.2.1