[PATCH] tools/perf: don't include libio.h in arch's dwarf-regs.c
Including libio.h causes build failures on uClibc systems (which lack libio.h). It appears that libio.h was only included to pull in a definition for NULL, so it has been replaced by stddef.h. On powerpc, libio.h was conditionally included, but could be removed completely as it is unneeded. Also, the include of stdlib.h was changed to stddef.h (as again, only NULL is needed). Build tested on powerpc and x86. Signed-off-by: Cody P Schafer --- tools/perf/arch/powerpc/util/dwarf-regs.c | 5 + tools/perf/arch/s390/util/dwarf-regs.c| 2 +- tools/perf/arch/sh/util/dwarf-regs.c | 2 +- tools/perf/arch/sparc/util/dwarf-regs.c | 2 +- tools/perf/arch/x86/util/dwarf-regs.c | 2 +- 5 files changed, 5 insertions(+), 8 deletions(-) diff --git a/tools/perf/arch/powerpc/util/dwarf-regs.c b/tools/perf/arch/powerpc/util/dwarf-regs.c index 7cdd61d..733151c 100644 --- a/tools/perf/arch/powerpc/util/dwarf-regs.c +++ b/tools/perf/arch/powerpc/util/dwarf-regs.c @@ -9,10 +9,7 @@ * 2 of the License, or (at your option) any later version. */ -#include <stdlib.h> -#ifndef __UCLIBC__ -#include <libio.h> -#endif +#include <stddef.h> #include <dwarf-regs.h> diff --git a/tools/perf/arch/s390/util/dwarf-regs.c b/tools/perf/arch/s390/util/dwarf-regs.c index e19653e..0469df0 100644 --- a/tools/perf/arch/s390/util/dwarf-regs.c +++ b/tools/perf/arch/s390/util/dwarf-regs.c @@ -6,7 +6,7 @@ * */ -#include <libio.h> +#include <stddef.h> #include <dwarf-regs.h> #define NUM_GPRS 16 diff --git a/tools/perf/arch/sh/util/dwarf-regs.c b/tools/perf/arch/sh/util/dwarf-regs.c index a11edb0..0d0897f 100644 --- a/tools/perf/arch/sh/util/dwarf-regs.c +++ b/tools/perf/arch/sh/util/dwarf-regs.c @@ -19,7 +19,7 @@ * */ -#include <libio.h> +#include <stddef.h> #include <dwarf-regs.h> /* diff --git a/tools/perf/arch/sparc/util/dwarf-regs.c b/tools/perf/arch/sparc/util/dwarf-regs.c index 0ab8848..92eda41 100644 --- a/tools/perf/arch/sparc/util/dwarf-regs.c +++ b/tools/perf/arch/sparc/util/dwarf-regs.c @@ -9,7 +9,7 @@ * 2 of the License, or (at your option) any later version. */ -#include <libio.h> +#include <stddef.h> #include <dwarf-regs.h> #define SPARC_MAX_REGS 96 diff --git a/tools/perf/arch/x86/util/dwarf-regs.c b/tools/perf/arch/x86/util/dwarf-regs.c index a794d30..be22dd4 100644 --- a/tools/perf/arch/x86/util/dwarf-regs.c +++ b/tools/perf/arch/x86/util/dwarf-regs.c @@ -20,7 +20,7 @@ * */ -#include <libio.h> +#include <stddef.h> #include <dwarf-regs.h> /* -- 1.8.1.1
Re: [PATCH 6/9] mm/page_alloc: add informative debugging message in page_outside_zone_boundaries()
On 02/01/2013 04:29 PM, Andrew Morton wrote: On Fri, 1 Feb 2013 16:28:48 -0800 Andrew Morton wrote: + if (ret) + pr_debug("page %lu outside zone [ %lu - %lu ]\n", + pfn, start_pfn, start_pfn + sp); + return ret; } As this condition leads to a VM_BUG_ON(), "pr_debug" seems rather wimpy and I doubt if we need to be concerned about flooding the console. I'll switch it to pr_err. otoh, as nobody has ever hit that VM_BUG_ON() (yes?), do we really need the patch? I've hit this bug while developing some code that moves pages between zones. As it helped me debug that issue with my own code, I could see how another developer might be helped by it.
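For context, a minimal sketch of how the whole check reads once the pr_debug -> pr_err swap is applied. This is reconstructed from the fragments quoted above plus the zone-span seqlock accessors in include/linux/memory_hotplug.h, so details may differ slightly from the final commit:

static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
	int ret = 0;
	unsigned seq;
	unsigned long pfn = page_to_pfn(page);
	unsigned long sp, start_pfn;

	do {
		/* retry if memory hotplug resized the zone under us */
		seq = zone_span_seqbegin(zone);
		start_pfn = zone->zone_start_pfn;
		sp = zone->spanned_pages;
		if (pfn < start_pfn || pfn >= start_pfn + sp)
			ret = 1;
	} while (zone_span_seqretry(zone, seq));

	if (ret)
		pr_err("page %lu outside zone [ %lu - %lu ]\n",
			pfn, start_pfn, start_pfn + sp);

	return ret;
}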
Re: [PATCH 1/9] mm: add SECTION_IN_PAGE_FLAGS
On 02/01/2013 04:20 PM, Andrew Morton wrote: On Thu, 17 Jan 2013 14:52:53 -0800 Cody P Schafer wrote: Instead of directly utilizing a combination of config options to determine this, add a macro to specifically address it. ... --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -625,6 +625,10 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) #define NODE_NOT_IN_PAGE_FLAGS #endif +#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) +#define SECTION_IN_PAGE_FLAGS +#endif We could do this in Kconfig itself, in the definition of a new CONFIG_SECTION_IN_PAGE_FLAGS. Yep, I only put it here because it "sounds" similar to NODE_NOT_IN_PAGE_FLAGS, but (of course) NODE_NOT_IN_PAGE_FLAGS isn't defined based on pure dependencies, while this is. I'm not sure that I like that sort of thing a lot though - it's rather a pain to have to switch from .[ch] over to Kconfig to find the definitions of things. I should get off my tail and teach my ctags scripts to handle this.
Re: [PATCH v2 0/9] mm: zone & pgdat accessors plus some cleanup
On 02/01/2013 04:39 PM, Andrew Morton wrote: On Thu, 17 Jan 2013 14:52:52 -0800 Cody P Schafer wrote: Summaries: 1 - avoid repeating checks for section in page flags by adding a define. 2 - add & switch to zone_end_pfn() and zone_spans_pfn() 3 - adds zone_is_initialized() & zone_is_empty() 4 - adds a VM_BUG using zone_is_initialized() in __free_one_page() 5 - add pgdat_end_pfn() and pgdat_is_empty() 6 - add debugging message to VM_BUG check. 7 - add ensure_zone_is_initialized() (for memory_hotplug) 8 - use the above addition in memory_hotplug 9 - use pgdat_end_pfn() Well that's a nice little patchset. Some of the patches were marked From:c...@linux.vnet.ibm.com and others were From:jmes...@gmail.com. This is strange. If you want me to fix that up, please let me know which is preferred. They should all be "From:c...@linux.vnet.ibm.com"; the other address was me messing up gitconfig (which I've since fixed). As a general concern: spanned_pages & start_pfn (in pgdat & zone) are supposed to be locked (via a seqlock) when read (due to changes to them via memory_hotplug), but very few (only 1?) of their users appear to actually lock them. OK, thanks. Perhaps this is something which the memory-hotplug developers could take a look at? Yep. It's not immediately clear that not locking on read would do terrible things, but at the least the documentation needs fixing and explanation as to why the locking is not used some (or all) places.
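For reference, the seqlock in question is zone->span_seqlock, with accessors in include/linux/memory_hotplug.h. A rough sketch of the intended pairing, assuming CONFIG_MEMORY_HOTPLUG (the writer side is what memory hotplug does when resizing a zone; the reader side is what most readers of these fields currently skip; new_* names are illustrative):

/* writer (memory hotplug resizing a zone): */
zone_span_writelock(zone);
zone->zone_start_pfn = new_start_pfn;
zone->spanned_pages = new_spanned_pages;
zone_span_writeunlock(zone);

/* reader: */
unsigned seq;
unsigned long start_pfn, spanned;
do {
	seq = zone_span_seqbegin(zone);
	start_pfn = zone->zone_start_pfn;
	spanned = zone->spanned_pages;
} while (zone_span_seqretry(zone, seq));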
[PATCH 1/3] mm/page_alloc: factor out setting of pcp->high and pcp->batch.
Creates pageset_set_batch() for use in setup_pageset(). pageset_set_batch() imitates the functionality of setup_pagelist_highmark(), but uses the boot time (percpu_pagelist_fraction == 0) calculations for determining ->high based on ->batch. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8fcced7..5877cf0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4004,6 +4004,14 @@ static int __meminit zone_batchsize(struct zone *zone) #endif } +/* a companion to setup_pagelist_highmark() */ +static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) +{ + struct per_cpu_pages *pcp = &p->pcp; + pcp->high = 6 * batch; + pcp->batch = max(1UL, 1 * batch); +} + static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) { struct per_cpu_pages *pcp; @@ -4013,8 +4021,7 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) pcp = &p->pcp; pcp->count = 0; - pcp->high = 6 * batch; - pcp->batch = max(1UL, 1 * batch); + pageset_set_batch(p, batch); for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) INIT_LIST_HEAD(&pcp->lists[migratetype]); } @@ -4023,7 +4030,6 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist * to the value high for the pageset p. */ - static void setup_pagelist_highmark(struct per_cpu_pageset *p, unsigned long high) { -- 1.8.2
[PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()
No off-cpu users of the percpu pagesets exist. zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a percpu pageset based on a zone's ->managed_pages. We don't need to drain the entire percpu pageset just to modify these fields. Avoid calling setup_pageset() (and the draining required to call it) and instead just set the fields' values. This does change the behavior of zone_pcp_update() as the percpu pagesets will not be drained when zone_pcp_update() is called (they will end up being shrunk, not completely drained, later when a 0-order page is freed in free_hot_cold_page()). Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 30 ++ 1 file changed, 10 insertions(+), 20 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5877cf0..48f2faa 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5987,32 +5987,22 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages) #endif #ifdef CONFIG_MEMORY_HOTPLUG -static int __meminit __zone_pcp_update(void *data) +static void __meminit __zone_pcp_update(void *data) { struct zone *zone = data; - int cpu; - unsigned long batch = zone_batchsize(zone), flags; - - for_each_possible_cpu(cpu) { - struct per_cpu_pageset *pset; - struct per_cpu_pages *pcp; - - pset = per_cpu_ptr(zone->pageset, cpu); - pcp = &pset->pcp; - - local_irq_save(flags); - if (pcp->count > 0) - free_pcppages_bulk(zone, pcp->count, pcp); - drain_zonestat(zone, pset); - setup_pageset(pset, batch); - local_irq_restore(flags); - } - return 0; + unsigned long batch = zone_batchsize(zone); + struct per_cpu_pageset *pset = + per_cpu_ptr(zone->pageset, smp_processor_id()); + pageset_set_batch(pset, batch); } +/* + * The zone indicated has a new number of managed_pages; batch sizes and percpu + * page high values need to be recalculated. + */ void __meminit zone_pcp_update(struct zone *zone) { - stop_machine(__zone_pcp_update, zone, NULL); + on_each_cpu(__zone_pcp_update, zone, true); } #endif -- 1.8.2
[PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.
In free_hot_cold_page(), we rely on pcp->batch remaining stable. Updating it without being on the cpu owning the percpu pageset potentially destroys this stability. Change the for_each_possible_cpu() loop to use on_each_cpu() to fix. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 48f2faa..507db31 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5475,30 +5475,31 @@ int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write, return 0; } +static void _zone_set_pageset_highmark(void *data) +{ + struct zone *zone = data; + unsigned long high; + high = zone->managed_pages / percpu_pagelist_fraction; + setup_pagelist_highmark( + per_cpu_ptr(zone->pageset, smp_processor_id()), high); +} + /* * percpu_pagelist_fraction - changes the pcp->high for each zone on each * cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist * can have before it gets flushed back to buddy allocator. */ - int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { struct zone *zone; - unsigned int cpu; int ret; ret = proc_dointvec_minmax(table, write, buffer, length, ppos); if (!write || (ret < 0)) return ret; - for_each_populated_zone(zone) { - for_each_possible_cpu(cpu) { - unsigned long high; - high = zone->managed_pages / percpu_pagelist_fraction; - setup_pagelist_highmark( - per_cpu_ptr(zone->pageset, cpu), high); - } - } + for_each_populated_zone(zone) + on_each_cpu(_zone_set_pageset_highmark, zone, true); return 0; } -- 1.8.2
[PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch
In one case while modifying the ->high and ->batch fields of per cpu pagesets we're needlessly using stop_machine() (patches 1 & 2), and in another we don't have any synchronization at all (patch 3). This patchset fixes both of them. Note that it results in a change to the behavior of zone_pcp_update(), which is used by memory_hotplug. I _think_ that I've discerned (and preserved) the essential behavior (changing ->high and ->batch), and only eliminated unneeded actions (draining the per cpu pages), but this may not be the case. -- mm/page_alloc.c | 63 +++-- 1 file changed, 30 insertions(+), 33 deletions(-)
[RFC] DNUMA: Runtime NUMA memory layout reconfiguration
These patches allow the NUMA memory layout (meaning the mapping of a page to a node) to be changed at runtime in place (without hotplugging). = Why/when is this useful? = In virtual machines (VMs) running on NUMA systems both [a] if/when the hypervisor decides to move their backing memory around (compacting, prioritizing another VMs desired layout, etc) and [b] in general for migration of VMs. The hardware is _already_ changing the NUMA layout underneath us. We have powerpc64 systems with firmware that currently move the backing memory around, and have the ability to notify Linux of new NUMA info. = Code & testing = web: https://github.com/jmesmon/linux/tree/dnuma/v26 git: https://github.com/jmesmon/linux.git dnuma/v26 commit range: 7e4f3230c9161706ebe9d37d774398082dc352de^..01e16461cf4a914feb1a34ed8dd7b28f3e842645 Some patches are marked "XXX: ...", they are only for testing or temporary documentation purposes. A debugfs interface allows the NUMA memory layout to be changed. Basically, you don't need to have weird systems to test this; in fact, I've done all my testing so far in plain old qemu-i386. A script which stripes the memory between nodes or pushes all memory to a (potentially new) node is available here: https://raw.github.com/jmesmon/trifles/master/bin/dnuma-test Related LSF/MM Topic proposal: http://permalink.gmane.org/gmane.linux.kernel.mm/95342 = How are you managing to do this? = Reconfiguration of page->node mappings is done at the page allocator level by both pulling out free pages (when a new memory layout is committed) & redirecting pages on free to their new node. Because we can't change page_node(A) while A is allocated, an rbtree holding the mapping from pfn ranges to node ids ('struct memlayout') is introduced to track the pfn->node mapping for yet-to-be-transplanted pages (a minimal sketch of this lookup follows at the end of this message). A lookup in this rbtree occurs on any page allocator path that decides which zone to free a page to. To avoid horrible performance due to rbtree lookups all the time, the rbtree is only consulted when the page is marked with a new pageflag (LookupNode). = Current Limitations = For the reconfiguration to be effective (and not make the allocator make poorer choices), updating the cpu->node mappings is also needed. This patchset does _not_ handle this. Also missing is a way to update topology (node distances), which is less fatal. These patches only work on SPARSEMEM and the node id _must_ fit in the pageflags (can't be pushed out to the section). This generally means that 32-bit platforms are out (unless you hack MAX_PHYS{ADDR,MEM}_BITS like I do for testing). This code does the reconfiguration without hotplugging memory at all (1 errant page doesn't keep us from fixing the rest of them). But it still depends on MEMORY_HOTPLUG for functions that online nodes & adjust zone/pgdat size. Things that need doing (but aren't quite bugs): - While the interface is meant to be driven via a hypervisor/firmware, that portion is not yet included. - notifier for kernel users of memory that need/want their allocations on a particular node (NODE_DATA(), for instance). - notifier for userspace. - a way to allocate things from the appropriate node prior to the page allocator being fully updated (could just be "allocate it wrong now & reallocate later"). - Make memlayout faster (potentially via per-node allocation, different data structure, or more/smarter caching). - (potentially) propagation of updated layout knowledge into kmem_caches (SL*B).
Known Bugs: - Transplant of free pages is _very_ slow due to excessive use of stop_machine() via zone_pcp_update() & build_zonelists(). On my i5 laptop, it takes ~9 minutes to stripe the layout in blocks of 256 pfns to 3 nodes on a 128MB 8 cpu x86_32 VM booted with 2 nodes. - memory leak when SLUB is used (struct kmem_cache_nodes are leaked), SLAB appears fine. - Locking of managed_pages/present_pages needs adjustment, or they need to be updated outside of the free page path. - Exported numa/memory info in sysfs isn't updated (`numactl --show` segfaults, `numactl --hardware` shows new nodes as nearly empty). - Uses pageflag setters without "owning" pages, could cause loss of pageflag updates when combined with non-atomic pageflag users in mm/*. - some strange sleeps while atomic (for me they occur when memory is moved out of all the boot nodes)
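To make the memlayout structure described above concrete, here is a minimal sketch of the pfn-range -> node lookup it implies. The identifiers (struct memlayout_range, memlayout_pfn_to_nid) are illustrative stand-ins, not the actual names in the dnuma tree:

#include <linux/rbtree.h>
#include <linux/numa.h>		/* NUMA_NO_NODE */

struct memlayout_range {
	struct rb_node node;
	unsigned long pfn_start;
	unsigned long pfn_end;	/* inclusive */
	int nid;
};

/* ranges are non-overlapping, keyed by pfn_start */
static int memlayout_pfn_to_nid(struct rb_root *root, unsigned long pfn)
{
	struct rb_node *rb = root->rb_node;

	while (rb) {
		struct memlayout_range *r =
			rb_entry(rb, struct memlayout_range, node);

		if (pfn < r->pfn_start)
			rb = rb->rb_left;
		else if (pfn > r->pfn_end)
			rb = rb->rb_right;
		else
			return r->nid;
	}
	return NUMA_NO_NODE;
}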
Re: [PATCHv5 4/8] zswap: add to mm/
On 02/18/2013 11:24 AM, Seth Jennings wrote: On 02/15/2013 10:04 PM, Ric Mason wrote: On 02/14/2013 02:38 AM, Seth Jennings wrote: +/* invalidates all pages for the given swap type */ +static void zswap_frontswap_invalidate_area(unsigned type) +{ +struct zswap_tree *tree = zswap_trees[type]; +struct rb_node *node, *next; +struct zswap_entry *entry; + +if (!tree) +return; + +/* walk the tree and free everything */ +spin_lock(&tree->lock); +node = rb_first(&tree->rbroot); +while (node) { +entry = rb_entry(node, struct zswap_entry, rbnode); +zs_free(tree->pool, entry->handle); +next = rb_next(node); +zswap_entry_cache_free(entry); +node = next; +} +tree->rbroot = RB_ROOT; Why don't we need rb_erase() for every node? We are freeing the entire tree here. try_to_unuse() in the swapoff syscall should have already emptied the tree, but this is here for completeness. rb_erase() will do things like rebalancing the tree; something that just wastes time since we are in the process of freeing the whole tree. We are holding the tree lock here so we are sure that no one else is accessing the tree while it is in this transient broken state. If we have a sub-tree like:

    ...
    /
   A
  / \
 B   C

B == rb_first(tree), A == rb_next(B), C == rb_next(A). The current code frees A (via zswap_entry_cache_free()) prior to examining C, and thus rb_next(C) results in a use-after-free of A. You can solve this by doing a post-order traversal of the tree, either a) in the destructive manner used in a number of filesystems, see fs/ubifs/orphan.c ubifs_add_orphan(), for example. b) or by doing something similar to this commit: https://github.com/jmesmon/linux/commit/d9e43aaf9e8a447d6802531d95a1767532339fad , which I've been using for some yet-to-be-merged code.
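A minimal sketch of option (a), the destructive walk, adapted to the zswap types quoted above (the function name is a placeholder; locking and the stats update are omitted):

static void zswap_tree_free_all(struct zswap_tree *tree)
{
	struct rb_node *node = tree->rbroot.rb_node;

	while (node) {
		if (node->rb_left)
			node = node->rb_left;
		else if (node->rb_right)
			node = node->rb_right;
		else {
			/* leaf: detach from parent, free, walk back up */
			struct rb_node *parent = rb_parent(node);
			struct zswap_entry *entry =
				rb_entry(node, struct zswap_entry, rbnode);

			if (parent) {
				if (parent->rb_left == node)
					parent->rb_left = NULL;
				else
					parent->rb_right = NULL;
			}
			zs_free(tree->pool, entry->handle);
			zswap_entry_cache_free(entry);
			node = parent;
		}
	}
	tree->rbroot = RB_ROOT;
}

Since a node is only freed after both of its children are gone, no freed node is ever dereferenced again.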
Re: [PATCHv6 1/8] zsmalloc: add to mm/
On Wed, Feb 20, 2013 at 04:04:41PM -0600, Seth Jennings wrote: > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c > +#define MAX(a, b) ((a) >= (b) ? (a) : (b)) > +/* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */ > +#define ZS_MIN_ALLOC_SIZE \ > + MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS)) Could you use the max(a,b) defined in include/linux/kernel.h?
[RESEND PATCH] tools/perf: fix build on non-glibc systems due to libio.h absence
Including libio.h causes build failures on uClibc systems (which lack libio.h). It appears that libio.h was only included to pull in a definition for NULL, so it has been replaced by stddef.h. On powerpc, libio.h was conditionally included, but could be removed completely as it is unneeded. Also, the include of stdlib.h was changed to stddef.h (as again, only NULL is needed). Signed-off-by: Cody P Schafer --- tools/perf/arch/arm/util/dwarf-regs.c | 5 + tools/perf/arch/powerpc/util/dwarf-regs.c | 5 + tools/perf/arch/s390/util/dwarf-regs.c| 2 +- tools/perf/arch/sh/util/dwarf-regs.c | 2 +- tools/perf/arch/sparc/util/dwarf-regs.c | 2 +- tools/perf/arch/x86/util/dwarf-regs.c | 2 +- 6 files changed, 6 insertions(+), 12 deletions(-) diff --git a/tools/perf/arch/arm/util/dwarf-regs.c b/tools/perf/arch/arm/util/dwarf-regs.c index e8d5c55..33ec5b3 100644 --- a/tools/perf/arch/arm/util/dwarf-regs.c +++ b/tools/perf/arch/arm/util/dwarf-regs.c @@ -8,10 +8,7 @@ * published by the Free Software Foundation. */ -#include <stdlib.h> -#ifndef __UCLIBC__ -#include <libio.h> -#endif +#include <stddef.h> #include <dwarf-regs.h> struct pt_regs_dwarfnum { diff --git a/tools/perf/arch/powerpc/util/dwarf-regs.c b/tools/perf/arch/powerpc/util/dwarf-regs.c index 7cdd61d..733151c 100644 --- a/tools/perf/arch/powerpc/util/dwarf-regs.c +++ b/tools/perf/arch/powerpc/util/dwarf-regs.c @@ -9,10 +9,7 @@ * 2 of the License, or (at your option) any later version. */ -#include <stdlib.h> -#ifndef __UCLIBC__ -#include <libio.h> -#endif +#include <stddef.h> #include <dwarf-regs.h> diff --git a/tools/perf/arch/s390/util/dwarf-regs.c b/tools/perf/arch/s390/util/dwarf-regs.c index e19653e..0469df0 100644 --- a/tools/perf/arch/s390/util/dwarf-regs.c +++ b/tools/perf/arch/s390/util/dwarf-regs.c @@ -6,7 +6,7 @@ * */ -#include <libio.h> +#include <stddef.h> #include <dwarf-regs.h> #define NUM_GPRS 16 diff --git a/tools/perf/arch/sh/util/dwarf-regs.c b/tools/perf/arch/sh/util/dwarf-regs.c index a11edb0..0d0897f 100644 --- a/tools/perf/arch/sh/util/dwarf-regs.c +++ b/tools/perf/arch/sh/util/dwarf-regs.c @@ -19,7 +19,7 @@ * */ -#include <libio.h> +#include <stddef.h> #include <dwarf-regs.h> /* diff --git a/tools/perf/arch/sparc/util/dwarf-regs.c b/tools/perf/arch/sparc/util/dwarf-regs.c index 0ab8848..92eda41 100644 --- a/tools/perf/arch/sparc/util/dwarf-regs.c +++ b/tools/perf/arch/sparc/util/dwarf-regs.c @@ -9,7 +9,7 @@ * 2 of the License, or (at your option) any later version. */ -#include <libio.h> +#include <stddef.h> #include <dwarf-regs.h> #define SPARC_MAX_REGS 96 diff --git a/tools/perf/arch/x86/util/dwarf-regs.c b/tools/perf/arch/x86/util/dwarf-regs.c index a794d30..be22dd4 100644 --- a/tools/perf/arch/x86/util/dwarf-regs.c +++ b/tools/perf/arch/x86/util/dwarf-regs.c @@ -20,7 +20,7 @@ * */ -#include <libio.h> +#include <stddef.h> #include <dwarf-regs.h> /* -- 1.8.1.5
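As a sanity check on the reasoning (only NULL is needed, and stddef.h is a freestanding header that every C implementation, including uClibc toolchains, must provide), something of this shape builds with no libc-specific headers at all. The table is a made-up stand-in for the per-arch register tables:

#include <stddef.h>	/* NULL -- guaranteed even in freestanding C */

static const char * const regstr_tbl[] = {
	"%r0", "%r1", NULL,	/* NULL used as a sentinel purely for illustration */
};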
Re: [PATCH] mm/mempolicy: use NUMA_NO_NODE
@@ -1802,11 +1802,11 @@ static inline unsigned interleave_nid(struct mempolicy *pol, /* * Return the bit number of a random bit set in the nodemask. - * (returns -1 if nodemask is empty) + * (returns NUMA_NO_NOD if nodemask is empty) s/NUMA_NO_NOD/NUMA_NO_NODE/ */ int node_random(const nodemask_t *maskp) { - int w, bit = -1; + int w, bit = NUMA_NO_NODE; w = nodes_weight(*maskp); if (w)
Re: [RFC][PATCH] mm: percpu pages: up batch size to fix arithmetic?? errror
On 09/11/2013 03:08 PM, Dave Hansen wrote: I really don't know where the: batch /= 4; /* We effectively *= 4 below */ ... batch = rounddown_pow_of_two(batch + batch/2) - 1; came from. The round down code at *MOST* does a *= 1.5, but *averages* out to be just under 1. On a system with 128GB in a zone, this means that we've got (you can see in /proc/zoneinfo for yourself): high: 186 (744kB) batch: 31 (124kB) That 124kB is almost precisely 1/4 of the "1/2 of a meg" that we were shooting for. We're under-sizing the batches by about 4x. This patch kills the /=4. --- diff -puN mm/page_alloc.c~debug-pcp-sizes-1 mm/page_alloc.c --- linux.git/mm/page_alloc.c~debug-pcp-sizes-1 2013-09-11 14:41:08.532445664 -0700 +++ linux.git-davehans/mm/page_alloc.c 2013-09-11 15:03:47.403912683 -0700 @@ -4103,7 +4103,6 @@ static int __meminit zone_batchsize(stru batch = zone->managed_pages / 1024; if (batch * PAGE_SIZE > 512 * 1024) batch = (512 * 1024) / PAGE_SIZE; - batch /= 4; /* We effectively *= 4 below */ if (batch < 1) batch = 1; _ Looking back at the first git commit (way before my time), it appears that the percpu pagesets initially had a ->high and ->low (now removed), set to batch*6 and batch*2 respectively. I assume the idea was to keep the number of pages in the percpu pagesets around batch*4, hence the comment. So we have this variable called "batch", and the code is trying to store the _average_ number of pcp pages we want into it (not the batchsize), and then we divide our "average" goal by 4 to get a batchsize. All the comments refer to the size of the pcp pagesets, not to the pcp pageset batchsize. Looking further, in current code we don't refill the pcp pagesets unless they are completely empty (->low was removed a while ago), and then we only add ->batch pages. Has anyone looked at what type of average pcp sizing the current code results in?
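To check the arithmetic against Dave's numbers above (assuming 4KB pages):

  managed_pages = 128GB / 4KB                  = 33,554,432
  batch = managed_pages / 1024                 = 32,768
  32,768 * 4KB > 512KB, so batch = 512KB / 4KB = 128
  batch /= 4                                   -> 32
  rounddown_pow_of_two(32 + 32/2) - 1 = 32 - 1 = 31    (matches "batch: 31")
  high = 6 * batch = 6 * 31                    = 186   (matches "high: 186")

With the /= 4 removed, batch would instead land at rounddown_pow_of_two(128 + 64) - 1 = 127, i.e. almost exactly the 512KB target.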
Re: [RFC][PATCH] mm: percpu pages: up batch size to fix arithmetic?? errror
On 09/11/2013 04:08 PM, Cody P Schafer wrote: On 09/11/2013 03:08 PM, Dave Hansen wrote: I really don't know where the: batch /= 4; /* We effectively *= 4 below */ ... batch = rounddown_pow_of_two(batch + batch/2) - 1; came from. The round down code at *MOST* does a *= 1.5, but *averages* out to be just under 1. On a system with 128GB in a zone, this means that we've got (you can see in /proc/zoneinfo for yourself): high: 186 (744kB) batch: 31 (124kB) That 124kB is almost precisely 1/4 of the "1/2 of a meg" that we were shooting for. We're under-sizing the batches by about 4x. This patch kills the /=4. --- diff -puN mm/page_alloc.c~debug-pcp-sizes-1 mm/page_alloc.c --- linux.git/mm/page_alloc.c~debug-pcp-sizes-1 2013-09-11 14:41:08.532445664 -0700 +++ linux.git-davehans/mm/page_alloc.c 2013-09-11 15:03:47.403912683 -0700 @@ -4103,7 +4103,6 @@ static int __meminit zone_batchsize(stru batch = zone->managed_pages / 1024; if (batch * PAGE_SIZE > 512 * 1024) batch = (512 * 1024) / PAGE_SIZE; - batch /= 4; /* We effectively *= 4 below */ if (batch < 1) batch = 1; _ Looking back at the first git commit (way before my time), it appears that the percpu pagesets initially had a ->high and ->low (now removed), set to batch*6 and batch*2 respectively. I assume the idea was to keep the number of pages in the percpu pagesets around batch*4, hence the comment. So we have this variable called "batch", and the code is trying to store the _average_ number of pcp pages we want into it (not the batchsize), and then we divide our "average" goal by 4 to get a batchsize. All the comments refer to the size of the pcp pagesets, not to the pcp pageset batchsize. Looking further, in current code we don't refill the pcp pagesets unless they are completely empty (->low was removed a while ago), and then we only add ->batch pages. Has anyone looked at what type of average pcp sizing the current code results in? Also, we may want to consider shrinking pcp->high down from 6*pcp->batch given that the original "6*" choice was based upon ->batch actually being 1/4th of the average pageset size, where now it appears closer to being the average.
Re: [PATCHv4 02/10] mm: convert mm->nr_ptes to atomic_t
On 09/27/2013 06:16 AM, Kirill A. Shutemov wrote: With split page table lock for PMD level we can't hold mm->page_table_lock while updating nr_ptes. Let's convert it to atomic_t to avoid races. --- diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 84e0c56e1e..99f19e850d 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -339,6 +339,7 @@ struct mm_struct { pgd_t * pgd; atomic_t mm_users; /* How many users with user space? */ atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ + atomic_t nr_ptes; /* Page table pages */ int map_count; /* number of VMAs */ spinlock_t page_table_lock; /* Protects page tables and some counters */ @@ -360,7 +361,6 @@ struct mm_struct { unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE */ unsigned long stack_vm; /* VM_GROWSUP/DOWN */ unsigned long def_flags; - unsigned long nr_ptes; /* Page table pages */ unsigned long start_code, end_code, start_data, end_data; unsigned long start_brk, brk, start_stack; unsigned long arg_start, arg_end, env_start, env_end; Will 32 bits always be enough here? Should atomic_long_t be used instead?
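A back-of-the-envelope bound (assuming x86-64 with 4KB pages, i.e. 512 PTEs per page-table page) suggests why 32 bits is probably still enough in practice:

  overflowing a 32-bit count needs ~2^31 page-table pages
    -> 2^31 * 4KB = 8TB of RAM consumed by page tables alone
    -> 2^31 * 2MB = 4PB of address space mapped by a single mm,
       well past the 48-bit (256TB) virtual address space

so the question is mainly one of future-proofing and unusual configurations, not a bound that can be hit today.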
[PATCH] rcu: comment: correct 'optimized' to 'optimize'.
Small grammar fix in the rcutree comment regarding the 'rcu_scheduler_active' var. --- kernel/rcutree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcutree.c b/kernel/rcutree.c index e441b77..bfb8972 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -105,7 +105,7 @@ int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */ * The rcu_scheduler_active variable transitions from zero to one just * before the first task is spawned. So when this variable is zero, RCU * can assume that there is but one task, allowing RCU to (for example) - * optimized synchronize_sched() to a simple barrier(). When this variable + * optimize synchronize_sched() to a simple barrier(). When this variable * is one, RCU must actually do all the hard work required to detect real * grace periods. This variable is also used to suppress boot-time false * positives from lockdep-RCU error checking. -- 1.8.0.3
[PATCH] powerpc/mm: eliminate unneeded for_each_memblock
The only persistent change made by this loop is calling memblock_set_node() once for each memblock, which is not useful (and has no effect) as memblock_set_node() is not called with any memblock-specific parameters. Substitute a single memblock_set_node() call. --- arch/powerpc/mm/mem.c | 9 ++--- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index 0dba506..50370bb 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -195,13 +195,8 @@ void __init do_init_bootmem(void) min_low_pfn = MEMORY_START >> PAGE_SHIFT; boot_mapsize = init_bootmem_node(NODE_DATA(0), start >> PAGE_SHIFT, min_low_pfn, max_low_pfn); - /* Add active regions with valid PFNs */ - for_each_memblock(memory, reg) { - unsigned long start_pfn, end_pfn; - start_pfn = memblock_region_memory_base_pfn(reg); - end_pfn = memblock_region_memory_end_pfn(reg); - memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0); - } + /* Place all memblock_regions in the same node and merge contiguous memblock_regions */ + memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0); /* Add all physical memory to the bootmem map, mark each area * present. -- 1.8.0.3
Re: [PATCH] mm/page_alloc: don't re-init pageset in zone_pcp_update()
On 06/12/2013 02:20 PM, Andrew Morton wrote: On Tue, 11 Jun 2013 15:12:59 -0700 Cody P Schafer wrote: Factor pageset_set_high_and_batch() (which contains all needed logic to set a pageset's ->high and ->batch irrespective of system state) out of zone_pageset_init(), which avoids us calling pageset_init(), and unsafely blowing away a pageset at runtime (leaked pages and potentially some funky allocations would be the result) when memory hotplug is triggered. This changelog is pretty screwed up :( It tells us what the patch does but not why it does it. It says why it does it, though perhaps a bit hidden: > avoids us calling pageset_init(), and unsafely blowing away a pageset Signed-off-by: Cody P Schafer --- Unless memory hotplug is being triggered on boot, this should *not* be the cause of Valdis Kletnieks' reported bug in -next: "next-20130607 BUG: Bad page state in process systemd pfn:127643" And this addendum appears to hint at the info we need. Note the *not*. I included this note only because I expected there would be a question of whether Valdis's reported bug was caused by this. It _isn't_. The bug this fix fixes is only triggered by memory_hotplug, and Valdis's bug occurred on boot. Please, send a new changelog? That should include a description of the user-visible effects of the bug which is being fixed, a description of why it occurs and a description of how it was fixed. It would also be helpful if you can identify which kernel version(s) need the fix. It's just a -mm issue. It was introduced by my patchset starting with "mm/page_alloc: factor out setting of pcp->high and pcp->batch", where the actual place the bug snuck in was "mm/page_alloc: in zone_pcp_update(), uze zone_pageset_init()". Also, a Reported-by: Valdis would be appropriate. I'm fine with adding it (I did take a look at my page_alloc.c changes because he reported that bug), but as mentioned before, this fixes a different bug. Anyhow, a reorganized (and clearer) changelog with the same content follows: --- mm/page_alloc: don't re-init pageset in zone_pcp_update() When memory hotplug is triggered, we call pageset_init() on a per-cpu-pageset which both contains pages and is in use, causing both the leakage of those pages and (potentially) bad behaviour if a page is allocated from the pageset while it is being cleared. Avoid this by factoring pageset_set_high_and_batch() (which contains all needed logic to set a pageset's ->high and ->batch irrespective of system state), and using that instead of zone_pageset_init() in zone_pcp_update(). Signed-off-by: Cody P Schafer
Re: [PATCH] mm/page_alloc: don't re-init pageset in zone_pcp_update()
Anyhow, a reorganized (and clearer) changelog with the same content follows: --- I made a few wording tweaks: --- mm/page_alloc: don't re-init pageset in zone_pcp_update() When memory hotplug is triggered, we call pageset_init() on per-cpu-pagesets which both contain pages and are in use, causing both the leakage of those pages and (potentially) bad behaviour if a page is allocated from a pageset while it is being cleared. Avoid this by factoring out pageset_set_high_and_batch() (which contains all needed logic to set a pageset's ->high and ->batch irrespective of system state) from zone_pageset_init() and using the new pageset_set_high_and_batch() instead of zone_pageset_init() in zone_pcp_update(). Signed-off-by: Cody P Schafer
Re: [PATCH updated] drivers/base: Use attribute groups to create sysfs memory files
On 05/28/2013 10:17 PM, Nathan Fontenot wrote: Update the sysfs memory code to create/delete files at the time of device and subsystem registration. The current code creates files in the root memory directory explicitly through the use of init_* routines. The files for each memory block are created and deleted explicitly using the mem_[create|delete]_simple_file macros. This patch creates attribute groups for the memory root files and files in each memory block directory so that they are created and deleted implicitly at subsys and device register and unregister time. This did necessitate moving the register_memory() routine and updating it to set the dev.groups field. Signed-off-by: Nathan Fontenot Updated to apply cleanly to rc2. Please cc me on responses/comments. --- drivers/base/memory.c | 143 +- 1 file changed, 62 insertions(+), 81 deletions(-) Index: linux/drivers/base/memory.c === --- linux.orig/drivers/base/memory.c 2013-05-28 22:53:58.0 -0500 +++ linux/drivers/base/memory.c 2013-05-28 22:56:49.0 -0500 These changes look good, but this email doesn't play nice with `git am`. ex: "fatal: corrupt patch at line 80" There is nothing particularly bad about line 80. Please fix and resend (git format-patch generally gets this right, maybe use that?)
[PATCH] mm: sparse: use __aligned() instead of manual padding in mem_section
Instead of leaving a trap for the next person who comes along and wants to add something to mem_section, add an __aligned() and remove the manual padding added for MEMCG. Signed-off-by: Cody P Schafer --- include/linux/mmzone.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) --- Also, does anyone know what causes this alignment to be required here? I found this was breaking things in a patchset I'm working on (WARNs in sysfs code about duplicate filenames when initing mem_sections). Adding some documentation for the reason would be appreciated. diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 131989a..a8e8056 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1125,9 +1125,8 @@ struct mem_section { * section. (see memcontrol.h/page_cgroup.h about this.) */ struct page_cgroup *page_cgroup; - unsigned long pad; #endif -}; +} __aligned(2 * sizeof(unsigned long)); #ifdef CONFIG_SPARSEMEM_EXTREME #define SECTIONS_PER_ROOT (PAGE_SIZE / sizeof (struct mem_section)) -- 1.8.3
Re: [PATCH] mm: sparse: use __aligned() instead of manual padding in mem_section
On 05/29/2013 05:54 PM, Jiang Liu wrote: On Thu 30 May 2013 07:14:39 AM CST, Cody P Schafer wrote: Also, does anyone know what causes this alignment to be required here? I found this was breaking things in a patchset I'm working on (WARNs in sysfs code about duplicate filenames when initing mem_sections). Adding some documentation for the reason would be appreciated. Hi Cody, I think the alignment requirement is caused by the way the mem_section array is organized. Basically it requires that PAGE_SIZE be evenly divisible by sizeof(struct mem_section). So your change seems risky too, because the structure should be aligned to a power of two rather than to 2 * sizeof(long). Well, if that's the case then this patch is wrong, and manual padding may be the only way to go. :(
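A worked example of the constraint, using the SECTIONS_PER_ROOT definition quoted earlier and assuming PAGE_SIZE = 4096:

  sizeof(struct mem_section) = 32 -> SECTIONS_PER_ROOT = 128 (a power of two)
    SECTION_ROOT_MASK = 127 = 0b1111111, so (sec & SECTION_ROOT_MASK)
    == (sec % SECTIONS_PER_ROOT) for every section number: correct.

  sizeof(struct mem_section) = 40 -> SECTIONS_PER_ROOT = 4096/40 = 102
    SECTION_ROOT_MASK = 101 = 0b1100101, which is not a modulo mask;
    e.g. sec = 27 gives 27 % 102 = 27 but 27 & 101 = 1, indexing the
    wrong mem_section within a root -- consistent with the
    duplicate-filename WARNs seen from sysfs.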
[PATCH] sparsemem: BUILD_BUG_ON when sizeof mem_section is non-power-of-2
Instead of leaving a hidden trap for the next person who comes along and wants to add something to mem_section, add a big fat warning about it needing to be a power-of-2, and insert a BUILD_BUG_ON() in sparse_init() to catch mistakes. Right now non-power-of-2 mem_sections cause a number of WARNs at boot (which don't clearly point to the size of mem_section as an issue), but the system limps on (temporarily, at least). This is based upon Dave Hansen's earlier RFC where he ran into the same issue: "sparsemem: fix boot when SECTIONS_PER_ROOT is not power-of-2" http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/03077.html Signed-off-by: Cody P Schafer --- Dave: Consider it resurrected. --- include/linux/mmzone.h | 4 mm/sparse.c| 3 +++ 2 files changed, 7 insertions(+) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 131989a..88e23f3 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1127,6 +1127,10 @@ struct mem_section { struct page_cgroup *page_cgroup; unsigned long pad; #endif + /* +* WARNING: mem_section must be a power-of-2 in size for the +* calculation and use of SECTION_ROOT_MASK to make sense. +*/ }; #ifdef CONFIG_SPARSEMEM_EXTREME diff --git a/mm/sparse.c b/mm/sparse.c index 1c91f0d3..3194ec4 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -481,6 +481,9 @@ void __init sparse_init(void) struct page **map_map; #endif + /* see include/linux/mmzone.h 'struct mem_section' definition */ + BUILD_BUG_ON(!is_power_of_2(sizeof(struct mem_section))); + /* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */ set_pageblock_order(); -- 1.8.3
Re: [PATCH v2] drivers/base: Use attribute groups to create sysfs memory files
On 05/31/2013 01:31 PM, Nathan Fontenot wrote: Update the sysfs memory code to create/delete files at the time of device and subsystem registration. The current code creates files in the root memory directory explicitly through the use of init_* routines. The files for each memory block are created and deleted explicitly using the mem_[create|delete]_simple_file macros. This patch creates attribute groups for the memory root files and files in each memory block directory so that they are created and deleted implicitly at subsys and device register and unregister time. This did necessitate moving the register_memory() routine and updating it to set the dev.groups field. Signed-off-by: Nathan Fontenot Please cc me on responses/comments. v2: refreshed the patch, previous version was corrupted. There is no difference otherwise between this patch and the previous one sent out. Still looks broken. Tabs have been converted into spaces, and the top of the patch is whitespace padded to 80 characters.
[PATCH] mm/page_alloc: don't re-init pageset in zone_pcp_update()
Factor pageset_set_high_and_batch() (which contains all needed logic to set a pageset's ->high and ->batch irrespective of system state) out of zone_pageset_init(), which avoids us calling pageset_init(), and unsafely blowing away a pageset at runtime (leaked pages and potentially some funky allocations would be the result) when memory hotplug is triggered. Signed-off-by: Cody P Schafer --- Unless memory hotplug is being triggered on boot, this should *not* be the cause of Valdis Kletnieks' reported bug in -next: "next-20130607 BUG: Bad page state in process systemd pfn:127643" --- mm/page_alloc.c | 17 - 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 18102e1..f62c7ac 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4111,11 +4111,9 @@ static void pageset_set_high(struct per_cpu_pageset *p, pageset_update(&p->pcp, high, batch); } -static void __meminit zone_pageset_init(struct zone *zone, int cpu) +static void __meminit pageset_set_high_and_batch(struct zone *zone, + struct per_cpu_pageset *pcp) { - struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu); - - pageset_init(pcp); if (percpu_pagelist_fraction) pageset_set_high(pcp, (zone->managed_pages / @@ -4124,6 +4122,14 @@ static void __meminit zone_pageset_init(struct zone *zone, int cpu) pageset_set_batch(pcp, zone_batchsize(zone)); } +static void __meminit zone_pageset_init(struct zone *zone, int cpu) +{ + struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu); + + pageset_init(pcp); + pageset_set_high_and_batch(zone, pcp); +} + static void __meminit setup_zone_pageset(struct zone *zone) { int cpu; @@ -6173,7 +6179,8 @@ void __meminit zone_pcp_update(struct zone *zone) unsigned cpu; mutex_lock(&pcp_batch_high_lock); for_each_possible_cpu(cpu) - zone_pageset_init(zone, cpu); + pageset_set_high_and_batch(zone, + per_cpu_ptr(zone->pageset, cpu)); mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.3
Re: [PATCH v4 net-next] net: poll/select low latency socket support
On 06/24/2013 12:28 AM, Eliezer Tamir wrote: select/poll busy-poll support. ... diff --git a/fs/select.c b/fs/select.c index 8c1c96c..79b876e 100644 --- a/fs/select.c +++ b/fs/select.c @@ -400,6 +402,8 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time) poll_table *wait; int retval, i, timed_out = 0; unsigned long slack = 0; + unsigned int ll_flag = POLL_LL; + u64 ll_time = ll_end_time(); rcu_read_lock(); retval = max_select_fd(n, fds); @@ -750,6 +768,8 @@ static int do_poll(unsigned int nfds, struct poll_list *list, ktime_t expire, *to = NULL; int timed_out = 0, count = 0; unsigned long slack = 0; + unsigned int ll_flag = POLL_LL; + u64 ll_time = ll_end_time(); /* Optimise the no-wait case */ if (end_time && !end_time->tv_sec && !end_time->tv_nsec) { diff --git a/include/net/ll_poll.h b/include/net/ll_poll.h index fcc7c36..5bf2b3a 100644 --- a/include/net/ll_poll.h +++ b/include/net/ll_poll.h @@ -38,17 +39,18 @@ extern unsigned int sysctl_net_ll_poll __read_mostly; /* we can use sched_clock() because we don't care much about precision * we only care that the average is bounded + * we don't mind a ~2.5% imprecision so <<10 instead of *1000 + * sk->sk_ll_usec is a u_int so this can't overflow */ -static inline u64 ll_end_time(struct sock *sk) +static inline u64 ll_sk_end_time(struct sock *sk) { - u64 end_time = ACCESS_ONCE(sk->sk_ll_usec); - - /* we don't mind a ~2.5% imprecision -* sk->sk_ll_usec is a u_int so this can't overflow -*/ - end_time = (end_time << 10) + sched_clock(); + return ((u64)ACCESS_ONCE(sk->sk_ll_usec) << 10) + sched_clock(); +} - return end_time; +/* in poll/select we use the global sysctl_net_ll_poll value */ +static inline u64 ll_end_time(void) +{ + return ((u64)ACCESS_ONCE(sysctl_net_ll_poll) << 10) + sched_clock(); } static inline bool sk_valid_ll(struct sock *sk) @@ -62,10 +64,13 @@ static inline bool can_poll_ll(u64 end_time) return !time_after64(sched_clock(), end_time); } +/* when used in sock_poll() nonblock is known at compile time to be true + * so the loop and end_time will be optimized out + */ static inline bool sk_poll_ll(struct sock *sk, int nonblock) { + u64 end_time = nonblock ? 0 : ll_sk_end_time(sk); const struct net_device_ops *ops; - u64 end_time = ll_end_time(sk); struct napi_struct *napi; int rc = false; I'm seeing warnings about using smp_processor_id() while preemptable (log included below) due to this patch. I expect the use of ll_end_time() -> sched_clock() here is triggering this. Apologies if this has already been noted. -- # [3.114452] BUG: using smp_processor_id() in preemptible [] code: sh/62 [3.117970] caller is native_sched_clock+0x20/0x80 [3.120303] CPU: 0 PID: 62 Comm: sh Not tainted 3.10.0-rc6-dnuma-01032-g2d48d67 #21 [3.123710] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [3.128616] 880002b6baf0 813c07d0 880002b6bb08 [3.135055] 811ff835 0004d076eeed 880002b6bb20 81009ac0 [3.137359] 880002b6bb30 81009b29 880002b6bf40 [3.138954] Call Trace: [3.139466] [] dump_stack+0x19/0x1b [3.140559] [] debug_smp_processor_id+0xd5/0xf0 [3.141831] [] native_sched_clock+0x20/0x80 [3.143031] [] sched_clock+0x9/0x10 [3.144127] [] do_sys_poll+0x1f6/0x500 [3.145239] [] ? sched_clock+0x9/0x10 [3.146335] [] ? native_sched_clock+0x20/0x80 [3.147557] [] ? sched_clock_local+0x1d/0x90 [3.148816] [] ? native_sched_clock+0x20/0x80 [3.150007] [] ? sched_clock+0x9/0x10 [3.151090] [] ? sched_clock_local+0x1d/0x90 [3.152419] [] ? native_sched_clock+0x20/0x80 [3.153638] [] ? native_sched_clock+0x20/0x80 [3.154865] [] ? 
sched_clock+0x9/0x10 [3.155961] [] ? sched_clock_local+0x1d/0x90 [3.157230] [] ? sched_clock_cpu+0xa8/0x100 [3.158433] [] ? SyS_getdents64+0x110/0x110 [3.159628] [] ? native_sched_clock+0x20/0x80 [3.160916] [] ? sched_clock+0x9/0x10 [3.162003] [] ? sched_clock_local+0x1d/0x90 [3.163207] [] ? sched_clock_cpu+0xa8/0x100 [3.164427] [] ? get_lock_stats+0x19/0x60 [3.165580] [] ? put_lock_stats.isra.28+0xe/0x40 [3.166856] [] ? __mutex_unlock_slowpath+0x105/0x1a0 [3.168412] [] ? trace_hardirqs_on_caller+0x105/0x1d0 [3.169944] [] ? trace_hardirqs_on+0xd/0x10 [3.171155] [] ? mutex_unlock+0x9/0x10 [3.172355] [] ? tty_ioctl+0xa53/0xd40 [3.173483] [] ? lock_release_non_nested+0x308/0x350 [3.174848] [] ? __lock_acquire+0x3d6/0xb70 [3.176087] [] ? trace_hardirqs_on_caller+0x105/0x1d0 [3.177466] [] ? do_vfs_ioctl+0x305/0x510 [3.178629] [] ? sysret_check+0x22/0x5d [3.179764] [] ? trace_hardirqs_on_caller+0x105/0x1d0 [
Re: [PATCH v4 net-next] net: poll/select low latency socket support
On 06/27/2013 05:25 PM, Cody P Schafer wrote: On 06/24/2013 12:28 AM, Eliezer Tamir wrote: select/poll busy-poll support. ... I'm seeing warnings about using smp_processor_id() while preemptable (log included below) due to this patch. I expect the use of ll_end_time() -> sched_clock() here is triggering this. Apologies if this has already been noted. To be clear, given how sched_clock() is used here the BUG appears invalid, but we need a way to mark this as OK for the smp_processor_id() checks so we don't get BUG spam. -- # [3.114452] BUG: using smp_processor_id() in preemptible [] code: sh/62 ...
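One possible way to get that (an assumption on my part, not something settled in this thread) is local_clock(), which wraps the sched_clock machinery with its own preemption handling and is safe to call from preemptible context. The poll/select helper would then look roughly like:

/* sketch only -- include/net/ll_poll.h */
static inline u64 ll_end_time(void)
{
	/* local_clock() disables preemption internally around the
	 * per-cpu clock read, so no smp_processor_id() warning */
	return ((u64)ACCESS_ONCE(sysctl_net_ll_poll) << 10) + local_clock();
}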
[PATCH] trivial: mmzone: correct "pags" to "pages" in comment.
Signed-off-by: Cody P Schafer --- include/linux/mmzone.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index c74092e..5c76737 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -450,7 +450,7 @@ struct zone { * * present_pages is physical pages existing within the zone, which * is calculated as: -* present_pages = spanned_pages - absent_pages(pags in holes); +* present_pages = spanned_pages - absent_pages(pages in holes); * * managed_pages is present pages managed by the buddy system, which * is calculated as (reserved_pages includes pages allocated by the -- 1.8.2
[PATCH 1/3] page_alloc: make setup_nr_node_ids() usable for arch init code
powerpc and x86 were opencoding copies of setup_nr_node_ids(), which page_alloc provides but makes static. Make it available to the archs in linux/mm.h. Signed-off-by: Cody P Schafer --- include/linux/mm.h | 6 ++ mm/page_alloc.c| 6 +- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 7acc9dc..3405405 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1755,5 +1755,11 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; } static inline bool page_is_guard(struct page *page) { return false; } #endif /* CONFIG_DEBUG_PAGEALLOC */ +#if MAX_NUMNODES > 1 +void __init setup_nr_node_ids(void); +#else +static inline void setup_nr_node_ids(void) {} +#endif + #endif /* __KERNEL__ */ #endif /* _LINUX_MM_H */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8fcced7..96909bb 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4710,7 +4710,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size, /* * Figure out the number of possible node ids. */ -static void __init setup_nr_node_ids(void) +void __init setup_nr_node_ids(void) { unsigned int node; unsigned int highest = 0; @@ -4719,10 +4719,6 @@ static void __init setup_nr_node_ids(void) highest = node; nr_node_ids = highest + 1; } -#else -static inline void setup_nr_node_ids(void) -{ -} #endif /** -- 1.8.2
[PATCH 0/3] mm: avoid duplication of setup_nr_node_ids()
In arch/powerpc, arch/x86, and mm/page_alloc, code to set up nr_node_ids based on node_possible_map is duplicated. This patchset switches those copies to calling the function provided by page_alloc.
[PATCH 2/3] x86/mm/numa: use setup_nr_node_ids() instead of opencoding.
Signed-off-by: Cody P Schafer --- arch/x86/mm/numa.c | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 72fe01e..a71c4e2 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -114,14 +114,11 @@ void numa_clear_node(int cpu) */ void __init setup_node_to_cpumask_map(void) { - unsigned int node, num = 0; + unsigned int node; /* setup nr_node_ids if not done yet */ - if (nr_node_ids == MAX_NUMNODES) { - for_each_node_mask(node, node_possible_map) - num = node; - nr_node_ids = num + 1; - } + if (nr_node_ids == MAX_NUMNODES) + setup_nr_node_ids(); /* allocate the map */ for (node = 0; node < nr_node_ids; node++) -- 1.8.2
[PATCH 3/3] powerpc/mm/numa: use setup_nr_node_ids() instead of opencoding.
Signed-off-by: Cody P Schafer --- arch/powerpc/mm/numa.c | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index bba87ca..7574ae3 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -62,14 +62,11 @@ static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS]; */ static void __init setup_node_to_cpumask_map(void) { - unsigned int node, num = 0; + unsigned int node; /* setup nr_node_ids if not done yet */ - if (nr_node_ids == MAX_NUMNODES) { - for_each_node_mask(node, node_possible_map) - num = node; - nr_node_ids = num + 1; - } + if (nr_node_ids == MAX_NUMNODES) + setup_nr_node_ids(); /* allocate the map */ for (node = 0; node < nr_node_ids; node++) -- 1.8.2
Re: [RFC][PATCH] drivers: base: dynamic memory block creation
On 08/14/2013 02:14 PM, Seth Jennings wrote: >An existing tool would not work >with this patch (plus boot option) since it would not know how to >show/hide things. It lets _part_ of those existing tools get reused >since they only have to be taught how to show/hide things. > >I'd find this really intriguing if you found a way to keep even the old >tools working. Instead of having an explicit show/hide, why couldn't >you just create the entries on open(), for instance? Nathan and I talked about this and I'm not sure if sysfs would support such a thing, i.e. memory block creation when someone tried to cd into the memory block device config. I wouldn't know where to start on that. Also, I'd expect userspace tools might use readdir() to find out what memory blocks a system has (unless they just stat("memory0"), stat("memory1")...). I don't think filesystem tricks (at least within sysfs) are going to let this magically be solved without breaking the userspace API. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
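For context, this is roughly what a readdir()-based consumer looks like -- a plain userspace sketch, not taken from any particular tool. A memory block that only came into existence on open() would never show up in this listing:

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        DIR *d = opendir("/sys/devices/system/memory");
        struct dirent *de;

        if (!d) {
            perror("opendir");
            return 1;
        }
        /* enumerate memory block directories: memory0, memory1, ... */
        while ((de = readdir(d)) != NULL)
            if (strncmp(de->d_name, "memory", 6) == 0)
                printf("%s\n", de->d_name);
        closedir(d);
        return 0;
    }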
[PATCH 3/5] rbtree_test: add test for postorder iteration
Just check that we examine all nodes in the tree for the postorder iteration. Signed-off-by: Cody P Schafer --- lib/rbtree_test.c | 12 1 file changed, 12 insertions(+) diff --git a/lib/rbtree_test.c b/lib/rbtree_test.c index 122f02f..31dd4cc 100644 --- a/lib/rbtree_test.c +++ b/lib/rbtree_test.c @@ -114,6 +114,16 @@ static int black_path_count(struct rb_node *rb) return count; } +static void check_postorder(int nr_nodes) +{ + struct rb_node *rb; + int count = 0; + for (rb = rb_first_postorder(&root); rb; rb = rb_next_postorder(rb)) + count++; + + WARN_ON_ONCE(count != nr_nodes); +} + static void check(int nr_nodes) { struct rb_node *rb; @@ -136,6 +146,8 @@ static void check(int nr_nodes) WARN_ON_ONCE(count != nr_nodes); WARN_ON_ONCE(count < (1 << black_path_count(rb_last(&root))) - 1); + + check_postorder(nr_nodes); } static void check_augmented(int nr_nodes) -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 5/5] mm/zswap: use postorder iteration when destroying rbtree
Signed-off-by: Cody P Schafer --- mm/zswap.c | 15 ++- 1 file changed, 2 insertions(+), 13 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index deda2b6..98d99c4 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -791,25 +791,14 @@ static void zswap_frontswap_invalidate_area(unsigned type) { struct zswap_tree *tree = zswap_trees[type]; struct rb_node *node; - struct zswap_entry *entry; + struct zswap_entry *entry, *n; if (!tree) return; /* walk the tree and free everything */ spin_lock(&tree->lock); - /* -* TODO: Even though this code should not be executed because -* the try_to_unuse() in swapoff should have emptied the tree, -* it is very wasteful to rebalance the tree after every -* removal when we are freeing the whole tree. -* -* If post-order traversal code is ever added to the rbtree -* implementation, it should be used here. -*/ - while ((node = rb_first(&tree->rbroot))) { - entry = rb_entry(node, struct zswap_entry, rbnode); - rb_erase(&entry->rbnode, &tree->rbroot); + rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode) { zbud_free(tree->pool, entry->handle); zswap_entry_cache_free(entry); atomic_dec(&zswap_stored_pages); -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/5] rbtree: add rbtree_postorder_for_each_entry_safe() helper
Because deletion (of the entire tree) is a relatively common use of the rbtree_postorder iteration, and because doing it safely means fiddling with temporary storage, provide a helper to simplify postorder rbtree iteration. Signed-off-by: Cody P Schafer --- include/linux/rbtree.h | 17 + 1 file changed, 17 insertions(+) diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h index 2879e96..64ab98b 100644 --- a/include/linux/rbtree.h +++ b/include/linux/rbtree.h @@ -85,4 +85,21 @@ static inline void rb_link_node(struct rb_node * node, struct rb_node * parent, *rb_link = node; } +/** + * rbtree_postorder_for_each_entry_safe - iterate over rb_root in post order of + * given type safe against removal of rb_node entry + * + * @pos: the 'type *' to use as a loop cursor. + * @n: another 'type *' to use as temporary storage + * @root: 'rb_root *' of the rbtree. + * @field: the name of the rb_node field within 'type'. + */ +#define rbtree_postorder_for_each_entry_safe(pos, n, root, field) \ + for (pos = rb_entry(rb_first_postorder(root), typeof(*pos), field),\ + n = rb_entry(rb_next_postorder(&pos->field), \ + typeof(*pos), field); \ +&pos->field; \ +pos = n, \ + n = rb_entry(rb_next_postorder(&pos->field), typeof(*pos), field)) + #endif /* _LINUX_RBTREE_H */ -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/5] rbtree: allow tests to run as builtin
No reason to require rbtree test code to be a module; allow it to be builtin (streamlines my development process) Signed-off-by: Cody P Schafer --- lib/Kconfig.debug | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 1501aa5..606e3c8 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1442,7 +1442,7 @@ config BACKTRACE_SELF_TEST config RBTREE_TEST tristate "Red-Black tree test" - depends on m && DEBUG_KERNEL + depends on DEBUG_KERNEL help A benchmark measuring the performance of the rbtree library. Also includes rbtree invariant checks. -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/5] Add rbtree postorder iteration functions, runtime tests, and update zswap to use.
Postorder iteration yields all of a node's children prior to yielding the node itself, and this particular implementation also avoids examining the leaf links in a node after that node has been yielded. In what I expect will be it's most common usage, postorder iteration allows the deletion of every node in an rbtree without modifying the rbtree nodes (no _requirement_ that they be nulled) while avoiding referencing child nodes after they have been "deleted" (most commonly, freed). I have only updated zswap to use this functionality at this point, but numerous bits of code (most notably in the filesystem drivers) use a hand rolled postorder iteration that NULLs child links as it traverses the tree. Each of those instances could be replaced with this common implementation. Cody P Schafer (5): rbtree: add postorder iteration functions. rbtree: add rbtree_postorder_for_each_entry_safe() helper. rbtree_test: add test for postorder iteration. rbtree: allow tests to run as builtin mm/zswap: use postorder iteration when destroying rbtree include/linux/rbtree.h | 21 + lib/Kconfig.debug | 2 +- lib/rbtree.c | 40 lib/rbtree_test.c | 12 mm/zswap.c | 15 ++- 5 files changed, 76 insertions(+), 14 deletions(-) -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/5] rbtree: add postorder iteration functions
Add postorder iteration functions for rbtree. These are useful for safely freeing an entire rbtree without modifying the tree at all. Signed-off-by: Cody P Schafer --- include/linux/rbtree.h | 4 lib/rbtree.c | 40 2 files changed, 44 insertions(+) diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h index 0022c1b..2879e96 100644 --- a/include/linux/rbtree.h +++ b/include/linux/rbtree.h @@ -68,6 +68,10 @@ extern struct rb_node *rb_prev(const struct rb_node *); extern struct rb_node *rb_first(const struct rb_root *); extern struct rb_node *rb_last(const struct rb_root *); +/* Postorder iteration - always visit the parent after it's children */ +extern struct rb_node *rb_first_postorder(const struct rb_root *); +extern struct rb_node *rb_next_postorder(const struct rb_node *); + /* Fast replacement of a single node without remove/rebalance/add/rebalance */ extern void rb_replace_node(struct rb_node *victim, struct rb_node *new, struct rb_root *root); diff --git a/lib/rbtree.c b/lib/rbtree.c index c0e31fe..65f4eff 100644 --- a/lib/rbtree.c +++ b/lib/rbtree.c @@ -518,3 +518,43 @@ void rb_replace_node(struct rb_node *victim, struct rb_node *new, *new = *victim; } EXPORT_SYMBOL(rb_replace_node); + +static struct rb_node *rb_left_deepest_node(const struct rb_node *node) +{ + for (;;) { + if (node->rb_left) + node = node->rb_left; + else if (node->rb_right) + node = node->rb_right; + else + return (struct rb_node *)node; + } +} + +struct rb_node *rb_next_postorder(const struct rb_node *node) +{ + const struct rb_node *parent; + if (!node) + return NULL; + parent = rb_parent(node); + + /* If we're sitting on node, we've already seen our children */ + if (parent && node == parent->rb_left && parent->rb_right) { + /* If we are the parent's left node, go to the parent's right +* node then all the way down to the left */ + return rb_left_deepest_node(parent->rb_right); + } else + /* Otherwise we are the parent's right node, and the parent +* should be next */ + return (struct rb_node *)parent; +} +EXPORT_SYMBOL(rb_next_postorder); + +struct rb_node *rb_first_postorder(const struct rb_root *root) +{ + if (!root->rb_node) + return NULL; + + return rb_left_deepest_node(root->rb_node); +} +EXPORT_SYMBOL(rb_first_postorder); -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] rbtree: add postorder iteration functions
On 07/29/2013 08:01 AM, Seth Jennings wrote: On Fri, Jul 26, 2013 at 02:13:39PM -0700, Cody P Schafer wrote: diff --git a/lib/rbtree.c b/lib/rbtree.c index c0e31fe..65f4eff 100644 --- a/lib/rbtree.c +++ b/lib/rbtree.c @@ -518,3 +518,43 @@ void rb_replace_node(struct rb_node *victim, struct rb_node *new, *new = *victim; } EXPORT_SYMBOL(rb_replace_node); + +static struct rb_node *rb_left_deepest_node(const struct rb_node *node) +{ + for (;;) { + if (node->rb_left) + node = node->rb_left; Assigning to an argument passed as const seems weird to me. I would think it shouldn't compile but it does. I guess my understanding of const is incomplete. Ya, that is due to const's binding: const struct rb_node *node1; // the thing pointed to is const const struct rb_node node2; // node is const struct rb_node *const node3; // node is const const struct rb_node *const node4; // both node and the thing // pointed to are const And so it ends up being perfectly legal (I use the first case listed here). + else if (node->rb_right) + node = node->rb_right; + else + return (struct rb_node *)node; + } +} + +struct rb_node *rb_next_postorder(const struct rb_node *node) +{ + const struct rb_node *parent; + if (!node) + return NULL; + parent = rb_parent(node); Again here. Seth -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
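The binding rules above can be checked with a few lines of standalone C -- a reduced sketch using a stub rb_node rather than the kernel definition:

    struct rb_node {
        struct rb_node *rb_left, *rb_right;
    };

    int main(void)
    {
        struct rb_node a = { 0, 0 }, b = { 0, 0 };
        const struct rb_node *node1 = &a;  /* pointee is const */
        struct rb_node *const node3 = &a;  /* pointer is const */

        node1 = &b;  /* legal: the pointer itself is writable */
        /* node1->rb_left = &b;  error: assignment to const pointee */
        /* node3 = &b;           error: assignment to const pointer */
        (void)node3;
        return 0;
    }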
Re: [PATCH 2/5] rbtree: add rbtree_postorder_for_each_entry_safe() helper
On 07/29/2013 08:06 AM, Seth Jennings wrote: On Fri, Jul 26, 2013 at 02:13:40PM -0700, Cody P Schafer wrote: Because deletion (of the entire tree) is a relatively common use of the rbtree_postorder iteration, and because doing it safely means fiddling with temporary storage, provide a helper to simplify postorder rbtree iteration. Signed-off-by: Cody P Schafer --- include/linux/rbtree.h | 17 + 1 file changed, 17 insertions(+) diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h index 2879e96..64ab98b 100644 --- a/include/linux/rbtree.h +++ b/include/linux/rbtree.h @@ -85,4 +85,21 @@ static inline void rb_link_node(struct rb_node * node, struct rb_node * parent, *rb_link = node; } +/** + * rbtree_postorder_for_each_entry_safe - iterate over rb_root in post order of + * given type safe against removal of rb_node entry + * + * @pos: the 'type *' to use as a loop cursor. + * @n: another 'type *' to use as temporary storage + * @root: 'rb_root *' of the rbtree. + * @field: the name of the rb_node field within 'type'. + */ +#define rbtree_postorder_for_each_entry_safe(pos, n, root, field) \ + for (pos = rb_entry(rb_first_postorder(root), typeof(*pos), field),\ + n = rb_entry(rb_next_postorder(&pos->field), \ + typeof(*pos), field); \ +&pos->field; \ +pos = n, \ + n = rb_entry(rb_next_postorder(&pos->field), typeof(*pos), field)) One too many spaces. Also mix of tabs and spaces is weird, but checkpatch doesn't complain so... Seth The extra space is to set off ';' vs ',' in the macro. And I did that instead of a tab to avoid wrapping. I've adjusted them (in the next version) to use the same style as list.h's list_for_each*() macros. Which results in more wrapping :( . + #endif/* _LINUX_RBTREE_H */ -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 1/5] rbtree: add postorder iteration functions
Add postorder iteration functions for rbtree. These are useful for safely freeing an entire rbtree without modifying the tree at all. Signed-off-by: Cody P Schafer Reviewed-by: Seth Jennings --- include/linux/rbtree.h | 4 lib/rbtree.c | 40 2 files changed, 44 insertions(+) diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h index 0022c1b..c467151 100644 --- a/include/linux/rbtree.h +++ b/include/linux/rbtree.h @@ -68,6 +68,10 @@ extern struct rb_node *rb_prev(const struct rb_node *); extern struct rb_node *rb_first(const struct rb_root *); extern struct rb_node *rb_last(const struct rb_root *); +/* Postorder iteration - always visit the parent after its children */ +extern struct rb_node *rb_first_postorder(const struct rb_root *); +extern struct rb_node *rb_next_postorder(const struct rb_node *); + /* Fast replacement of a single node without remove/rebalance/add/rebalance */ extern void rb_replace_node(struct rb_node *victim, struct rb_node *new, struct rb_root *root); diff --git a/lib/rbtree.c b/lib/rbtree.c index c0e31fe..65f4eff 100644 --- a/lib/rbtree.c +++ b/lib/rbtree.c @@ -518,3 +518,43 @@ void rb_replace_node(struct rb_node *victim, struct rb_node *new, *new = *victim; } EXPORT_SYMBOL(rb_replace_node); + +static struct rb_node *rb_left_deepest_node(const struct rb_node *node) +{ + for (;;) { + if (node->rb_left) + node = node->rb_left; + else if (node->rb_right) + node = node->rb_right; + else + return (struct rb_node *)node; + } +} + +struct rb_node *rb_next_postorder(const struct rb_node *node) +{ + const struct rb_node *parent; + if (!node) + return NULL; + parent = rb_parent(node); + + /* If we're sitting on node, we've already seen our children */ + if (parent && node == parent->rb_left && parent->rb_right) { + /* If we are the parent's left node, go to the parent's right +* node then all the way down to the left */ + return rb_left_deepest_node(parent->rb_right); + } else + /* Otherwise we are the parent's right node, and the parent +* should be next */ + return (struct rb_node *)parent; +} +EXPORT_SYMBOL(rb_next_postorder); + +struct rb_node *rb_first_postorder(const struct rb_root *root) +{ + if (!root->rb_node) + return NULL; + + return rb_left_deepest_node(root->rb_node); +} +EXPORT_SYMBOL(rb_first_postorder); -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
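For reference, tearing down an entire tree with just these two functions takes roughly the following shape -- a kernel-context sketch with a hypothetical entry type. The key point is fetching the successor before freeing the current node; postorder guarantees the successor is never inside the freed node's subtree:

    struct foo {            /* hypothetical entry type */
        struct rb_node node;
    };

    static void free_tree(struct rb_root *root)
    {
        struct rb_node *rb = rb_first_postorder(root);

        while (rb) {
            /* grab the successor first, then the node can be freed */
            struct rb_node *next = rb_next_postorder(rb);

            kfree(rb_entry(rb, struct foo, node));
            rb = next;
        }
        *root = RB_ROOT;
    }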
[PATCH v2 5/5] mm/zswap: use postorder iteration when destroying rbtree
Signed-off-by: Cody P Schafer Reviewed-by: Seth Jennings --- mm/zswap.c | 16 ++-- 1 file changed, 2 insertions(+), 14 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index deda2b6..5c853b2 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -790,26 +790,14 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset) static void zswap_frontswap_invalidate_area(unsigned type) { struct zswap_tree *tree = zswap_trees[type]; - struct rb_node *node; - struct zswap_entry *entry; + struct zswap_entry *entry, *n; if (!tree) return; /* walk the tree and free everything */ spin_lock(&tree->lock); - /* -* TODO: Even though this code should not be executed because -* the try_to_unuse() in swapoff should have emptied the tree, -* it is very wasteful to rebalance the tree after every -* removal when we are freeing the whole tree. -* -* If post-order traversal code is ever added to the rbtree -* implementation, it should be used here. -*/ - while ((node = rb_first(&tree->rbroot))) { - entry = rb_entry(node, struct zswap_entry, rbnode); - rb_erase(&entry->rbnode, &tree->rbroot); + rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode) { zbud_free(tree->pool, entry->handle); zswap_entry_cache_free(entry); atomic_dec(&zswap_stored_pages); -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 0/5] Add rbtree postorder iteration functions, runtime tests, and update zswap to use
Postorder iteration yields all of a node's children prior to yielding the node itself, and this particular implementation also avoids examining the leaf links in a node after that node has been yielded. In what I expect will be its most common usage, postorder iteration allows the deletion of every node in an rbtree without modifying the rbtree nodes (no _requirement_ that they be nulled) while avoiding referencing child nodes after they have been "deleted" (most commonly, freed). I have only updated zswap to use this functionality at this point, but numerous bits of code (most notably in the filesystem drivers) use a hand rolled postorder iteration that NULLs child links as it traverses the tree. Each of those instances could be replaced with this common implementation. 1 & 2 add rbtree postorder iteration functions. 3 adds testing of the iteration to the rbtree runtime tests 4 allows building the rbtree runtime tests as builtins 5 updates zswap. -- since v1: - spacing - s/it's/its/ - remove now unused var in zswap code. - Reviewed-by: Seth Jennings Cody P Schafer (5): rbtree: add postorder iteration functions rbtree: add rbtree_postorder_for_each_entry_safe() helper rbtree_test: add test for postorder iteration rbtree: allow tests to run as builtin mm/zswap: use postorder iteration when destroying rbtree include/linux/rbtree.h | 22 ++ lib/Kconfig.debug | 2 +- lib/rbtree.c | 40 lib/rbtree_test.c | 12 mm/zswap.c | 16 ++-- 5 files changed, 77 insertions(+), 15 deletions(-) -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 3/5] rbtree_test: add test for postorder iteration
Just check that we examine all nodes in the tree for the postorder iteration. Signed-off-by: Cody P Schafer Reviewed-by: Seth Jennings --- lib/rbtree_test.c | 12 1 file changed, 12 insertions(+) diff --git a/lib/rbtree_test.c b/lib/rbtree_test.c index 122f02f..31dd4cc 100644 --- a/lib/rbtree_test.c +++ b/lib/rbtree_test.c @@ -114,6 +114,16 @@ static int black_path_count(struct rb_node *rb) return count; } +static void check_postorder(int nr_nodes) +{ + struct rb_node *rb; + int count = 0; + for (rb = rb_first_postorder(&root); rb; rb = rb_next_postorder(rb)) + count++; + + WARN_ON_ONCE(count != nr_nodes); +} + static void check(int nr_nodes) { struct rb_node *rb; @@ -136,6 +146,8 @@ static void check(int nr_nodes) WARN_ON_ONCE(count != nr_nodes); WARN_ON_ONCE(count < (1 << black_path_count(rb_last(&root))) - 1); + + check_postorder(nr_nodes); } static void check_augmented(int nr_nodes) -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 2/5] rbtree: add rbtree_postorder_for_each_entry_safe() helper
Because deletion (of the entire tree) is a relatively common use of the rbtree_postorder iteration, and because doing it safely means fiddling with temporary storage, provide a helper to simplify postorder rbtree iteration. Signed-off-by: Cody P Schafer Reviewed-by: Seth Jennings --- include/linux/rbtree.h | 18 ++ 1 file changed, 18 insertions(+) diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h index c467151..aa870a4 100644 --- a/include/linux/rbtree.h +++ b/include/linux/rbtree.h @@ -85,4 +85,22 @@ static inline void rb_link_node(struct rb_node * node, struct rb_node * parent, *rb_link = node; } +/** + * rbtree_postorder_for_each_entry_safe - iterate over rb_root in post order of + * given type safe against removal of rb_node entry + * + * @pos: the 'type *' to use as a loop cursor. + * @n: another 'type *' to use as temporary storage + * @root: 'rb_root *' of the rbtree. + * @field: the name of the rb_node field within 'type'. + */ +#define rbtree_postorder_for_each_entry_safe(pos, n, root, field) \ + for (pos = rb_entry(rb_first_postorder(root), typeof(*pos), field),\ + n = rb_entry(rb_next_postorder(&pos->field), \ + typeof(*pos), field); \ +&pos->field; \ +pos = n, \ + n = rb_entry(rb_next_postorder(&pos->field), \ + typeof(*pos), field)) + #endif /* _LINUX_RBTREE_H */ -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
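A minimal caller of the helper, again with a hypothetical entry type (the zswap conversion in patch 5 is the real in-tree user):

    struct foo {            /* hypothetical entry type */
        struct rb_node rbnode;
    };

    static void free_all(struct rb_root *root)
    {
        struct foo *pos, *n;

        /* 'n' holds the postorder successor, so freeing 'pos' is safe */
        rbtree_postorder_for_each_entry_safe(pos, n, root, rbnode)
            kfree(pos);

        *root = RB_ROOT;
    }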
[PATCH v2 4/5] rbtree: allow tests to run as builtin
No reason to require rbtree test code to be a module; allow it to be builtin (streamlines my development process) Signed-off-by: Cody P Schafer Reviewed-by: Seth Jennings --- lib/Kconfig.debug | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 1501aa5..606e3c8 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1442,7 +1442,7 @@ config BACKTRACE_SELF_TEST config RBTREE_TEST tristate "Red-Black tree test" - depends on m && DEBUG_KERNEL + depends on DEBUG_KERNEL help A benchmark measuring the performance of the rbtree library. Also includes rbtree invariant checks. -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm/hotplug: fix a drain pcp bug when offline pages
On 08/01/2013 02:18 AM, Xishi Qiu wrote: __offline_pages() start_isolate_page_range() set_migratetype_isolate() set_pageblock_migratetype() -> this pageblock will be marked as MIGRATE_ISOLATE move_freepages_block() -> pages in PageBuddy will be moved into MIGRATE_ISOLATE list drain_all_pages() -> drain PCP free_pcppages_bulk() mt = get_freepage_migratetype(page); -> PCP's migratetype is not MIGRATE_ISOLATE __free_one_page(page, zone, 0, mt); -> so PCP will not be freed into the MIGRATE_ISOLATE list In this case, the PCP may be allocated again, because they are not in PageBuddy's MIGRATE_ISOLATE list. This will cause offline_pages to fail. Signed-off-by: Xishi Qiu --- mm/page_alloc.c | 10 ++ mm/page_isolation.c | 15 ++- 2 files changed, 20 insertions(+), 5 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b100255..d873471 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -965,11 +965,13 @@ int move_freepages(struct zone *zone, } order = page_order(page); - list_move(&page->lru, - &zone->free_area[order].free_list[migratetype]); - set_freepage_migratetype(page, migratetype); + if (get_freepage_migratetype(page) != migratetype) { + list_move(&page->lru, + &zone->free_area[order].free_list[migratetype]); + set_freepage_migratetype(page, migratetype); + pages_moved += 1 << order; + } page += 1 << order; - pages_moved += 1 << order; So this looks like it changes the return value of move_freepages() from "the pages now belonging to the passed migrate type" to "the pages actually moved". The user of move_freepages_block()'s return value (and thus the return value of move_freepages()) in mm/page_alloc.c expects that it is the original meaning. The users in page_isolation.c expect it is the new meaning. Those need to be reconciled. } return pages_moved; diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 383bdbb..ba1afc9 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -65,8 +65,21 @@ out: } spin_unlock_irqrestore(&zone->lock, flags); - if (!ret) + + if (!ret) { drain_all_pages(); + /* +* When drain_all_pages() frees cached pages into the buddy +* system, it uses the stale migratetype cached in the +* page->index field, so try to move free pages to ISOLATE +* list again. +*/ + spin_lock_irqsave(&zone->lock, flags); + nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE); + __mod_zone_freepage_state(zone, -nr_pages, migratetype); + spin_unlock_irqrestore(&zone->lock, flags); + } + Could we teach drain_all_pages() to use the right migrate type instead (or add something similar that does)? (pages could be reallocated between the drain_all_pages() and move_freepages_block()). return ret; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
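The window mentioned at the end -- pages becoming reallocatable between the two calls -- as an interleaving sketch (hypothetical timing, not a captured trace):

    /*
     *  offline path                        allocator on another CPU
     *  ------------                        ------------------------
     *  set_pageblock_migratetype(ISOLATE)
     *  move_freepages_block()
     *  drain_all_pages()
     *                                      pcp pages freed with the stale
     *                                      cached migratetype land on a
     *                                      non-ISOLATE free list, and
     *                                      rmqueue() can hand them out
     *  move_freepages_block()  <- the second pass comes too late for
     *                             pages that were already reallocated
     */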
Re: [PATCH v2] mm: make lru_add_drain_all() selective
On 08/06/2013 01:22 PM, Chris Metcalf wrote: [...] /** + * schedule_on_each_cpu - execute a function synchronously on each online CPU + * @func: the function to call + * + * schedule_on_each_cpu() executes @func on each online CPU using the + * system workqueue and blocks until all CPUs have completed. + * schedule_on_each_cpu() is very slow. + * + * RETURNS: + * 0 on success, -errno on failure. + */ +int schedule_on_each_cpu(work_func_t func) +{ + get_online_cpus(); + schedule_on_cpu_mask(func, cpu_online_mask); Looks like the return value gets lost here. + put_online_cpus(); + return 0; +} + +/** * flush_scheduled_work - ensure that any scheduled work has run to completion. * * Forces execution of the kernel-global workqueue and blocks until its [...] -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
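A sketch of the fix being suggested, assuming schedule_on_cpu_mask() returns 0 or -errno as in the quoted patch:

    int schedule_on_each_cpu(work_func_t func)
    {
        int ret;

        get_online_cpus();
        ret = schedule_on_cpu_mask(func, cpu_online_mask);
        put_online_cpus();
        return ret;    /* propagate the error instead of discarding it */
    }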
Re: [PATCH 13/21] x86, acpi: Try to find SRAT in firmware earlier.
On 07/19/2013 12:59 AM, Tang Chen wrote: This patch introduce early_acpi_firmware_srat() to find the phys addr of SRAT provided by firmware. And call it in reserve_hotpluggable_memory(). Since we have initialized acpi_gbl_root_table_list earlier, and store all the tables' phys addrs and signatures in it, it is easy to find the SRAT. Signed-off-by: Tang Chen --- drivers/acpi/acpica/tbxface.c | 34 ++ drivers/acpi/osl.c| 24 include/acpi/acpixf.h |4 include/linux/acpi.h |4 mm/memory_hotplug.c | 10 +++--- 5 files changed, 73 insertions(+), 3 deletions(-) diff --git a/drivers/acpi/acpica/tbxface.c b/drivers/acpi/acpica/tbxface.c index ad11162..95f8d1b 100644 --- a/drivers/acpi/acpica/tbxface.c +++ b/drivers/acpi/acpica/tbxface.c @@ -181,6 +181,40 @@ acpi_status acpi_reallocate_root_table(void) return_ACPI_STATUS(status); } +/* + * acpi_get_table_desc - Get the acpi table descriptor of a specific table. + * @signature: The signature of the table to be found. + * @out_desc: The out returned descriptor. The "@out_desc:" line looks funky. Also, I believe changes to this file need to go in via acpica & probably conform to their commenting standards? + * + * This function iterates acpi_gbl_root_table_list and find the specified + * table's descriptor. + * + * NOTE: The caller has the responsibility to allocate memory for @out_desc. + * + * Return AE_OK on success, AE_NOT_FOUND if the table is not found. + */ +acpi_status acpi_get_table_desc(char *signature, + struct acpi_table_desc *out_desc) +{ + int pos; + + for (pos = 0; +pos < acpi_gbl_root_table_list.current_table_count; +pos++) { + if (!ACPI_COMPARE_NAME + (&(acpi_gbl_root_table_list.tables[pos].signature), + signature)) + continue; + + memcpy(out_desc, &acpi_gbl_root_table_list.tables[pos], + sizeof(struct acpi_table_desc)); + + return_ACPI_STATUS(AE_OK); + } + + return_ACPI_STATUS(AE_NOT_FOUND); +} + /*** * * FUNCTION:acpi_get_table_header diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c index fa6b973..a2e4596 100644 --- a/drivers/acpi/osl.c +++ b/drivers/acpi/osl.c @@ -53,6 +53,7 @@ #include #include #include +#include #define _COMPONENTACPI_OS_SERVICES ACPI_MODULE_NAME("osl"); @@ -750,6 +751,29 @@ void __init acpi_initrd_override(void *data, size_t size) } #endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */ +#ifdef CONFIG_ACPI_NUMA +#include +#include + +/* + * early_acpi_firmware_srat - Get the phys addr of SRAT provide by firmware. s/provide/provided/ + * + * This function iterate acpi_gbl_root_table_list, find SRAT and return the Perhaps: "Iterate over acpi_gbl_root_table_list to find SRAT then return its phys addr" Though I wonder if this comment is even needed, as the iteration is done in acpi_get_table_desc() (added above). + * phys addr of SRAT. + * + * Return the phys addr of SRAT, or 0 on error. + */ +phys_addr_t __init early_acpi_firmware_srat() +{ + struct acpi_table_desc table_desc; + + if (acpi_get_table_desc(ACPI_SIG_SRAT, &table_desc)) + return 0; + + return table_desc.address; +} +#endif /* CONFIG_ACPI_NUMA */ + static void acpi_table_taint(struct acpi_table_header *table) { pr_warn(PREFIX [...] 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 066873e..15b11d3 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -106,10 +106,14 @@ void __init reserve_hotpluggable_memory(void) { phys_addr_t srat_paddr; - /* Try to find if SRAT is overrided */ + /* Try to find out if SRAT is overrided */ srat_paddr = early_acpi_override_srat(); - if (!srat_paddr) - return; + if (!srat_paddr) { + /* Try to find SRAT from firmware if it wasn't overrided */ s/overrided/overridden/ + srat_paddr = early_acpi_firmware_srat(); + if (!srat_paddr) + return; + } /* Will reserve hotpluggable memory here */ } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 00/11] mm: fixup changers of per cpu pageset's ->high and ->batch
"Problems" with the current code: 1. there is a lack of synchronization in setting ->high and ->batch in percpu_pagelist_fraction_sysctl_handler() 2. stop_machine() in zone_pcp_update() is unnecissary. 3. zone_pcp_update() does not consider the case where percpu_pagelist_fraction is non-zero To fix: 1. add memory barriers, a safe ->batch value, an update side mutex when updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users that expect a stable value. 2. avoid draining pages in zone_pcp_update(), rely upon the memory barriers added to fix #1 3. factor out quite a few functions, and then call the appropriate one. Note that it results in a change to the behavior of zone_pcp_update(), which is used by memory_hotplug. I'm rather certain that I've diserned (and preserved) the essential behavior (changing ->high and ->batch), and only eliminated unneeded actions (draining the per cpu pages), but this may not be the case. Further note that the draining of pages that previously took place in zone_pcp_update() occured after repeated draining when attempting to offline a page, and after the offline has "succeeded". It appears that the draining was added to zone_pcp_update() to avoid refactoring setup_pageset() into 2 funtions. -- Since v2: https://lkml.org/lkml/2013/4/9/718 - note ACCESS_ONCE() in fix #1 above. - consolidate ->batch & ->high update protocol into a single funtion (Gilad). - add missing ACCESS_ONCE() on ->batch Since v1: https://lkml.org/lkml/2013/4/5/444 - instead of using on_each_cpu(), use memory barriers (Gilad) and an update side mutex. - add "Problem" #3 above, and fix. - rename function to match naming style of similar function - move unrelated comment -- Cody P Schafer (11): mm/page_alloc: factor out setting of pcp->high and pcp->batch. mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high mm/page_alloc: insert memory barriers to allow async update of pcp batch and high mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine() mm/page_alloc: when handling percpu_pagelist_fraction, don't unneedly recalulate high mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch() mm/page_alloc: relocate comment to be directly above code it refers to. mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset() mm/page_alloc: in zone_pcp_update(), uze zone_pageset_init() mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch() mm/page_alloc.c | 151 +--- 1 file changed, 90 insertions(+), 61 deletions(-) -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 01/11] mm/page_alloc: factor out setting of pcp->high and pcp->batch.
Creates pageset_set_batch() for use in setup_pageset(). pageset_set_batch() imitates the functionality of setup_pagelist_highmark(), but uses the boot time (percpu_pagelist_fraction == 0) calculations for determining ->high based on ->batch. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 98cbdf6..9e556e6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4030,6 +4030,14 @@ static int __meminit zone_batchsize(struct zone *zone) #endif } +/* a companion to setup_pagelist_highmark() */ +static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) +{ + struct per_cpu_pages *pcp = &p->pcp; + pcp->high = 6 * batch; + pcp->batch = max(1UL, 1 * batch); +} + static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) { struct per_cpu_pages *pcp; @@ -4039,8 +4047,7 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) pcp = &p->pcp; pcp->count = 0; - pcp->high = 6 * batch; - pcp->batch = max(1UL, 1 * batch); + pageset_set_batch(p, batch); for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) INIT_LIST_HEAD(&pcp->lists[migratetype]); } @@ -4049,7 +4056,6 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist * to the value high for the pageset p. */ - static void setup_pagelist_highmark(struct per_cpu_pageset *p, unsigned long high) { -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 02/11] mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high
Because we are going to rely upon a careful transition between old and new ->high and ->batch values using memory barriers and will remove stop_machine(), we need to prevent multiple updaters from interweaving their memory writes. Add a simple mutex to protect both update loops. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 8 1 file changed, 8 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9e556e6..cea883d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -65,6 +65,9 @@ #include #include "internal.h" +/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ +static DEFINE_MUTEX(pcp_batch_high_lock); + #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID DEFINE_PER_CPU(int, numa_node); EXPORT_PER_CPU_SYMBOL(numa_node); @@ -,6 +5558,8 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, ret = proc_dointvec_minmax(table, write, buffer, length, ppos); if (!write || (ret < 0)) return ret; + + mutex_lock(&pcp_batch_high_lock); for_each_populated_zone(zone) { for_each_possible_cpu(cpu) { unsigned long high; @@ -5563,6 +5568,7 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, per_cpu_ptr(zone->pageset, cpu), high); } } + mutex_unlock(&pcp_batch_high_lock); return 0; } @@ -6076,7 +6082,9 @@ static int __meminit __zone_pcp_update(void *data) void __meminit zone_pcp_update(struct zone *zone) { + mutex_lock(&pcp_batch_high_lock); stop_machine(__zone_pcp_update, zone, NULL); + mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 05/11] mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine()
zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a percpu pageset based on a zone's ->managed_pages. We don't need to drain the entire percpu pageset just to modify these fields. This lets us avoid calling setup_pageset() (and the draining required to call it) and instead allows simply setting the fields' values (with some attention paid to memory barriers to prevent the relationship between ->batch and ->high from being thrown off). This does change the behavior of zone_pcp_update() as the percpu pagesets will not be drained when zone_pcp_update() is called (they will end up being shrunk, not completely drained, later when a 0-order page is freed in free_hot_cold_page()). Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 33 + 1 file changed, 9 insertions(+), 24 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 71d843d..5a00195 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6083,33 +6083,18 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages) #endif #ifdef CONFIG_MEMORY_HOTPLUG -static int __meminit __zone_pcp_update(void *data) -{ - struct zone *zone = data; - int cpu; - unsigned long batch = zone_batchsize(zone), flags; - - for_each_possible_cpu(cpu) { - struct per_cpu_pageset *pset; - struct per_cpu_pages *pcp; - - pset = per_cpu_ptr(zone->pageset, cpu); - pcp = &pset->pcp; - - local_irq_save(flags); - if (pcp->count > 0) - free_pcppages_bulk(zone, pcp->count, pcp); - drain_zonestat(zone, pset); - setup_pageset(pset, batch); - local_irq_restore(flags); - } - return 0; -} - +/* + * The zone indicated has a new number of managed_pages; batch sizes and percpu + * page high values need to be recalculated. + */ void __meminit zone_pcp_update(struct zone *zone) { + unsigned cpu; + unsigned long batch; mutex_lock(&pcp_batch_high_lock); + batch = zone_batchsize(zone); + for_each_possible_cpu(cpu) + pageset_set_batch(per_cpu_ptr(zone->pageset, cpu), batch); mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 07/11] mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch()
Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3583281..696ce96 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4069,7 +4069,7 @@ static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch)); } -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) +static void pageset_init(struct per_cpu_pageset *p) { struct per_cpu_pages *pcp; int migratetype; @@ -4078,11 +4078,16 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) pcp = &p->pcp; pcp->count = 0; - pageset_set_batch(p, batch); for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) INIT_LIST_HEAD(&pcp->lists[migratetype]); } +static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) +{ + pageset_init(p); + pageset_set_batch(p, batch); +} + /* * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist * to the value high for the pageset p. -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 09/11] mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset()
Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 27 +++ 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 53c62c5..0c3cdbb6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4102,22 +4102,25 @@ static void setup_pagelist_highmark(struct per_cpu_pageset *p, pageset_update(&p->pcp, high, batch); } +static void __meminit zone_pageset_init(struct zone *zone, int cpu) +{ + struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu); + + pageset_init(pcp); + if (percpu_pagelist_fraction) + setup_pagelist_highmark(pcp, + (zone->managed_pages / + percpu_pagelist_fraction)); + else + pageset_set_batch(pcp, zone_batchsize(zone)); +} + static void __meminit setup_zone_pageset(struct zone *zone) { int cpu; - zone->pageset = alloc_percpu(struct per_cpu_pageset); - - for_each_possible_cpu(cpu) { - struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu); - - setup_pageset(pcp, zone_batchsize(zone)); - - if (percpu_pagelist_fraction) - setup_pagelist_highmark(pcp, - (zone->managed_pages / - percpu_pagelist_fraction)); - } + for_each_possible_cpu(cpu) + zone_pageset_init(zone, cpu); } /* -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 10/11] mm/page_alloc: in zone_pcp_update(), use zone_pageset_init()
Previously, zone_pcp_update() called pageset_set_batch() directly, essentially assuming that percpu_pagelist_fraction == 0. Correct this by calling zone_pageset_init(), which chooses the appropriate ->batch and ->high calculations. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0c3cdbb6..251fb5f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6096,11 +6096,9 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages) void __meminit zone_pcp_update(struct zone *zone) { unsigned cpu; - unsigned long batch; mutex_lock(&pcp_batch_high_lock); - batch = zone_batchsize(zone); for_each_possible_cpu(cpu) - pageset_set_batch(per_cpu_ptr(zone->pageset, cpu), batch); + zone_pageset_init(zone, cpu); mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 08/11] mm/page_alloc: relocate comment to be directly above code it refers to.
Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 696ce96..53c62c5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3709,12 +3709,12 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone) mminit_verify_zonelist(); cpuset_init_current_mems_allowed(); } else { - /* we have to stop all cpus to guarantee there is no user - of zonelist */ #ifdef CONFIG_MEMORY_HOTPLUG if (zone) setup_zone_pageset(zone); #endif + /* we have to stop all cpus to guarantee there is no user + of zonelist */ stop_machine(__build_all_zonelists, pgdat, NULL); /* cpuset refresh routine should be here */ } -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 11/11] mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch()
Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 251fb5f..b335c98 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4063,7 +4063,7 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high, pcp->batch = batch; } -/* a companion to setup_pagelist_highmark() */ +/* a companion to pageset_set_high() */ static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) { pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch)); @@ -4089,10 +4089,10 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) } /* - * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist + * pageset_set_high() sets the high water mark for hot per_cpu_pagelist * to the value high for the pageset p. */ -static void setup_pagelist_highmark(struct per_cpu_pageset *p, +static void pageset_set_high(struct per_cpu_pageset *p, unsigned long high) { unsigned long batch = max(1UL, high / 4); @@ -4108,7 +4108,7 @@ static void __meminit zone_pageset_init(struct zone *zone, int cpu) pageset_init(pcp); if (percpu_pagelist_fraction) - setup_pagelist_highmark(pcp, + pageset_set_high(pcp, (zone->managed_pages / percpu_pagelist_fraction)); else @@ -5597,8 +5597,8 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, unsigned long high; high = zone->managed_pages / percpu_pagelist_fraction; for_each_possible_cpu(cpu) - setup_pagelist_highmark( - per_cpu_ptr(zone->pageset, cpu), high); + pageset_set_high(per_cpu_ptr(zone->pageset, cpu), +high); } mutex_unlock(&pcp_batch_high_lock); return 0; -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 06/11] mm/page_alloc: when handling percpu_pagelist_fraction, don't needlessly recalculate high
Simply moves calculation of the new 'high' value outside the for_each_possible_cpu() loop, as it does not depend on the cpu. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5a00195..3583281 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5573,7 +5573,6 @@ int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write, * cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist * can have before it gets flushed back to buddy allocator. */ - int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { @@ -5587,12 +5586,11 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, mutex_lock(&pcp_batch_high_lock); for_each_populated_zone(zone) { - for_each_possible_cpu(cpu) { - unsigned long high; - high = zone->managed_pages / percpu_pagelist_fraction; + unsigned long high; + high = zone->managed_pages / percpu_pagelist_fraction; + for_each_possible_cpu(cpu) setup_pagelist_highmark( - per_cpu_ptr(zone->pageset, cpu), high); - } + per_cpu_ptr(zone->pageset, cpu), high); } mutex_unlock(&pcp_batch_high_lock); return 0; -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 03/11] mm/page_alloc: insert memory barriers to allow async update of pcp batch and high
Introduce pageset_update() to perform a safe transition from one set of pcp->{batch,high} to a new set using memory barriers. This ensures that batch is always set to a safe value (1) prior to updating high, and ensures that high is fully updated before setting the real value of batch. It avoids ->batch ever rising above ->high. Suggested by Gilad Ben-Yossef in these threads: https://lkml.org/lkml/2013/4/9/23 https://lkml.org/lkml/2013/4/10/49 Also reproduces his proposed comment. Reviewed-by: Gilad Ben-Yossef Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 41 - 1 file changed, 32 insertions(+), 9 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index cea883d..7e45b91 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4033,12 +4033,37 @@ static int __meminit zone_batchsize(struct zone *zone) #endif } +/* + * pcp->high and pcp->batch values are related and dependent on one another: + * ->batch must never be higher than ->high. + * The following function updates them in a safe manner without read side + * locking. + * + * Any new users of pcp->batch and pcp->high should ensure they can cope with + * those fields changing asynchronously (according to the above rule). + * + * mutex_is_locked(&pcp_batch_high_lock) required when calling this function + * outside of boot time (or some other assurance that no concurrent updaters + * exist). + */ +static void pageset_update(struct per_cpu_pages *pcp, unsigned long high, + unsigned long batch) +{ + /* start with a fail safe value for batch */ + pcp->batch = 1; + smp_wmb(); + + /* Update high, then batch, in order */ + pcp->high = high; + smp_wmb(); + + pcp->batch = batch; +} + /* a companion to setup_pagelist_highmark() */ static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) { - struct per_cpu_pages *pcp = &p->pcp; - pcp->high = 6 * batch; - pcp->batch = max(1UL, 1 * batch); + pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch)); } static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) @@ -4062,13 +4087,11 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) static void setup_pagelist_highmark(struct per_cpu_pageset *p, unsigned long high) { - struct per_cpu_pages *pcp; + unsigned long batch = max(1UL, high / 4); + if ((high / 4) > (PAGE_SHIFT * 8)) + batch = PAGE_SHIFT * 8; - pcp = &p->pcp; - pcp->high = high; - pcp->batch = max(1UL, high/4); - if ((high/4) > (PAGE_SHIFT * 8)) - pcp->batch = PAGE_SHIFT * 8; + pageset_update(&p->pcp, high, batch); } static void __meminit setup_zone_pageset(struct zone *zone) -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND v3 04/11] mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE
pcp->batch could change at any point, avoid relying on it being a stable value. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7e45b91..71d843d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1182,10 +1182,12 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) { unsigned long flags; int to_drain; + unsigned long batch; local_irq_save(flags); - if (pcp->count >= pcp->batch) - to_drain = pcp->batch; + batch = ACCESS_ONCE(pcp->batch); + if (pcp->count >= batch) + to_drain = batch; else to_drain = pcp->count; if (to_drain > 0) { @@ -1353,8 +1355,9 @@ void free_hot_cold_page(struct page *page, int cold) list_add(&page->lru, &pcp->lists[migratetype]); pcp->count++; if (pcp->count >= pcp->high) { - free_pcppages_bulk(zone, pcp->batch, pcp); - pcp->count -= pcp->batch; + unsigned long batch = ACCESS_ONCE(pcp->batch); + free_pcppages_bulk(zone, batch, pcp); + pcp->count -= batch; } out: -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
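To illustrate why the snapshot matters: without ACCESS_ONCE() the compiler is free to reload pcp->batch at each use, so a concurrent pageset_update() can make the two reads disagree (fragments for illustration only):

    /* racy: two independent loads of pcp->batch */
    if (pcp->count >= pcp->batch)
        free_pcppages_bulk(zone, pcp->batch, pcp);

    /* fixed: one snapshot, used consistently for test and free */
    unsigned long batch = ACCESS_ONCE(pcp->batch);
    if (pcp->count >= batch)
        free_pcppages_bulk(zone, batch, pcp);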
Re: [PATCH RESEND v3 00/11] mm: fixup changers of per cpu pageset's ->high and ->batch
On 05/13/2013 12:20 PM, Pekka Enberg wrote: Hi Cody, On Mon, May 13, 2013 at 10:08 PM, Cody P Schafer wrote: "Problems" with the current code: 1. there is a lack of synchronization in setting ->high and ->batch in percpu_pagelist_fraction_sysctl_handler() 2. stop_machine() in zone_pcp_update() is unnecessary. 3. zone_pcp_update() does not consider the case where percpu_pagelist_fraction is non-zero Maybe it's just me but I find the above problem description confusing. How does the problem manifest itself? 1. I've not reproduced this causing issues. 2. Calling zone_pcp_update() is slow. 3. Not reproduced either, but would cause percpu_pagelist_fraction (set via sysctl) to be ignored after a call to zone_pcp_update() (for example, after a memory hotplug). How did you find out about it? I'm writing some code that resizes zones and thus uses zone_pcp_update(), and fixing broken things along the way. Why do we need to fix all three problems in the same patch set? They all affect the same bit of code and fixing all of them means restructuring both of the sites where ->high and ->batch are set. Additionally, splitting it out (if possible) would make it less clear what the overall goal is, and would mean a few inter-patchset dependencies, which are undesirable. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 0/4] misc patches related to resizing nodes & zones
The first two are comment fixes; the last two add pgdat_resize_lock()/unlock() usage per the existing documentation. -- Since v2 (http://comments.gmane.org/gmane.linux.kernel.mm/99316): - add ack on patch 1 from rientjes. - quote documentation in patch 3 & 4. -- Since v1 (http://thread.gmane.org/gmane.linux.kernel.mm/99297): - drop making lock_memory_hotplug() required (old patch #1) - fix __offline_pages() in the same manner as online_pages() (rientjes) - make comment regarding pgdat_resize_lock()/unlock() usage more clear (rientjes) Cody P Schafer (4): mm: fix comment referring to non-existent size_seqlock, change to span_seqlock mmzone: note that node_size_lock should be manipulated via pgdat_resize_lock() memory_hotplug: use pgdat_resize_lock() in online_pages() memory_hotplug: use pgdat_resize_lock() in __offline_pages() include/linux/mmzone.h | 5 - mm/memory_hotplug.c| 9 + 2 files changed, 13 insertions(+), 1 deletion(-) -- 1.8.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 1/4] mm: fix comment referring to non-existent size_seqlock, change to span_seqlock
Signed-off-by: Cody P Schafer Acked-by: David Rientjes --- include/linux/mmzone.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 5c76737..fc859a0c 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -716,7 +716,7 @@ typedef struct pglist_data { * or node_spanned_pages stay constant. Holding this will also * guarantee that any pfn_valid() stays that way. * -* Nests above zone->lock and zone->size_seqlock. +* Nests above zone->lock and zone->span_seqlock */ spinlock_t node_size_lock; #endif -- 1.8.2.2
[PATCH v3 2/4] mmzone: note that node_size_lock should be manipulated via pgdat_resize_lock()
Signed-off-by: Cody P Schafer --- include/linux/mmzone.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fc859a0c..41557be 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -716,6 +716,9 @@ typedef struct pglist_data { * or node_spanned_pages stay constant. Holding this will also * guarantee that any pfn_valid() stays that way. * +* pgdat_resize_lock() and pgdat_resize_unlock() are provided to +* manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG. +* * Nests above zone->lock and zone->span_seqlock */ spinlock_t node_size_lock; -- 1.8.2.2
[PATCH v3 3/4] memory_hotplug: use pgdat_resize_lock() in online_pages()
mmzone.h documents node_size_lock (which pgdat_resize_lock() locks) as follows: * Must be held any time you expect node_start_pfn, node_present_pages * or node_spanned_pages stay constant. [...] So actually hold it when we update node_present_pages in online_pages(). Signed-off-by: Cody P Schafer --- mm/memory_hotplug.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index a221fac..0bdca10 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -915,6 +915,7 @@ static void node_states_set_node(int node, struct memory_notify *arg) int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type) { + unsigned long flags; unsigned long onlined_pages = 0; struct zone *zone; int need_zonelists_rebuild = 0; @@ -993,7 +994,11 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ zone->managed_pages += onlined_pages; zone->present_pages += onlined_pages; + + pgdat_resize_lock(zone->zone_pgdat, &flags); zone->zone_pgdat->node_present_pages += onlined_pages; + pgdat_resize_unlock(zone->zone_pgdat, &flags); + if (onlined_pages) { node_states_set_node(zone_to_nid(zone), &arg); if (need_zonelists_rebuild) -- 1.8.2.2
[PATCH v3 4/4] memory_hotplug: use pgdat_resize_lock() in __offline_pages()
mmzone.h documents node_size_lock (which pgdat_resize_lock() locks) as follows: * Must be held any time you expect node_start_pfn, node_present_pages * or node_spanned_pages stay constant. [...] So actually hold it when we update node_present_pages in __offline_pages(). Signed-off-by: Cody P Schafer --- mm/memory_hotplug.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 0bdca10..b59a695 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1582,7 +1582,11 @@ repeat: /* removal success */ zone->managed_pages -= offlined_pages; zone->present_pages -= offlined_pages; + + pgdat_resize_lock(zone->zone_pgdat, &flags); zone->zone_pgdat->node_present_pages -= offlined_pages; + pgdat_resize_unlock(zone->zone_pgdat, &flags); + totalram_pages -= offlined_pages; init_per_zone_wmark_min(); -- 1.8.2.2
[PATCH] checkpatch: avoid warning on page flag accessors
Using Page*() triggers a camelcase warning, but shouldn't. Introduced by be987d9f80354e2e919926349282facd74992f90, which added the other Page flag users. Pipe ('|') at the end of a grouping doesn't cause the grouping to match an empty string, use ? after the grouping instead to get the desired behavior. Signed-off-by: Cody P Schafer --- scripts/checkpatch.pl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index b954de5..ee24026 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -2939,7 +2939,7 @@ sub process { my $var = $1; if ($var !~ /$Constant/ && $var =~ /[A-Z]\w*[a-z]|[a-z]\w*[A-Z]/ && - $var !~ /"^(?:Clear|Set|TestClear|TestSet|)Page[A-Z]/ && + $var !~ /"^(?:Clear|Set|TestClear|TestSet)?Page[A-Z]/ && !defined $camelcase{$var}) { $camelcase{$var} = 1; WARN("CAMELCASE", -- 1.8.2.2
Re: [PATCH] checkpatch: avoid warning on page flag accessors
On 05/14/2013 09:45 AM, Cody P Schafer wrote: Using Page*() triggers a camelcase warning, but shouldn't. Introduced by be987d9f80354e2e919926349282facd74992f90, which added the other Page flag users. Pipe ('|') at the end of a grouping doesn't cause the grouping to match an empty string, use ? after the grouping instead to get the desired behavior. Signed-off-by: Cody P Schafer --- scripts/checkpatch.pl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index b954de5..ee24026 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -2939,7 +2939,7 @@ sub process { my $var = $1; if ($var !~ /$Constant/ && $var =~ /[A-Z]\w*[a-z]|[a-z]\w*[A-Z]/ && - $var !~ /"^(?:Clear|Set|TestClear|TestSet|)Page[A-Z]/ && + $var !~ /"^(?:Clear|Set|TestClear|TestSet)?Page[A-Z]/ && !defined $camelcase{$var}) { $camelcase{$var} = 1; WARN("CAMELCASE", Nix this, appears to still be broken WARNING: Avoid CamelCase: #51: FILE: mm/page_alloc.c:449: + VM_BUG_ON(!PageBuddy(page)); total: 0 errors, 1 warnings, 46 lines checked /home/cody/mm/0008-page_alloc-add-return_pages_to_zone-when-DYNAMIC_NUM.patch has style problems, please review.
[PATCH v2] checkpatch: avoid warning on page flag accessors
Using Page*() triggers a camelcase warning, but shouldn't. be987d9f80354e2e919926349282facd74992f90 added a spurious '"' (double quote) breaking the regex. Signed-off-by: Cody P Schafer --- scripts/checkpatch.pl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index b954de5..ee24026 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -2939,7 +2939,7 @@ sub process { my $var = $1; if ($var !~ /$Constant/ && $var =~ /[A-Z]\w*[a-z]|[a-z]\w*[A-Z]/ && - $var !~ /"^(?:Clear|Set|TestClear|TestSet|)Page[A-Z]/ && + $var !~ /"^(?:Clear|Set|TestClear|TestSet)?Page[A-Z]/ && !defined $camelcase{$var}) { $camelcase{$var} = 1; WARN("CAMELCASE", -- 1.8.2.2
[PATCH v3] checkpatch: avoid warning on page flag accessors
Using Page*() triggers a camelcase warning, but shouldn't. be987d9f80354e2e919926349282facd74992f90 added a spurious '"' (double quote) breaking the regex. Signed-off-by: Cody P Schafer --- scripts/checkpatch.pl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- Apologies for the noise, I updated my commit message but not the patch in the previous version. diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index b954de5..ee24026 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -2939,7 +2939,7 @@ sub process { my $var = $1; if ($var !~ /$Constant/ && $var =~ /[A-Z]\w*[a-z]|[a-z]\w*[A-Z]/ && - $var !~ /"^(?:Clear|Set|TestClear|TestSet|)Page[A-Z]/ && + $var !~ /"^(?:Clear|Set|TestClear|TestSet)?Page[A-Z]/ && !defined $camelcase{$var}) { $camelcase{$var} = 1; WARN("CAMELCASE", -- 1.8.2.2
Re: [PATCH v2] checkpatch: avoid warning on page flag accessors
On 05/14/2013 10:13 AM, Joe Perches wrote: I think you need to delete the ", leave the | and remove the ? Dunno where that " came from. Ya, commit goof on my part; see v3.
[PATCH v4] checkpatch: avoid warning on page flag accessors
Using Page*() triggers a camelcase warning, but shouldn't. be987d9f80354e2e919926349282facd74992f90 added a spurious '"' (double quote) breaking the regex. Signed-off-by: Cody P Schafer --- scripts/checkpatch.pl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- And using sha1s in format-patch means I have to actually copy the new ones. (Again, sorry). diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index b954de5..d2fe01d 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -2939,7 +2939,7 @@ sub process { my $var = $1; if ($var !~ /$Constant/ && $var =~ /[A-Z]\w*[a-z]|[a-z]\w*[A-Z]/ && - $var !~ /"^(?:Clear|Set|TestClear|TestSet|)Page[A-Z]/ && + $var !~ /^(?:Clear|Set|TestClear|TestSet|)Page[A-Z]/ && !defined $camelcase{$var}) { $camelcase{$var} = 1; WARN("CAMELCASE", -- 1.8.2.2
Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch
On 04/07/2013 08:23 AM, KOSAKI Motohiro wrote: > (4/5/13 4:33 PM), Cody P Schafer wrote: >> In one case while modifying the ->high and ->batch fields of per cpu pagesets >> we're needlessly using stop_machine() (patches 1 & 2), and in another we >> don't have any >> synchronization at all (patch 3). >> >> This patchset fixes both of them. >> >> Note that it results in a change to the behavior of zone_pcp_update(), which >> is >> used by memory_hotplug. I _think_ that I've discerned (and preserved) the >> essential behavior (changing ->high and ->batch), and only eliminated >> unneeded >> actions (draining the per cpu pages), but this may not be the case. > > at least, memory hotplug needs to drain. Could you explain why the drain is required here? From what I can tell, after the stop_machine() completes, the per cpu page sets could be repopulated at any point, making the combination of draining and modifying ->batch & ->high unneeded.
Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.
On 04/08/2013 05:20 AM, Gilad Ben-Yossef wrote: On Fri, Apr 5, 2013 at 11:33 PM, Cody P Schafer wrote: In free_hot_cold_page(), we rely on pcp->batch remaining stable. Updating it without being on the cpu owning the percpu pageset potentially destroys this stability. Change for_each_cpu() to on_each_cpu() to fix. Are you referring to this? - This was the case I noticed. if (pcp->count >= pcp->high) { free_pcppages_bulk(zone, pcp->batch, pcp); pcp->count -= pcp->batch; } I'm probably missing the obvious but won't it be simpler to do this in free_hot_cold_page() - if (pcp->count >= pcp->high) { unsigned int batch = ACCESS_ONCE(pcp->batch); free_pcppages_bulk(zone, batch, pcp); pcp->count -= batch; } Potentially, yes. Note that this was simply the one case I noticed, rather than certainly the only case. I also wonder whether there could be unexpected interactions between ->high and ->batch not changing together atomically. For example, could adjusting this knob cause ->batch to rise enough that it is greater than the previous ->high? If the code above then runs with the previous ->high, ->count wouldn't be correct (checking this inside free_pcppages_bulk() might help on this one issue). Now the batch value used is stable and you don't have to IPI every CPU in the system just to change a config knob... Is this really considered an issue? I wouldn't have expected someone to adjust the config knob often enough (or even more than once) to cause problems. Of course as an "it'd be nice" thing, I completely agree.
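If the defensive check mooted above were wanted, it might look roughly like the following (toy types and names, not the real free_pcppages_bulk() signature, which takes a zone and frees from the pcp's per-migratetype lists): clamp the request to what is actually present and report back how many pages were really freed.

struct toy_pcp {
	int count;	/* pages currently on the per-cpu lists */
};

/*
 * Defensive bulk free: a ->batch that has transiently outgrown the
 * old ->high must not be allowed to drive ->count negative.
 */
static int toy_free_pcppages_bulk(struct toy_pcp *pcp, int to_free)
{
	if (to_free > pcp->count)
		to_free = pcp->count;
	pcp->count -= to_free;
	/* ... here, return 'to_free' pages to the buddy allocator ... */
	return to_free;	/* callers adjust their bookkeeping by this */
}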
Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()
On 04/07/2013 08:39 AM, KOSAKI Motohiro wrote: > (4/5/13 4:33 PM), Cody P Schafer wrote: >> No off-cpu users of the percpu pagesets exist. >> >> zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a >> percpu pageset based on a zone's ->managed_pages. We don't need to drain >> the entire percpu pageset just to modify these fields. Avoid calling >> setup_pageset() (and the draining required to call it) and instead just >> set the fields' values. >> >> This does change the behavior of zone_pcp_update() as the percpu >> pagesets will not be drained when zone_pcp_update() is called (they will >> end up being shrunk, not completely drained, later when a 0-order page >> is freed in free_hot_cold_page()). >> >> Signed-off-by: Cody P Schafer > > NAK. > > 1) zone_pcp_update() is only used from memory hotplug and it requires page > drain. I'm looking at this code because I'm currently working on a patchset which adds another interface which modifies zone sizes, so "only used from memory hotplug" is a temporary thing (unless I discover that zone_pcp_update() is not intended to do what I want it to do). > 2) stop_machine() is used for avoiding race. just removing it is insane. What race? Is there a cross cpu access to ->high & ->batch that makes using on_each_cpu() instead of stop_machine() inappropriate? It is absolutely not just being removed.
Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.
On 04/06/2013 06:56 PM, Simon Jeons wrote: Hi Cody, On 04/06/2013 04:33 AM, Cody P Schafer wrote: In free_hot_cold_page(), we rely on pcp->batch remaining stable. Updating it without being on the cpu owning the percpu pageset potentially destroys this stability. If a cpu is off, can its pcp pageset be used in free_hot_cold_page()? I don't expect it to be, as we use this_cpu_ptr() to access the pcp pageset. Unless there is some crazy mode where we can override the cpu a task believes it is running on...
Re: [PATCH 1/3] mm/page_alloc: factor out setting of pcp->high and pcp->batch.
On 04/06/2013 06:37 PM, Simon Jeons wrote: Hi Cody, On 04/06/2013 04:33 AM, Cody P Schafer wrote: Creates pageset_set_batch() for use in setup_pageset(). pageset_set_batch() imitates the functionality of setup_pagelist_highmark(), but uses the boot time (percpu_pagelist_fraction == 0) calculations for determining ->high. Why do pcp->high and pcp->batch need adjusting while the system is running? What's the requirement? There is currently a sysctl (which I patch later in this series) which allows adjusting the ->high mark (and, indirectly, ->batch). Additionally, memory hotplug changes ->high and ->batch due to the zone size changing (essentially, zone->managed_pages and zone->present_pages have changed), meaning that zone_batchsize(), which is used at boot to set ->batch and (indirectly) ->high, has a different output. Note that in addition to the 2 users of this functionality mentioned here, I'm currently working on another resizer of zones (runtime NUMA reconfiguration).
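To make the sysctl arithmetic concrete, here is a rough user-space rendering of the calculation (illustrative only; it mirrors setup_pagelist_highmark() from the patches in this series and assumes PAGE_SHIFT == 12, i.e. 4K pages): for a zone with 262144 managed pages and percpu_pagelist_fraction = 8, each cpu's pageset ends up with high = 32768 and batch clamped to 96.

#define TOY_PAGE_SHIFT 12	/* assumption: 4K pages */

/* mirrors the ->high/->batch math of setup_pagelist_highmark() */
static void toy_set_highmark(unsigned long managed_pages,
			     unsigned long fraction,
			     unsigned long *high, unsigned long *batch)
{
	*high = managed_pages / fraction;	/* e.g. 262144 / 8 = 32768 */
	*batch = *high / 4;
	if (*batch > TOY_PAGE_SHIFT * 8)
		*batch = TOY_PAGE_SHIFT * 8;	/* clamp: 12 * 8 = 96 */
	if (*batch < 1)
		*batch = 1;			/* never zero */
}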
Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch
On 04/06/2013 06:32 PM, Simon Jeons wrote: Hi Cody, On 04/06/2013 04:33 AM, Cody P Schafer wrote: In one case while modifying the ->high and ->batch fields of per cpu pagesets we're needlessly using stop_machine() (patches 1 & 2), and in another we don't have any synchronization at all (patch 3). Do you mean stop_machine() is used for synchronization between each online cpu? I mean that it looks like synchronization between cpus is unneeded because of how per cpu pagesets are used, so stop_machine() (which does provide synchronization between all cpus) is unnecessarily "strong".
Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()
On 04/08/2013 12:26 PM, KOSAKI Motohiro wrote: (4/8/13 1:32 PM), Cody P Schafer wrote: On 04/07/2013 08:39 AM, KOSAKI Motohiro wrote: (4/5/13 4:33 PM), Cody P Schafer wrote: No off-cpu users of the percpu pagesets exist. zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a percpu pageset based on a zone's ->managed_pages. We don't need to drain the entire percpu pageset just to modify these fields. Avoid calling setup_pageset() (and the draining required to call it) and instead just set the fields' values. This does change the behavior of zone_pcp_update() as the percpu pagesets will not be drained when zone_pcp_update() is called (they will end up being shrunk, not completely drained, later when a 0-order page is freed in free_hot_cold_page()). Signed-off-by: Cody P Schafer NAK. 1) zone_pcp_update() is only used from memory hotplug and it requires page drain. I'm looking at this code because I'm currently working on a patchset which adds another interface which modifies zone sizes, so "only used from memory hotplug" is a temporary thing (unless I discover that zone_pcp_update() is not intended to do what I want it to do). maybe yes, maybe no. I don't know whether it's temporary or not. However the fact is, you must not break anywhere. You need to look at all callers, always. Right, which is why I want to understand memory hotplug's actual requirements. 2) stop_machine() is used for avoiding race. just removing it is insane. What race? Is there a cross cpu access to ->high & ->batch that makes using on_each_cpu() instead of stop_machine() inappropriate? It is absolutely not just being removed. OK, I missed that. However, your code is still wrong: you can't call free_pcppages_bulk() from interrupt context, and then you can't use on_each_cpu() anyway. Given drain_pages()'s implementation, I find that hard to believe (it uses on_each_cpu_mask() and eventually calls free_pcppages_bulk()). Can you provide a reference backing up your statement? If this turns out to be an issue, schedule_on_each_cpu() could be an alternative.
Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.
On 04/08/2013 10:28 AM, Cody P Schafer wrote: On 04/08/2013 05:20 AM, Gilad Ben-Yossef wrote: On Fri, Apr 5, 2013 at 11:33 PM, Cody P Schafer wrote: In free_hot_cold_page(), we rely on pcp->batch remaining stable. Updating it without being on the cpu owning the percpu pageset potentially destroys this stability. Change for_each_cpu() to on_each_cpu() to fix. Are you referring to this? - This was the case I noticed. if (pcp->count >= pcp->high) { free_pcppages_bulk(zone, pcp->batch, pcp); pcp->count -= pcp->batch; } I'm probably missing the obvious but won't it be simpler to do this in free_hot_cold_page() - if (pcp->count >= pcp->high) { unsigned int batch = ACCESS_ONCE(pcp->batch); free_pcppages_bulk(zone, batch, pcp); pcp->count -= batch; } Potentially, yes. Note that this was simply the one case I noticed, rather than certainly the only case. I also wonder whether there could be unexpected interactions between ->high and ->batch not changing together atomically. For example, could adjusting this knob cause ->batch to rise enough that it is greater than the previous ->high? If the code above then runs with the previous ->high, ->count wouldn't be correct (checking this inside free_pcppages_bulk() might help on this one issue). Now the batch value used is stable and you don't have to IPI every CPU in the system just to change a config knob... Is this really considered an issue? I wouldn't have expected someone to adjust the config knob often enough (or even more than once) to cause problems. Of course as an "it'd be nice" thing, I completely agree. Would using schedule_on_each_cpu() instead of on_each_cpu() be an improvement, in your opinion?
Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch
On 04/08/2013 12:08 PM, KOSAKI Motohiro wrote: > (4/8/13 1:16 PM), Cody P Schafer wrote: >> On 04/07/2013 08:23 AM, KOSAKI Motohiro wrote: >>> (4/5/13 4:33 PM), Cody P Schafer wrote: >>>> In one case while modifying the ->high and ->batch fields of per cpu >>>> pagesets >>>> we're needlessly using stop_machine() (patches 1 & 2), and in another we >>>> don't have any >>>> synchronization at all (patch 3). >>>> >>>> This patchset fixes both of them. >>>> >>>> Note that it results in a change to the behavior of zone_pcp_update(), >>>> which is >>>> used by memory_hotplug. I _think_ that I've discerned (and preserved) the >>>> essential behavior (changing ->high and ->batch), and only eliminated >>>> unneeded >>>> actions (draining the per cpu pages), but this may not be the case. >>> >>> at least, memory hotplug needs to drain. >> >> Could you explain why the drain is required here? From what I can tell, >> after the stop_machine() completes, the per cpu page sets could be >> repopulated at any point, making the combination of draining and >> modifying ->batch & ->high unneeded. > > Then, memory hotplug tries to drain again and again. Moreover, hotplug prevents > repopulation > by using MIGRATE_ISOLATE. > The pcp page count would never reach 0, and that prevents memory hot remove. zone_pcp_update() is not part of the drain retry loop; in the offline pages case, it is only called following success in "removal" (online pages also calls zone_pcp_update(); I don't see how it could benefit in any way from draining the per cpu pagesets). So I still don't see where the need for draining pages combined with modifying ->high and ->batch came from.
Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()
On 04/08/2013 03:18 PM, KOSAKI Motohiro wrote: (4/8/13 3:49 PM), Cody P Schafer wrote: > If this turns out to be an issue, schedule_on_each_cpu() could be an alternative. no way. schedule_on_each_cpu() is more problematic and it should be removed in the future. schedule_on_each_cpu() can only be used when the caller task doesn't hold any lock; otherwise it may deadlock. I wasn't aware of that. Just to be clear, the deadlock you're referring to isn't the same one referred to in commit b71ab8c2025caef8db719aa41af0ed735dc543cd Author: Tejun Heo Date: Tue Jun 29 10:07:14 2010 +0200 workqueue: increase max_active of keventd and kill current_is_keventd() and commit 65a64464349883891e21e74af16c05d6e1eeb4e9 Author: Andi Kleen Date: Wed Oct 14 06:22:47 2009 +0200 HWPOISON: Allow schedule_on_each_cpu() from keventd If you're referencing some other deadlock, could you please provide a link to the relevant discussion? (I'd really like to add a note to schedule_on_each_cpu()'s doc comment about it so others can avoid that pitfall).
Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.
On 04/08/2013 11:06 PM, Gilad Ben-Yossef wrote: On Tue, Apr 9, 2013 at 9:03 AM, Gilad Ben-Yossef wrote: I also wonder whether there could be unexpected interactions between ->high and ->batch not changing together atomically. For example, could adjusting this knob cause ->batch to rise enough that it is greater than the previous ->high? If the code above then runs with the previous ->high, ->count wouldn't be correct (checking this inside free_pcppages_bulk() might help on this one issue). You are right, but that can be treated in setup_pagelist_highmark(), e.g.: static void setup_pagelist_highmark(struct per_cpu_pageset *p, unsigned long high) { struct per_cpu_pages *pcp; unsigned int batch; pcp = &p->pcp; /* We're about to mess with PCP in a non-atomic fashion. Put an intermediate safe value of batch and make sure it is visible before any other change */ pcp->batch = 1UL; smp_mb(); pcp->high = high; and I think I missed another needed barrier here: smp_mb(); batch = max(1UL, high/4); if ((high/4) > (PAGE_SHIFT * 8)) batch = PAGE_SHIFT * 8; pcp->batch = batch; } Yep, that appears to work, provided no additional users of ->batch and ->high show up. It seems we'll also need some locking to prevent concurrent updaters, but that is relatively lightweight. I'll roll up a new patchset that uses this methodology.
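Collecting the protocol into one helper, as the v3 series below ends up doing ("consolidate ->batch & ->high update protocol into a single function"), might look roughly like this user-space sketch (toy barrier and type; the kernel side would use smp_wmb() on struct per_cpu_pages, with the update-side mutex held):

struct toy_pcp {
	unsigned long count, high, batch;
};

/* toy write barrier; the kernel uses smp_wmb() */
#define toy_wmb() __sync_synchronize()

/*
 * Safe ->high/->batch transition: batch is pinned at 1 while high
 * moves, so a reader can never pair a large batch with a stale high.
 * Concurrent updaters are assumed to be serialized by a mutex.
 */
static void toy_pageset_update(struct toy_pcp *pcp,
			       unsigned long high, unsigned long batch)
{
	pcp->batch = 1;		/* safe intermediate value */
	toy_wmb();

	pcp->high = high;	/* new high becomes visible while batch == 1 */
	toy_wmb();

	pcp->batch = batch;	/* expose the real batch value last */
}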
[PATCH v2 00/10] mm: fixup changers of per cpu pageset's ->high and ->batch
"Problems" with the current code: 1. there is a lack of synchronization in setting ->high and ->batch in percpu_pagelist_fraction_sysctl_handler() 2. stop_machine() in zone_pcp_update() is unnecissary. 3. zone_pcp_update() does not consider the case where percpu_pagelist_fraction is non-zero To fix: 1. add memory barriers, a safe ->batch value, and an update side mutex when updating ->high and ->batch 2. avoid draining pages in zone_pcp_update(), rely upon the memory barriers added to fix #1 3. factor out quite a few functions, and then call the appropriate one. Note that it results in a change to the behavior of zone_pcp_update(), which is used by memory_hotplug. I'm rather certain that I've diserned (and preserved) the essential behavior (changing ->high and ->batch), and only eliminated unneeded actions (draining the per cpu pages), but this may not be the case. Further note that the draining of pages that previously took place in zone_pcp_update() occured after repeated draining when attempting to offline a page, and after the offline has "succeeded". It appears that the draining was added to zone_pcp_update() to avoid refactoring setup_pageset() into 2 funtions. -- Changes since v1: - instead of using on_each_cpu(), use memory barriers (Gilad) and an update side mutex. - add "Problem" #3 above, and fix. - rename function to match naming style of similar function - move unrelated comment Cody P Schafer (10): mm/page_alloc: factor out setting of pcp->high and pcp->batch. mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high mm/page_alloc: insert memory barriers to allow async update of pcp batch and high mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine() mm/page_alloc: when handling percpu_pagelist_fraction, don't unneedly recalulate high mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch() mm/page_alloc: relocate comment to be directly above code it refers to. mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset() mm/page_alloc: in zone_pcp_update(), uze zone_pageset_init() mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch() mm/page_alloc.c | 124 +--- 1 file changed, 73 insertions(+), 51 deletions(-) -- 1.8.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 01/10] mm/page_alloc: factor out setting of pcp->high and pcp->batch.
Creates pageset_set_batch() for use in setup_pageset(). pageset_set_batch() imitates the functionality of setup_pagelist_highmark(), but uses the boot time (percpu_pagelist_fraction == 0) calculations for determining ->high based on ->batch. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8fcced7..5877cf0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4004,6 +4004,14 @@ static int __meminit zone_batchsize(struct zone *zone) #endif } +/* a companion to setup_pagelist_highmark() */ +static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) +{ + struct per_cpu_pages *pcp = &p->pcp; + pcp->high = 6 * batch; + pcp->batch = max(1UL, 1 * batch); +} + static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) { struct per_cpu_pages *pcp; @@ -4013,8 +4021,7 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) pcp = &p->pcp; pcp->count = 0; - pcp->high = 6 * batch; - pcp->batch = max(1UL, 1 * batch); + pageset_set_batch(p, batch); for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) INIT_LIST_HEAD(&pcp->lists[migratetype]); } @@ -4023,7 +4030,6 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist * to the value high for the pageset p. */ - static void setup_pagelist_highmark(struct per_cpu_pageset *p, unsigned long high) { -- 1.8.2
[PATCH v2 07/10] mm/page_alloc: relocate comment to be directly above code it refers to.
Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a2f2207..6e52e67 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3680,12 +3680,12 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone) mminit_verify_zonelist(); cpuset_init_current_mems_allowed(); } else { - /* we have to stop all cpus to guarantee there is no user - of zonelist */ #ifdef CONFIG_MEMORY_HOTPLUG if (zone) setup_zone_pageset(zone); #endif + /* we have to stop all cpus to guarantee there is no user + of zonelist */ stop_machine(__build_all_zonelists, pgdat, NULL); /* cpuset refresh routine should be here */ } -- 1.8.2
[PATCH v2 10/10] mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch()
Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 334387e..66b8bc2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4018,7 +4018,7 @@ static void pageset_update_prep(struct per_cpu_pages *pcp) smp_wmb(); } -/* a companion to setup_pagelist_highmark() */ +/* a companion to pageset_set_high() */ static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) { struct per_cpu_pages *pcp = &p->pcp; @@ -4050,10 +4050,10 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) } /* - * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist + * pageset_set_high() sets the high water mark for hot per_cpu_pagelist * to the value high for the pageset p. */ -static void setup_pagelist_highmark(struct per_cpu_pageset *p, +static void pageset_set_high(struct per_cpu_pageset *p, unsigned long high) { struct per_cpu_pages *pcp; @@ -4075,7 +4075,7 @@ static void __meminit zone_pageset_init(struct zone *zone, int cpu) pageset_init(pcp); if (percpu_pagelist_fraction) - setup_pagelist_highmark(pcp, + pageset_set_high(pcp, (zone->managed_pages / percpu_pagelist_fraction)); else @@ -5526,8 +5526,8 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, unsigned long high; high = zone->managed_pages / percpu_pagelist_fraction; for_each_possible_cpu(cpu) - setup_pagelist_highmark( - per_cpu_ptr(zone->pageset, cpu), high); + pageset_set_high(per_cpu_ptr(zone->pageset, cpu), +high); } mutex_unlock(&pcp_batch_high_lock); return 0; -- 1.8.2
[PATCH v2 08/10] mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset()
Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 27 +++++++++++++++------------ 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6e52e67..c663e62 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4069,22 +4069,25 @@ static void setup_pagelist_highmark(struct per_cpu_pageset *p, pcp->batch = PAGE_SHIFT * 8; } +static void __meminit zone_pageset_init(struct zone *zone, int cpu) +{ + struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu); + + pageset_init(pcp); + if (percpu_pagelist_fraction) + setup_pagelist_highmark(pcp, + (zone->managed_pages / + percpu_pagelist_fraction)); + else + pageset_set_batch(pcp, zone_batchsize(zone)); +} + static void __meminit setup_zone_pageset(struct zone *zone) { int cpu; - zone->pageset = alloc_percpu(struct per_cpu_pageset); - - for_each_possible_cpu(cpu) { - struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu); - - setup_pageset(pcp, zone_batchsize(zone)); - - if (percpu_pagelist_fraction) - setup_pagelist_highmark(pcp, - (zone->managed_pages / - percpu_pagelist_fraction)); - } + for_each_possible_cpu(cpu) + zone_pageset_init(zone, cpu); } /* -- 1.8.2
[PATCH v2 06/10] mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch()
Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 50a277a..a2f2207 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4030,7 +4030,7 @@ static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) pcp->batch = max(1UL, 1 * batch); } -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) +static void pageset_init(struct per_cpu_pageset *p) { struct per_cpu_pages *pcp; int migratetype; @@ -4039,11 +4039,16 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) pcp = &p->pcp; pcp->count = 0; - pageset_set_batch(p, batch); for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) INIT_LIST_HEAD(&pcp->lists[migratetype]); } +static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) +{ + pageset_init(p); + pageset_set_batch(p, batch); +} + /* * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist * to the value high for the pageset p. -- 1.8.2
[PATCH v2 09/10] mm/page_alloc: in zone_pcp_update(), use zone_pageset_init()
Previously, zone_pcp_update() called pageset_set_batch() directly, essentially assuming that percpu_pagelist_fraction == 0. Correct this by calling zone_pageset_init(), which chooses the appropriate ->batch and ->high calculations. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c663e62..334387e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6025,11 +6025,9 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages) void __meminit zone_pcp_update(struct zone *zone) { unsigned cpu; - unsigned long batch; mutex_lock(&pcp_batch_high_lock); - batch = zone_batchsize(zone); for_each_possible_cpu(cpu) - pageset_set_batch(per_cpu_ptr(zone->pageset, cpu), batch); + zone_pageset_init(zone, cpu); mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.2
[PATCH v2 05/10] mm/page_alloc: when handling percpu_pagelist_fraction, don't needlessly recalculate high
Simply moves calculation of the new 'high' value outside the for_each_possible_cpu() loop, as it does not depend on the cpu. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4a03c56..50a277a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5502,7 +5502,6 @@ int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write, * cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist * can have before it gets flushed back to buddy allocator. */ - int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { @@ -5516,12 +5515,11 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, mutex_lock(&pcp_batch_high_lock); for_each_populated_zone(zone) { - for_each_possible_cpu(cpu) { - unsigned long high; - high = zone->managed_pages / percpu_pagelist_fraction; + unsigned long high; + high = zone->managed_pages / percpu_pagelist_fraction; + for_each_possible_cpu(cpu) setup_pagelist_highmark( - per_cpu_ptr(zone->pageset, cpu), high); - } + per_cpu_ptr(zone->pageset, cpu), high); } mutex_unlock(&pcp_batch_high_lock); return 0; -- 1.8.2
[PATCH v2 02/10] mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high
Because we are going to rely upon a careful transition between old and new ->high and ->batch values using memory barriers and will remove stop_machine(), we need to prevent multiple updaters from interweaving their memory writes. Add a simple mutex to protect both update loops. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5877cf0..d259599 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -64,6 +64,9 @@ #include #include "internal.h" +/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ +static DEFINE_MUTEX(pcp_batch_high_lock); + #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID DEFINE_PER_CPU(int, numa_node); EXPORT_PER_CPU_SYMBOL(numa_node); @@ -5491,6 +5494,8 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, ret = proc_dointvec_minmax(table, write, buffer, length, ppos); if (!write || (ret < 0)) return ret; + + mutex_lock(&pcp_batch_high_lock); for_each_populated_zone(zone) { for_each_possible_cpu(cpu) { unsigned long high; @@ -5499,6 +5504,7 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, per_cpu_ptr(zone->pageset, cpu), high); } } + mutex_unlock(&pcp_batch_high_lock); return 0; } @@ -6012,7 +6018,9 @@ static int __meminit __zone_pcp_update(void *data) void __meminit zone_pcp_update(struct zone *zone) { + mutex_lock(&pcp_batch_high_lock); stop_machine(__zone_pcp_update, zone, NULL); + mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.2
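As an aside on what the mutex buys: without it, two updaters could interleave their stores and leave a (high, batch) pair that neither of them wrote. A user-space sketch of the usage pattern (a pthread mutex standing in for pcp_batch_high_lock; readers never take the lock, they are covered by the barrier protocol of the next patch):

#include <pthread.h>

struct toy_pcp {
	unsigned long high, batch;
};

/* stand-in for pcp_batch_high_lock: serializes updaters only */
static pthread_mutex_t toy_batch_high_lock = PTHREAD_MUTEX_INITIALIZER;

static void toy_update_high_batch(struct toy_pcp *pcp,
				  unsigned long high, unsigned long batch)
{
	pthread_mutex_lock(&toy_batch_high_lock);
	/*
	 * Both stores happen under the lock, so concurrent updaters
	 * cannot interweave and publish a mixed (high, batch) pair.
	 */
	pcp->high = high;
	pcp->batch = batch;
	pthread_mutex_unlock(&toy_batch_high_lock);
}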
[PATCH v2 03/10] mm/page_alloc: insert memory barriers to allow async update of pcp batch and high
In pageset_set_batch() and setup_pagelist_highmark(), ensure that batch is always set to a safe value (1) prior to updating high, and ensure that high is fully updated before setting the real value of batch. Suggested by Gilad Ben-Yossef in this thread: https://lkml.org/lkml/2013/4/9/23 Also reproduces his proposed comment. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d259599..a07bd4c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4007,11 +4007,26 @@ static int __meminit zone_batchsize(struct zone *zone) #endif } +static void pageset_update_prep(struct per_cpu_pages *pcp) +{ + /* +* We're about to mess with PCP in a non-atomic fashion. Put an +* intermediate safe value of batch and make sure it is visible before +* any other change +*/ + pcp->batch = 1; + smp_wmb(); +} + /* a companion to setup_pagelist_highmark() */ static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) { struct per_cpu_pages *pcp = &p->pcp; + pageset_update_prep(pcp); + pcp->high = 6 * batch; + smp_wmb(); + pcp->batch = max(1UL, 1 * batch); } @@ -4039,7 +4054,11 @@ static void setup_pagelist_highmark(struct per_cpu_pageset *p, struct per_cpu_pages *pcp; pcp = &p->pcp; + pageset_update_prep(pcp); + pcp->high = high; + smp_wmb(); + pcp->batch = max(1UL, high/4); if ((high/4) > (PAGE_SHIFT * 8)) pcp->batch = PAGE_SHIFT * 8; -- 1.8.2
[PATCH v2 04/10] mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine()
zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a percpu pageset based on a zone's ->managed_pages. We don't need to drain the entire percpu pageset just to modify these fields. This lets us avoid calling setup_pageset() (and the draining required to call it) and instead allows simply setting the fields' values (with some attention paid to memory barriers to prevent the relationship between ->batch and ->high from being thrown off). This does change the behavior of zone_pcp_update() as the percpu pagesets will not be drained when zone_pcp_update() is called (they will end up being shrunk, not completely drained, later when a 0-order page is freed in free_hot_cold_page()). Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 33 +++++++++------------------------ 1 file changed, 9 insertions(+), 24 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a07bd4c..4a03c56 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6012,33 +6012,18 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages) #endif #ifdef CONFIG_MEMORY_HOTPLUG -static int __meminit __zone_pcp_update(void *data) -{ - struct zone *zone = data; - int cpu; - unsigned long batch = zone_batchsize(zone), flags; - - for_each_possible_cpu(cpu) { - struct per_cpu_pageset *pset; - struct per_cpu_pages *pcp; - - pset = per_cpu_ptr(zone->pageset, cpu); - pcp = &pset->pcp; - - local_irq_save(flags); - if (pcp->count > 0) - free_pcppages_bulk(zone, pcp->count, pcp); - drain_zonestat(zone, pset); - setup_pageset(pset, batch); - local_irq_restore(flags); - } - return 0; -} - +/* + * The zone indicated has a new number of managed_pages; batch sizes and percpu + * page high values need to be recalculated. + */ void __meminit zone_pcp_update(struct zone *zone) { + unsigned cpu; + unsigned long batch; mutex_lock(&pcp_batch_high_lock); - stop_machine(__zone_pcp_update, zone, NULL); + batch = zone_batchsize(zone); + for_each_possible_cpu(cpu) + pageset_set_batch(per_cpu_ptr(zone->pageset, cpu), batch); mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.2
[PATCH v3 02/11] mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high
Because we are going to rely upon a careful transition between old and new ->high and ->batch values using memory barriers and will remove stop_machine(), we need to prevent multiple updaters from interweaving their memory writes. Add a simple mutex to protect both update loops. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5877cf0..d259599 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -64,6 +64,9 @@ #include #include "internal.h" +/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ +static DEFINE_MUTEX(pcp_batch_high_lock); + #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID DEFINE_PER_CPU(int, numa_node); EXPORT_PER_CPU_SYMBOL(numa_node); @@ -5491,6 +5494,8 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, ret = proc_dointvec_minmax(table, write, buffer, length, ppos); if (!write || (ret < 0)) return ret; + + mutex_lock(&pcp_batch_high_lock); for_each_populated_zone(zone) { for_each_possible_cpu(cpu) { unsigned long high; @@ -5499,6 +5504,7 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write, per_cpu_ptr(zone->pageset, cpu), high); } } + mutex_unlock(&pcp_batch_high_lock); return 0; } @@ -6012,7 +6018,9 @@ static int __meminit __zone_pcp_update(void *data) void __meminit zone_pcp_update(struct zone *zone) { + mutex_lock(&pcp_batch_high_lock); stop_machine(__zone_pcp_update, zone, NULL); + mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.2.1
[PATCH v3 00/11] mm: fixup changers of per cpu pageset's ->high and ->batch
"Problems" with the current code: 1. there is a lack of synchronization in setting ->high and ->batch in percpu_pagelist_fraction_sysctl_handler() 2. stop_machine() in zone_pcp_update() is unnecissary. 3. zone_pcp_update() does not consider the case where percpu_pagelist_fraction is non-zero To fix: 1. add memory barriers, a safe ->batch value, an update side mutex when updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users that expect a stable value. 2. avoid draining pages in zone_pcp_update(), rely upon the memory barriers added to fix #1 3. factor out quite a few functions, and then call the appropriate one. Note that it results in a change to the behavior of zone_pcp_update(), which is used by memory_hotplug. I'm rather certain that I've diserned (and preserved) the essential behavior (changing ->high and ->batch), and only eliminated unneeded actions (draining the per cpu pages), but this may not be the case. Further note that the draining of pages that previously took place in zone_pcp_update() occured after repeated draining when attempting to offline a page, and after the offline has "succeeded". It appears that the draining was added to zone_pcp_update() to avoid refactoring setup_pageset() into 2 funtions. -- Since v2: https://lkml.org/lkml/2013/4/9/718 - note ACCESS_ONCE() in fix #1 above. - consolidate ->batch & ->high update protocol into a single funtion (Gilad). - add missing ACCESS_ONCE() on ->batch Since v1: https://lkml.org/lkml/2013/4/5/444 - instead of using on_each_cpu(), use memory barriers (Gilad) and an update side mutex. - add "Problem" #3 above, and fix. - rename function to match naming style of similar function - move unrelated comment -- Cody P Schafer (11): mm/page_alloc: factor out setting of pcp->high and pcp->batch. mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high mm/page_alloc: insert memory barriers to allow async update of pcp batch and high mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine() mm/page_alloc: when handling percpu_pagelist_fraction, don't unneedly recalulate high mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch() mm/page_alloc: relocate comment to be directly above code it refers to. mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset() mm/page_alloc: in zone_pcp_update(), uze zone_pageset_init() mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch() mm/page_alloc.c | 151 +--- 1 file changed, 90 insertions(+), 61 deletions(-) -- 1.8.2.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 10/11] mm/page_alloc: in zone_pcp_update(), use zone_pageset_init()
Previously, zone_pcp_update() called pageset_set_batch() directly, essentially assuming that percpu_pagelist_fraction == 0. Correct this by calling zone_pageset_init(), which chooses the appropriate ->batch and ->high calculations. Signed-off-by: Cody P Schafer --- mm/page_alloc.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 749b6e1..5ee5ce9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6032,11 +6032,9 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages) void __meminit zone_pcp_update(struct zone *zone) { unsigned cpu; - unsigned long batch; mutex_lock(&pcp_batch_high_lock); - batch = zone_batchsize(zone); for_each_possible_cpu(cpu) - pageset_set_batch(per_cpu_ptr(zone->pageset, cpu), batch); + zone_pageset_init(zone, cpu); mutex_unlock(&pcp_batch_high_lock); } #endif -- 1.8.2.1