Re: [PATCH] nohz1: Documentation
On Wed, 20 Mar 2013, Paul E. McKenney wrote:

> > > Another approach is to offload RCU callback processing to "rcuo" kthreads
> > > using the CONFIG_RCU_NOCB_CPU=y. The specific CPUs to offload may be
> > > selected via several methods:

Why are there multiple rcuo threads? Would a single thread that may be able
to run on multiple cpus not be sufficient?

> > "Even though the SCHED_FIFO task is the only task running, because the
> > SCHED_OTHER tasks are queued on the CPU, it currently will not enter
> > adaptive tick mode."
>
> Again, good point!

Uggh. That will cause problems and did cause problems when I tried to use
nohz. The OS always has some SCHED_OTHER tasks around that become runnable
after a while (for example the vm statistics update or the notorious slab
scanning). As long as SCHED_FIFO is active and there is no other process in
the same scheduling class, the tick needs to be off.

I also wish that this would work with SCHED_OTHER if there is only a single
task at a certain renice value (-10?) and the rest is runnable at lower
priorities. Maybe in that case stop the tick for a longer period, then give
the lower priority tasks a chance to run, but then switch the tick off again.
Re: next-20130204 - bisected slab problem to "slab: Common constants for kmalloc boundaries"
On Mon, 4 Feb 2013, James Hogan wrote:

> I've hit boot problems in next-20130204 on Meta:

Meta is an arch that is not in the tree yet? How would I build for meta?

What are the values of

	MAX_ORDER
	PAGE_SHIFT
	ARCH_DMA_MINALIGN
	CONFIG_ZONE_DMA

?
Re: next-20130204 - bisected slab problem to "slab: Common constants for kmalloc boundaries"
On Mon, 4 Feb 2013, Stephen Warren wrote:

> Here, if defined(ARCH_DMA_MINALIGN), then KMALLOC_MIN_SIZE isn't
> relative-to/derived-from KMALLOC_SHIFT_LOW, so the two may become
> inconsistent.

Right. And kmalloc_index() will therefore return KMALLOC_SHIFT_LOW, which
will dereference a NULL pointer since only the later cache pointers are
populated. KMALLOC_SHIFT_LOW needs to be set correctly.

> > diff --git a/mm/slub.c b/mm/slub.c
> > index ba2ca53..d0f72ee 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2775,7 +2775,7 @@ init_kmem_cache_node(struct kmem_cache_node *n)
> >  static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
> >  {
> >  	BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
> > -			SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu));
> > +			KMALLOC_SHIFT_HIGH * sizeof(struct kmem_cache_cpu));
>
> Should that also be (KMALLOC_SHIFT_HIGH + 1)?

That is already a pretty fuzzy test. The number of kmem_cache_cpu structures
allocated is lower than KMALLOC_SHIFT_HIGH since several index positions will
not be occupied.
Re: [PATCH] slob: Check for NULL pointer before calling ctor()
On Tue, 5 Feb 2013, Steven Rostedt wrote:

> Ping?

Obviously correct.

Acked-by: Christoph Lameter
Re: next-20130204 - bisected slab problem to "slab: Common constants for kmalloc boundaries"
OK I was able to reproduce it by setting ARCH_DMA_MINALIGN in slab.h. This
patch fixes it here:


Subject: slab: Handle ARCH_DMA_MINALIGN correctly

A fixed KMALLOC_SHIFT_LOW does not work for arches with higher alignment
requirements.

Determine KMALLOC_SHIFT_LOW from ARCH_DMA_MINALIGN instead.

Signed-off-by: Christoph Lameter

Index: linux/include/linux/slab.h
===
--- linux.orig/include/linux/slab.h	2013-02-05 10:30:53.917724146 -0600
+++ linux/include/linux/slab.h	2013-02-05 10:31:01.181836707 -0600
@@ -133,6 +133,19 @@ void kfree(const void *);
 void kzfree(const void *);
 size_t ksize(const void *);
 
+/*
+ * Some archs want to perform DMA into kmalloc caches and need a guaranteed
+ * alignment larger than the alignment of a 64-bit integer.
+ * Setting ARCH_KMALLOC_MINALIGN in arch headers allows that.
+ */
+#if defined(ARCH_DMA_MINALIGN) && ARCH_DMA_MINALIGN > 8
+#define ARCH_KMALLOC_MINALIGN ARCH_DMA_MINALIGN
+#define KMALLOC_MIN_SIZE ARCH_DMA_MINALIGN
+#define KMALLOC_SHIFT_LOW ilog2(ARCH_DMA_MINALIGN)
+#else
+#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
+#endif
+
 #ifdef CONFIG_SLOB
 /*
  * Common fields provided in kmem_cache by all slab allocators
@@ -179,7 +192,9 @@ struct kmem_cache {
 #define KMALLOC_SHIFT_HIGH	((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
 				(MAX_ORDER + PAGE_SHIFT - 1) : 25)
 #define KMALLOC_SHIFT_MAX	KMALLOC_SHIFT_HIGH
+#ifndef KMALLOC_SHIFT_LOW
 #define KMALLOC_SHIFT_LOW	5
+#endif
 #else
 /*
  * SLUB allocates up to order 2 pages directly and otherwise
@@ -187,8 +202,10 @@ struct kmem_cache {
  */
 #define KMALLOC_SHIFT_HIGH	(PAGE_SHIFT + 1)
 #define KMALLOC_SHIFT_MAX	(MAX_ORDER + PAGE_SHIFT)
+#ifndef KMALLOC_SHIFT_LOW
 #define KMALLOC_SHIFT_LOW	3
 #endif
+#endif
 
 /* Maximum allocatable size */
 #define KMALLOC_MAX_SIZE	(1UL << KMALLOC_SHIFT_MAX)
@@ -200,9 +217,7 @@ struct kmem_cache {
 /*
  * Kmalloc subsystem.
  */
-#if defined(ARCH_DMA_MINALIGN) && ARCH_DMA_MINALIGN > 8
-#define KMALLOC_MIN_SIZE ARCH_DMA_MINALIGN
-#else
+#ifndef KMALLOC_MIN_SIZE
 #define KMALLOC_MIN_SIZE (1 << KMALLOC_SHIFT_LOW)
 #endif
 
@@ -285,17 +300,6 @@ static __always_inline int kmalloc_size(
 #endif /* !CONFIG_SLOB */
 
 /*
- * Some archs want to perform DMA into kmalloc caches and need a guaranteed
- * alignment larger than the alignment of a 64-bit integer.
- * Setting ARCH_KMALLOC_MINALIGN in arch headers allows that.
- */
-#ifdef ARCH_DMA_MINALIGN
-#define ARCH_KMALLOC_MINALIGN ARCH_DMA_MINALIGN
-#else
-#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
-#endif
-
-/*
  * Setting ARCH_SLAB_MINALIGN in arch headers allows a different alignment.
  * Intended for arches that get misalignment faults even for 64 bit integer
  * aligned buffers.
Re: next-20130204 - bisected slab problem to "slab: Common constants for kmalloc boundaries"
On Tue, 5 Feb 2013, James Hogan wrote:

> On 05/02/13 16:36, Christoph Lameter wrote:
> > OK I was able to reproduce it by setting ARCH_DMA_MINALIGN in slab.h. This
> > patch fixes it here:
> >
> >
> > Subject: slab: Handle ARCH_DMA_MINALIGN correctly
> >
> > A fixed KMALLOC_SHIFT_LOW does not work for arches with higher alignment
> > requirements.
> >
> > Determine KMALLOC_SHIFT_LOW from ARCH_DMA_MINALIGN instead.
> >
> > Signed-off-by: Christoph Lameter
>
> Thanks, your patch fixes it for me.

Ok I guess that implies a Tested-by:
Re: next-20130204 - bisected slab problem to "slab: Common constants for kmalloc boundaries"
On Tue, 5 Feb 2013, Stephen Warren wrote:

> > +/*
> > + * Some archs want to perform DMA into kmalloc caches and need a guaranteed
> > + * alignment larger than the alignment of a 64-bit integer.
> > + * Setting ARCH_KMALLOC_MINALIGN in arch headers allows that.
> > + */
> > +#if defined(ARCH_DMA_MINALIGN) && ARCH_DMA_MINALIGN > 8
> > +#define ARCH_KMALLOC_MINALIGN ARCH_DMA_MINALIGN
> > +#define KMALLOC_MIN_SIZE ARCH_DMA_MINALIGN
>
> I might be tempted to drop that #define of KMALLOC_MIN_SIZE ...

Initially I thought so too.

> > +#define KMALLOC_SHIFT_LOW ilog2(ARCH_DMA_MINALIGN)
> > +#else
> > +#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
> > +#endif
>
> > +#ifndef KMALLOC_MIN_SIZE
> >  #define KMALLOC_MIN_SIZE (1 << KMALLOC_SHIFT_LOW)
> >  #endif
>
> ... and simply drop the ifdef around that #define instead.

That is going to be one hell of a macro expansion.

> That way, KMALLOC_MIN_SIZE is always defined in one place, and derived
> from KMALLOC_SHIFT_LOW; the logic will just set KMALLOC_SHIFT_LOW based
> on the various conditions. This seems a little safer to me; fewer
> conditions and less code to update if anything changes.

Yeah, but then we do an ilog2() and then reverse it back to the original
number.
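A rough illustration of the round trip in question (example values only, not
from the thread): with an alignment of 128, deriving the minimum size from
the shift just reproduces 128, but through ilog2()'s large constant-folded
conditional expression.

#include <linux/log2.h>

/* Hypothetical example values; ilog2() comes from linux/log2.h. */
#define EXAMPLE_DMA_MINALIGN	128
#define EXAMPLE_SHIFT_LOW	ilog2(EXAMPLE_DMA_MINALIGN)	/* evaluates to 7 */
#define EXAMPLE_MIN_SIZE	(1 << EXAMPLE_SHIFT_LOW)	/* back to 128 */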
Re: [ANNOUNCE] 3.8-rc6-nohz4
On Thu, 7 Feb 2013, Ingo Molnar wrote:

> Agreed?

Yes, and please also change the texts in Kconfig to accurately describe what
happens to the timer tick.
Re: [ANNOUNCE] 3.8-rc6-nohz4
On Thu, 7 Feb 2013, Frederic Weisbecker wrote:

> Not with hrtick.

hrtick? Did we not already try that a couple of years back, and it turned out
that the overhead of constantly reprogramming a timer via the PCI bus was
causing too much of a performance regression?
Re: [ANNOUNCE] 3.8-rc6-nohz4
On Fri, 8 Feb 2013, Steven Rostedt wrote:

> On Fri, 2013-02-08 at 16:53 +0100, Frederic Weisbecker wrote:
> > 2013/2/7 Christoph Lameter :
> > > On Thu, 7 Feb 2013, Frederic Weisbecker wrote:
> > >
> > >> Not with hrtick.
> > >
> > > hrtick? Did we not already try that a couple of years back and it turned
> > > out that the overhead of constantly reprogramming a timer via the PCI bus
> > > was causing too much of a performance regression?
> >
> > Yeah Peter said that especially reprogramming the clock everytime we
> > call schedule() was killing the performances. Now may be on some
> > workloads, with the tick stopped, we can find some new results.
>
> I could imagine this being dynamic. If the system isn't very loaded, and
> the scheduler is giving lots of time slices to tasks, then perhaps it
> could switch to a reprogramming the clock based scheduling. Or maybe, we
> could switch to a "skip ticks" method. That is, instead of completely
> disabling the tick, make the tick go off every other time or less, and
> use the NO_HZ code to calculate the missed ticks.

Ok, that sounds good. Automatically reducing the HZ as much as possible would
also quiet down the OS and be beneficial for low latency tasks. We are
configuring the kernels here with the lowest HZ that the hardware allows in
order to reduce the number of events that impact the app.

The main problem is that the network stack becomes flaky at low HZ.
Re: [ANNOUNCE] 3.8-rc6-nohz4
On Fri, 8 Feb 2013, Clark Williams wrote:

> I was a little apprehensive when you started talking about multiple
> tasks in Adaptive NOHZ mode on a core but the more I started thinking
> about it, I realized that we might end up in a cooperative multitasking
> mode with no tick at all going. Multiple SCHED_FIFO threads could
> run until blocking and another would be picked. Depends on well
> behaved threads of course, so probably many cases of users shooting off
> some toes with this...
>
> Of course if you mix scheduling policies or have RT throttling turned
> on we'll need some sort of tick for preemption. But if we can keep the
> timer reprogramming down we may see some big wins for RT and HPC loads.

We could tune the (hr)timer tick to have the same interval as the time slice
interval for a process and make that constant for all processes on a hardware
thread?
Re: [PATCH v2 1/3] slub: correct to calculate num of acquired objects in get_partial_node()
On Tue, 2 Apr 2013, Pekka Enberg wrote:

> On Tue, Mar 19, 2013 at 7:10 AM, Joonsoo Kim wrote:
> > Could you pick up 1/3, 3/3?
> > These are already acked by Christoph.
> > 2/3 is same effect as Glauber's "slub: correctly bootstrap boot caches",
> > so should skip it.
>
> Applied, thanks!

Could you also put in

1. The fixes for the hotpath using preempt enable/disable that were
   discussed with the RT folks a couple of months ago.

2. The fixes from the slab next branch.
Re: [RT LATENCY] 249 microsecond latency caused by slub's unfreeze_partials() code.
On Tue, 2 Apr 2013, Joonsoo Kim wrote:

> We need one more fix for correctness.
> When available is assigned by put_cpu_partial, it doesn't count cpu slab's
> objects.
> Please reference my old patch.
>
> https://lkml.org/lkml/2013/1/21/64

Could you update your patch and submit it again?
RE: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
On Tue, 2 Apr 2013, Hugh Dickins wrote:

> I am strongly in favour of removing that limitation from
> __isolate_lru_page() (and the thread you pointed - thank you - shows Mel
> and Christoph were both in favour too); and note that there is no such
> restriction in the confusingly similar but different isolate_lru_page().

Well, the naming could be cleaned up.

The fundamental issue with migrating pages is that all references have to be
tracked and updated in a way that no reference can be followed to invalid or
stale page contents. If ramfs does not maintain separate pointers but only
relies on pointers already handled by the migration logic then migration is
fine.

> Some people do worry that migrating Mlocked pages would introduce the
> occasional possibility of a minor fault (with migration_entry_wait())
> on an Mlocked region which never faulted before. I tend to dismiss
> that worry, but maybe I'm wrong to do so: maybe there should be a
> tunable for realtimey people to set, to prohibit page migration from
> mlocked areas; but the default should be to allow it.

Could we have a different way of marking pages "pinned"? This is useful for
various subsystems (like RDMA and various graphics drivers, etc.) which need
to ensure that virtual address to physical address mappings stay the same for
a prolonged period of time. I think this use case is becoming more frequent
given that offload techniques have to be used these days to overcome the
limits on processor performance.

> The other reason it looks as if ramfs pages cannot be migrated, is
> that it does not set a suitable ->migratepage method, so would be
> handled by fallback_migrate_page(), whose PageDirty test will end
> up failing the migration with -EBUSY or -EINVAL - if I read it
> correctly.

These could be handled the same way that anonymous pages are handled.

> But until ramfs pages can be migrated, they should not be allocated
> with __GFP_MOVABLE. (I've been writing about the migratability of
> small pages: I expect you have the migratability of THPages in flux.)

I agree.
Re: [RT LATENCY] 249 microsecond latency caused by slub's unfreeze_partials() code.
On Thu, 4 Apr 2013, Joonsoo Kim wrote:

> Pekka alreay applied it.
> Do we need update?

Well, I thought the passing of the count via lru.next would be something
worthwhile to pick up.
Re: [RFC GIT PULL] nohz: Kconfig layout improvements
It seems that nohz still has no effect. 3.9-rc5 + patches. Affinity of init
is set to 0,1 so no tasks are running on cpu 9. The "latencytest" used here
is part of my lldiag-0.15 toolkit.

First test without any special kernel parameters. nohz off, right?

$ nice -5 taskset -c 9 latencytest
CPUs: Freq=2.90Ghz Processors=32 Cores=8 cacheline_size=64 Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
16775106 samples below 1000 nsec
13 involuntary context switches
1019 (0.00607411%) variances in 10.00 seconds: minimum 1.07us maximum 12.32us average 3.30us stddev 0.63us

HZ=100, so the 1019 variances are likely timer interrupts.

After nohz setup, /proc/cmdline:

BOOT_IMAGE=/vmlinuz-3.9.0-rc5+ root=/dev/mapper/vg01-root ro console=tty0 console=ttyS0,115200 idle=mwait rcu_nocb_poll rcu_nocbs=2-31 nohz_extended=2-31

$ nice -5 taskset -c 9 latencytest
CPUs: Freq=2.90Ghz Processors=32 Cores=8 cacheline_size=64 Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
16779362 samples below 1000 nsec
13 involuntary context switches
1037 (0.00617983%) variances in 10.00 seconds: minimum 1.00us maximum 10.61us average 3.30us stddev 0.98us

If I move the RCU threads off the cpu then I get a slightly better result:

$ nice -5 taskset -c 9 latencytest
CPUs: Freq=2.90Ghz Processors=32 Cores=8 cacheline_size=64 Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
16796039 samples below 1000 nsec
12 involuntary context switches
1020 (0.00607249%) variances in 10.00 seconds: minimum 1.00us maximum 11.58us average 2.77us stddev 0.55us

Why is the tick not stopping? How do I diagnose this? (I can start patching
the kernel again like last time, but isn't there a better way?)
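One way to confirm whether the tick actually stops on the isolated cpu (an
assumed diagnostic, not something mentioned in the original mail) is to watch
the local timer interrupt counts in /proc/interrupts while the test runs:

$ watch -d -n 1 'grep -E "^ *LOC" /proc/interrupts'

If the LOC column for cpu 9 keeps incrementing at roughly HZ, the tick is
still firing there.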
Re: [RFC GIT PULL] nohz: Kconfig layout improvements
On Thu, 4 Apr 2013, Gilad Ben-Yossef wrote:

> Here is the last version I posted over a year ago. You were CCed and
> provided very useful feedback:
>
> http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/01291.html

Ah. yes I remember now.

> Based on your feedback I re-spun them but never gotten around to port
> them to recent kernel and post them. The latest (yet unpublished
> versions) are here:
>
> https://github.com/gby/linux/commit/04e041327036772383c14ebdc450f522d782f264
> https://github.com/gby/linux/commit/f5b8ae815670a289af9ff22ad83da3295472e63c
>
> I'll try to resurrect them, but maybe they'll prove useful as a
> reference as they are in the meantime.

I think that would be very useful.
Re: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
On Fri, 5 Apr 2013, Minchan Kim wrote:

> > >> How about add a knob?
> > >
> > > Maybe, volunteering?
> >
> > Hi Minchan,
> >
> > I can be the volunteer, what I care is if add a knob make sense?
>
> Frankly sepaking, I'd like to avoid new knob but there might be
> some workloads suffered from mlocked page migration so we coudn't
> dismiss it. In such case, introducing the knob would be a solution
> with default enabling. If we don't have any report for a long time,
> we can remove the knob someday, IMHO.

No knob please. Instead, a new implementation for page pinning that avoids
the mlock crap:

1. It should be available for device drivers to pin their memory (they are
   now elevating the refcount, which means page migration has to see if it
   can account for all references before giving up, and it does that quite
   frequently). So there needs to be an in-kernel API, a syscall API as well
   as a command line one. Preferably as similar as possible.

2. A sane API for marking pages as mlocked. Maybe part of mmap()? I hate the
   command line tools and the APIs for doing that right now.

3. The reservation scheme for mlock via ulimit is broken. We have per process
   constraints only, it seems. If you start enough processes you can still
   make the kernel go OOM.

4. mlock semantics are prescribed by POSIX, which states that the page stays
   in memory. I think we should stay with that narrow definition for mlock.

5. Pinning could also mean that page faults on the page are to be avoided.
   COW could occur on fork and page table entries could be instantiated at
   mmap/fork time. Pinning could mean that minor/major faults will not occur
   on a page.
Re: [RT LATENCY] 249 microsecond latency caused by slub's unfreeze_partials() code.
On Fri, 5 Apr 2013, Joonsoo Kim wrote:

> Here goes a patch implementing Christoph's idea.
> Instead of updating my previous patch, I re-write this patch on top of
> your slab/next tree.

Acked-by: Christoph Lameter
Re: [PATCH 23/32] Generic dynamic per cpu refcounting
On Mon, 28 Jan 2013, Kent Overstreet wrote:

> > It goes down to how we allocate page tables. percpu depends on
> > vmalloc space allocation which in turn depends on page table
> > allocation which unfortunately assumes GFP_KERNEL and is spread all
> > across different architectures. Adding @gfp to it came up a couple
> > times but the cases weren't strong enough to push it all the way
> > through. There are some aspects that I like about forcing GFP_KERNEL
> > on all percpu allocations but if there are strong enough cases and
> > someone is willing enough to push it through, maybe.
>
> Ahh, thanks for explaining, was curious about that.

I think it is good not to allocate percpu memory in hot paths. Otherwise the
percpu allocator would become much more complex due to the locking
constraints of all those hot paths (we tried that in the slab allocators
once, which ended up in a multi-year issue with locking).

It is usually possible to allocate the percpu areas when the struct they
belong to is allocated.
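A minimal sketch of that pattern (hypothetical struct and field names, not
from the mail): the per-cpu area is allocated once, in sleepable GFP_KERNEL
context, when the owning structure is created, so the hot paths never touch
the percpu allocator.

#include <linux/percpu.h>
#include <linux/slab.h>

struct my_object {
	unsigned long __percpu *counters;	/* per-cpu data used in hot paths */
};

static struct my_object *my_object_create(void)
{
	struct my_object *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

	if (!obj)
		return NULL;

	/* Single percpu allocation at object creation time. */
	obj->counters = alloc_percpu(unsigned long);
	if (!obj->counters) {
		kfree(obj);
		return NULL;
	}
	return obj;
}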
Re: [GIT PULL] cputime: Full dynticks task/cputime accounting v7
On Mon, 28 Jan 2013, Frederic Weisbecker wrote:

> My last concern is the dependency on CONFIG_64BIT. We rely on cputime_t
> being u64 for reasonable nanosec granularity implementation. And therefore
> we need a single instruction fetch to read kernel cpustat for atomicity
> requirement against concurrent incrementation, which only 64 bit archs
> can provide.

Most x86 cpus support cmpxchg8b on 32 bit, which can be abused for 64 bit
reads (see cmpxchg64 in cmpxchg_32.h). Simply do a cmpxchg with zero and use
whatever it returns. A percpu version that uses the instruction is called
this_cpu_cmpxchg_double().

> There is just no emergency though as this new option depends on the context
> tracking subsystem that only x86-64 (and soon ppc64) implements yet. And
> this set is complex enough already. I think we can deal with that later.

Ok, then this may not be that useful.
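A sketch of the trick (illustrative only; the variable names are made up): a
cmpxchg64() with old == new == 0 never changes a nonzero counter but always
returns the current 64-bit value as a single atomic operation, which is
enough for a read of the counter on 32-bit x86.

#include <linux/types.h>
#include <asm/cmpxchg.h>

static u64 example_cpustat;	/* hypothetical 64-bit counter */

static u64 example_read_cpustat(void)
{
	/*
	 * If the counter is 0 this stores 0 again (no change); otherwise
	 * the compare fails and nothing is written. Either way the return
	 * value is the current counter, fetched atomically on 32 bit.
	 */
	return cmpxchg64(&example_cpustat, 0, 0);
}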
Re: [GIT PULL] cputime: Full dynticks task/cputime accounting v7
On Mon, 28 Jan 2013, Frederic Weisbecker wrote:

> 2013/1/28 Christoph Lameter :
> > On Mon, 28 Jan 2013, Frederic Weisbecker wrote:
> >
> >> My last concern is the dependency on CONFIG_64BIT. We rely on cputime_t
> >> being u64 for reasonable nanosec granularity implementation. And therefore
> >> we need a single instruction fetch to read kernel cpustat for atomicity
> >> requirement against concurrent incrementation, which only 64 bit archs
> >> can provide.
> >
> > Most x86 cpus support cmpxchg8b on 32bit which can be abused for
> > 64 bit reads (see cmpxchg64 in cmpxchg_32.h). Simply do a cmpxchg with
> > zero and use whatever it returns.
> >
> > A percpu version that uses the instruction is called
> > this_cpu_cmpxchg_double().
>
> Yeah but we need to be able to do remote read. In fact atomic_read()
> would do the trick.

Well yes, cmpxchg64() could do the same without the need for an atomic
variable.

> >> There is just no emergency though as this new option depends on the context
> >> tracking subsystem that only x86-64 (and soon ppc64) implements yet. And
> >> this set is complex enough already. I think we can deal with that later.
> >
> > Ok then this may not be that useful.
>
> What is not useful?

The information about the 64 bit reads that I posted.
Re: [PATCH v2] slub: correctly bootstrap boot caches
On Sat, 23 Feb 2013, JoonSoo Kim wrote:

> With flushing, deactivate_slab() occur and it has some overhead to
> deactivate objects.
> If my patch properly fix this situation, it is better to use mine
> which has no overhead.

Well, this occurs during boot and it is not that performance critical.
Re: [PATCH v2] mm: slab: Verify the nodeid passed to ____cache_alloc_node
On Mon, 25 Feb 2013, Rik van Riel wrote:

> On 02/25/2013 12:18 PM, Aaron Tomlin wrote:
>
> > mm: slab: Verify the nodeid passed to cache_alloc_node
> >
> > If the nodeid is > num_online_nodes() this can cause an
> > Oops and a panic(). The purpose of this patch is to assert
> > if this condition is true to aid debugging efforts rather
> > than some random NULL pointer dereference or page fault.
> >
> > Signed-off-by: Aaron Tomlin
>
> Reviewed-by: Rik van Riel

It may be helpful to cc the slab maintainers...
Re: [PATCH v2] slub: correctly bootstrap boot caches
On Wed, 27 Feb 2013, Glauber Costa wrote:

> You can apply this one as-is with Christoph's ACK.

Right.
Re: kernel BUG at mm/slub.c:3409, 3.8.0-rc7
The problem is that the subsystem attempted to call kfree() with a pointer
that was not obtained via a slab allocation.

On Sat, 16 Feb 2013, Denys Fedoryshchenko wrote:

> Hi
>
> Worked for a while on 3.8.0-rc7, generally it is fine, then suddenly laptop
> stopped responding to keyboard and mouse.
> Sure it can be memory corruption by some other module, but maybe not. Worth to
> report i guess.
> After reboot checked logs and found this:
>
> Feb 16 00:40:17 localhost kernel: [23260.079253] [ cut here ]
> Feb 16 00:40:17 localhost kernel: [23260.079257] kernel BUG at mm/slub.c:3409!
> Feb 16 00:40:17 localhost kernel: [23260.079259] invalid opcode: [#1] SMP
> Feb 16 00:40:17 localhost kernel: [23260.079262] Modules linked in: ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables tun bridge stp llc nouveau snd_hda_codec_hdmi coretemp kvm_intel snd_hda_codec_realtek uvcvideo videobuf2_vmalloc snd_hda_intel videobuf2_memops videobuf2_core kvm mxm_wmi wmi videodev hwmon ttm snd_hda_codec drm_kms_helper rtl8192se rtlwifi nvidiafb mei lpc_ich mfd_core i2c_i801 snd_hwdep
> Feb 16 00:40:17 localhost kernel: [23260.079295] CPU 0
> Feb 16 00:40:17 localhost kernel: [23260.079298] Pid: 3811, comm: kworker/0:1 Tainted: GW3.8.0-rc7-lap #1 TOSHIBA Satellite A665/NWQAA
> Feb 16 00:40:17 localhost kernel: [23260.079300] RIP: 0010:[] [] kfree+0x31/0xb1
> Feb 16 00:40:17 localhost kernel: [23260.079306] RSP: 0018:88012b02fd28 EFLAGS: 00010246
> Feb 16 00:40:17 localhost kernel: [23260.079308] RAX: 8000 RBX: 8801029fcb40 RCX: 0079b6df
> Feb 16 00:40:17 localhost kernel: [23260.079310] RDX: 8000 RSI: 88017d79f480 RDI: 8801029fcb40
> Feb 16 00:40:17 localhost kernel: [23260.079312] RBP: 88012b02fd48 R08: 00014b60 R09: 0001
> Feb 16 00:40:17 localhost kernel: [23260.079313] R10: 8801 R11: 0001 R12: ea00040a7f00
> Feb 16 00:40:17 localhost kernel: [23260.079315] R13: 8801bfc15e00 R14: 8801bfc0d380 R15: 88012b02fda8
> Feb 16 00:40:17 localhost kernel: [23260.079317] FS: () GS:8801bfc0() knlGS:
> Feb 16 00:40:17 localhost kernel: [23260.079319] CS: 0010 DS: ES: CR0: 8005003b
> Feb 16 00:40:17 localhost kernel: [23260.079321] CR2: 35ffe6f63008 CR3: 01a0c000 CR4: 07f0
> Feb 16 00:40:17 localhost kernel: [23260.079322] DR0: DR1: DR2:
> Feb 16 00:40:17 localhost kernel: [23260.079324] DR3: DR6: 0ff0 DR7: 0400
> Feb 16 00:40:17 localhost kernel: [23260.079326] Process kworker/0:1 (pid: 3811, threadinfo 88012b02e000, task 8801b030)
> Feb 16 00:40:17 localhost kernel: [23260.079327] Stack:
> Feb 16 00:40:17 localhost kernel: [23260.079329] 8801019fcb50 8801029fcb40 8801bfc15e00 8801bfc0d380
> Feb 16 00:40:17 localhost kernel: [23260.079332] 88012b02fd68 8125fe88 880165c0a600 8801019fcb50
> Feb 16 00:40:17 localhost kernel: [23260.079336] 88012b02fdf8 810441f0 81044183 88012b02ffd8
> Feb 16 00:40:17 localhost kernel: [23260.079339] Call Trace:
> Feb 16 00:40:17 localhost kernel: [23260.079344] [] acpi_os_execute_deferred+0x2a/0x2f
> Feb 16 00:40:17 localhost kernel: [23260.079348] [] process_one_work+0x1d8/0x2eb
> Feb 16 00:40:17 localhost kernel: [23260.079351] [] ? process_one_work+0x16b/0x2eb
> Feb 16 00:40:17 localhost kernel: [23260.079354] [] ? acpi_os_wait_events_complete+0x1e/0x1e
> Feb 16 00:40:17 localhost kernel: [23260.079357] [] worker_thread+0x13e/0x1c1
> Feb 16 00:40:17 localhost kernel: [23260.079360] [] ? manage_workers+0x250/0x250
> Feb 16 00:40:17 localhost kernel: [23260.079363] [] kthread+0xa5/0xad
> Feb 16 00:40:17 localhost kernel: [23260.079366] [] ? __init_kthread_worker+0x56/0x56
> Feb 16 00:40:17 localhost kernel: [23260.079370] [] ret_from_fork+0x7c/0xb0
> Feb 16 00:40:17 localhost kernel: [23260.079373] [] ? __init_kthread_worker+0x56/0x56
> Feb 16 00:40:17 localhost kernel: [23260.079374] Code: 89 e5 41 56 41 55 41 54 53 48 89 fb 0f 86 90 00 00 00 e8 74 e1 ff ff 49 89 c4 48 8b 00 a8 80 75 20 49 f7 04 24 00 c0 00 00 75 02 <0f> 0b 4c 89 e7 e8 44 d0 ff ff 4c 89 e7 89 c6 e8 7d 2d fd ff eb
> Feb 16 00:40:17 localhost kernel: [23260.079409] RIP [] kfree+0x31/0xb1
> Feb 16 00:40:17 localhost kernel: [23260.079412] RSP
> Feb 16 00:40:17 localhost kernel: [23260.079414] ---[ end trace bae1313833245122 ]---
> Feb 16 00:40:17 localhost kernel: [23260.079450] BUG: unable to handle kernel paging request at ffa8
> Feb 16 00:40:17 localhost kernel: [23260.079452] IP: [] kthread_data+0xb/0x11
> Feb 16 00:40
Re: slab: odd BUG on kzalloc
Maybe the result of free pointer corruption due to writing to an object after
it was freed. Please run again with slub_debug specified on the command line
to get detailed reports on how this came about.

On Sun, 17 Feb 2013, Sasha Levin wrote:

> Hi all,
>
> I was fuzzing with trinity inside a KVM tools guest, running latest -next kernel,
> and hit the following bug:
>
> [ 169.773688] BUG: unable to handle kernel NULL pointer dereference at 0001
> [ 169.774976] IP: [] memset+0x1f/0xb0
> [ 169.775989] PGD 93e02067 PUD ac1a2067 PMD 0
> [ 169.776898] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 169.777923] Dumping ftrace buffer:
> [ 169.778595](ftrace buffer empty)
> [ 169.779352] Modules linked in:
> [ 169.779996] CPU 0
> [ 169.780031] Pid: 13438, comm: trinity Tainted: GW 3.8.0-rc7-next-20130215-sasha-3-gea816fa #286
> [ 169.780031] RIP: 0010:[] [] memset+0x1f/0xb0
> [ 169.780031] RSP: 0018:8800aef19e00 EFLAGS: 00010206
> [ 169.780031] RAX: RBX: RCX: 0080
> [ 169.780031] RDX: RSI: RDI: 0001
> [ 169.780031] RBP: 8800aef19e68 R08: R09: 0001
> [ 169.780031] R10: 0001 R11: R12: 8800bb001b00
> [ 169.780031] R13: 8800bb001b00 R14: 0001 R15: 00537000
> [ 169.780031] FS: 7fb73581b700() GS:8800bb60() knlGS:
> [ 169.780031] CS: 0010 DS: ES: CR0: 80050033
> [ 169.780031] CR2: 0001 CR3: aed31000 CR4: 000406f0
> [ 169.780031] DR0: DR1: DR2:
> [ 169.780031] DR3: DR6: 0ff0 DR7: 0400
> [ 169.780031] Process trinity (pid: 13438, threadinfo 8800aef18000, task 8800aed7b000)
> [ 169.780031] Stack:
> [ 169.780031] 81269586 8800aef19e38 0280 81291a2e
> [ 169.780031] 80d0aa39d9e8 8800aef19e48 8800aa39d9e8
> [ 169.780031] 8800a65cc780 8800aa39d960 ffea
> [ 169.780031] Call Trace:
> [ 169.780031] [] ? kmem_cache_alloc_trace+0x176/0x330
> [ 169.780031] [] ? alloc_pipe_info+0x3e/0xa0
> [ 169.780031] [] alloc_pipe_info+0x3e/0xa0
> [ 169.780031] [] get_pipe_inode+0x36/0xe0
> [ 169.780031] [] create_pipe_files+0x23/0x140
> [ 169.780031] [] __do_pipe_flags+0x3d/0xe0
> [ 169.780031] [] sys_pipe2+0x1b/0xa0
> [ 169.780031] [] ? tracesys+0x7e/0xe6
> [ 169.780031] [] sys_pipe+0xb/0x10
> [ 169.780031] [] tracesys+0xe1/0xe6
> [ 169.780031] Code: 1e 44 88 1f c3 90 90 90 90 90 90 90 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 48 ab 89 d1 f3 aa 4c 89 c8 c3 66 66 66 90 66 66 66 90 66 66
> [ 169.780031] RIP [] memset+0x1f/0xb0
> [ 169.780031] RSP
> [ 169.780031] CR2: 0001
> [ 169.930103] ---[ end trace 4d135f3def21b4bd ]---
>
> The code translates to the following in fs/pipe.c:alloc_pipe_info :
>
> 	pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL);
> 	if (pipe) {
> 		pipe->bufs = kzalloc(sizeof(struct pipe_buffer) * PIPE_DEF_BUFFERS, GFP_KERNEL); <=== this
> 		if (pipe->bufs) {
> 			init_waitqueue_head(&pipe->wait);
>
> Thanks,
> Sasha
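For reference, the boot parameter suggested at the top of this reply can be
given with or without flags; a couple of assumed invocations (flag letters as
documented in Documentation/vm/slub.txt):

	slub_debug			# enable all debug options on all slab caches
	slub_debug=FZPU			# sanity checks, red zoning, poisoning
					# and user tracking on all caches
	slub_debug=FZPU,kmalloc-64	# the same, restricted to a single cache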
Re: [PATCH] slub: correctly bootstrap boot caches
On Fri, 22 Feb 2013, Glauber Costa wrote:

> Although not verified in practice, I also point out that it is not safe to scan
> the full list only when debugging is on in this case. As unlikely as it is, it
> is theoretically possible for the pages to be full. If they are, they will
> become unreachable. Aside from scanning the full list, we also need to make
> sure that the pages indeed sit in there: the easiest way to do it is to make
> sure the boot caches have the SLAB_STORE_USER debug flag set.

SLAB_STORE_USER typically increases the size of the managed object. It is not
available when slab debugging is not compiled in.

There is no list of full slabs maintained in the non-debug case, and if the
allocator is compiled without debug support the code to manage full lists
will not be present either.

Only one or two kmem_cache items are allocated in the bootstrap code, and so
far the size of the objects was significantly smaller than page size, so the
slab pages will be on the partial lists. Why are your slab management
structures so large that a page can no longer contain multiple objects?

If you have that issue then I would suggest finding another way to track the
early objects (you could f.e. determine the page addresses from the objects
and go through the slab pages like that).
Re: [PATCH] slub: correctly bootstrap boot caches
On Fri, 22 Feb 2013, Glauber Costa wrote:

> As I've mentioned in the description, the real bug is from partial slabs
> being temporarily in the cpu_slab during a recent allocation and
> therefore unreachable through the partial list.

The bootstrap code does not use cpu slabs but goes directly to the slab
pages. See early_kmem_cache_node_alloc().
Re: [PATCH] slub: correctly bootstrap boot caches
On Fri, 22 Feb 2013, Glauber Costa wrote:

> At this point, we are already slab_state == PARTIAL, while
> init_kmem_cache_nodes will only differentiate against slab_state == DOWN.

kmem_cache_node creation runs before PARTIAL and kmem_cache runs after. So
there would be 2 kmem_cache_node structures allocated. Ok, so that would use
cpu slabs and therefore remove pages from the partial list. Pushing that back
using the flushing should fix this. But I thought there was already code that
went through the cpu slabs to address this?
Re: [PATCH] slub: correctly bootstrap boot caches
On Fri, 22 Feb 2013, Glauber Costa wrote:

> On 02/22/2013 08:10 PM, Christoph Lameter wrote:
> > kmem_cache_node creation runs before PARTIAL and kmem_cache runs
> > after. So there would be 2 kmem_cache_node structures allocated. Ok so
> > that would use cpu slabs and therefore remove pages from the partial list.
> > Pushing that back using the flushing should fix this. But I thought there
> > was already code that went through the cpu slabs to address this?
>
> not in bootstrap(), which is quite primitive. (and should remain so)

Joonsoo Kim had a patch for this. I acked it a while back AFAICR.
Re: [PATCH v2] slub: correctly bootstrap boot caches
On Fri, 22 Feb 2013, Glauber Costa wrote:

> After we create a boot cache, we may allocate from it until it is bootstraped.
> This will move the page from the partial list to the cpu slab list. If this
> happens, the loop:

Acked-by: Christoph Lameter
Re: [PATCH v2] slub: correctly bootstrap boot caches
An earlier fix to this is available here:

https://patchwork.kernel.org/patch/1975301/

and

https://lkml.org/lkml/2013/1/15/55
Re: [PATCH v2] slub: correctly bootstrap boot caches
Argh. This one was the final version:

https://patchwork.kernel.org/patch/2009521/
Re: [PATCH v2] slub: correctly bootstrap boot caches
On Fri, 22 Feb 2013, Glauber Costa wrote:

> On 02/22/2013 09:01 PM, Christoph Lameter wrote:
> > Argh. This one was the final version:
> >
> > https://patchwork.kernel.org/patch/2009521/
>
> It seems it would work. It is all the same to me.
> Which one do you prefer?

Flushing seems to be simpler and less code.
Re: linux-next: build warning after merge of the final tree (slab tree related)
On Tue, 11 Sep 2012, Stephen Rothwell wrote:

> After merging the final tree, today's linux-next build (sparc64 defconfig)
> produced this warning:
>
> mm/slab.c:808:13: warning: '__slab_error' defined but not used
> [-Wunused-function]
>
> Introduced by commit 945cf2b6199b ("mm/sl[aou]b: Extract a common function
> for kmem_cache_destroy").

All uses of slab_error() are now guarded by DEBUG.


Subject: Slab: Only define slab_error for DEBUG

There is no use case left for slab builds without DEBUG.

Signed-off-by: Christoph Lameter

Index: linux/mm/slab.c
===
--- linux.orig/mm/slab.c	2012-09-11 14:44:56.304015235 -0500
+++ linux/mm/slab.c	2012-09-11 14:48:46.988948440 -0500
@@ -803,6 +803,7 @@ static void cache_estimate(unsigned long
 	*left_over = slab_size - nr_objs*buffer_size - mgmt_size;
 }
 
+#if DEBUG
 #define slab_error(cachep, msg) __slab_error(__func__, cachep, msg)
 
 static void __slab_error(const char *function, struct kmem_cache *cachep,
@@ -812,6 +813,7 @@ static void __slab_error(const char *fun
 		function, cachep->name, msg);
 	dump_stack();
 }
+#endif
 
 /*
  * By default on NUMA we use alien caches to stage the freeing of
Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
On Wed, 30 Jan 2008, Jack Steiner wrote:

> > Seems that we cannot rely on the invalidate_ranges for correctness at all?
> > We need to have invalidate_page() always. invalidate_range() is only an
> > optimization.
>
> I don't understand your point "an optimization". How would invalidate_range
> as currently defined be correctly used?

We are changing definitions. The original patch by Andrea calls
invalidate_page() for each pte that is cleared. So strictly you would not
need an invalidate_range().

> It _looks_ like it would work only if xpmem/gru/etc takes a refcnt on
> the page & drops it when invalidate_range is called. That may work (not sure)
> for xpmem but not for the GRU.

The refcount is not necessary if we adopt Andrea's approach of a callback on
the clearing of each pte. At that point the page is still guaranteed to
exist.

If we do the range_invalidate later (as in V3) then the page may have been
released (see sys_remap_file_pages() f.e.) before we zap the GRU ptes. So
there will be a time when the GRU may write to a page that has been freed and
used for another purpose. Taking a refcount on the page defers the free until
the range_invalidate runs.

I would prefer a solution that does not require taking refcounts (pins) for
establishing an external pte and for release (like what the GRU does).

If we could effectively determine that there are no external ptes in a range
then the invalidate_page() call may return immediately. Maybe it is then
effective to do these gazillions of invalidate_page() calls when a process
terminates or a remap is performed.
Re: [patch 1/6] mmu_notifier: Core code
On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

> > I think Andrea's original concept of the lock in the mmu_notifier_head
> > structure was the best. I agree with him that it should be a spinlock
> > instead of the rw_lock.
>
> BTW, I don't see the scalability concern with huge number of tasks:
> the lock is still in the mm, down_write(mm->mmap_sem); oneinstruction;
> up_write(mm->mmap_sem) is always going to scale worse than
> spin_lock(mm->somethingelse); oneinstruction;
> spin_unlock(mm->somethinglese).

If we put it elsewhere in the mm then we increase the size of the memory used
in the mm_struct.

> Furthermore if we go this route and we don't relay on implicit
> serialization of all the mmu notifier users against exit_mmap
> (i.e. the mmu notifier user must agree to stop calling
> mmu_notifier_register on a mm after the last mmput) the autodisarming
> feature will likely have to be removed or it can't possibly be safe to
> run mmu_notifier_unregister while mmu_notifier_release runs. With the
> auto-disarming feature, there is no way to safely know if
> mmu_notifier_unregister has to be called or not. I'm ok with removing
> the auto-disarming feature and to have as self-contained-as-possible
> locking. Then mmu_notifier_release can just become the
> invalidate_all_after and invalidate_all, invalidate_all_before.

Hmmm... exit_mmap is only called when the last reference is removed against
the mm, right? So no tasks are running anymore. No pages are left. Do we need
to serialize at all for mmu_notifier_release?
Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

> > -	void (*invalidate_range)(struct mmu_notifier *mn,
> > +	void (*invalidate_range_begin)(struct mmu_notifier *mn,
> >  				struct mm_struct *mm,
> > -				unsigned long start, unsigned long end,
> >  				int lock);
> > +
> > +	void (*invalidate_range_end)(struct mmu_notifier *mn,
> > +				struct mm_struct *mm,
> > +				unsigned long start, unsigned long end);
> >  };
>
> start/finish/begin/end/before/after? ;)

Well, let's pick one and then stick to it.

> I'd drop the 'int lock', you should skip the before/after if
> i_mmap_lock isn't null and offload it to the caller before taking the
> lock. At least for the "after" call that looks a few liner change,
> didn't figure out the "before" yet.

How would we offload that? Before the scan of the rmaps we do not have the
mm_struct. So we'd need another notifier_rmap_callback.

> Given the amount of changes that are going on in design terms to cover
> both XPMEM and GRE, can we split the minimal invalidate_page that
> provides an obviously safe and feature complete mmu notifier code for
> KVM, and merge that first patch that will cover KVM 100%, it will

The obvious solution does not scale. You will have a callback for every page,
and there may be a million of those if you have a 4GB process.

> made so that are extendible in backwards compatible way. I think
> invalidate_page inside ptep_clear_flush is the first fundamental block
> of the mmu notifiers. Then once the fundamental is in and obviously
> safe and feature complete for KVM, the rest can be added very easily
> with incremental patches as far as I can tell. That would be my
> preferred route ;)

We need to have a coherent notifier solution that works for multiple
scenarios. I think a working invalidate_range would also be required for KVM.
KVM and the GRU are very similar, so they should be able to use the same
mechanisms, and we need to properly document how that mechanism is safe.
Either both take a page refcount or none.
Re: [kvm-devel] [patch 1/6] mmu_notifier: Core code
On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

> > Hmmm... exit_mmap is only called when the last reference is removed
> > against the mm, right? So no tasks are running anymore. No pages are left.
> > Do we need to serialize at all for mmu_notifier_release?
>
> KVM sure doesn't need any locking there. I thought somebody had to
> possibly take a pin on the "mm_count" and pretend to call
> mmu_notifier_register at will until mmdrop was finally called, in a
> out of order fashion given mmu_notifier_release was implemented like
> if the list could change from under it. Note mmdrop != mmput. mmput
> and in turn mm_users is the serialization point if you prefer to drop
> all locking from _release. Nobody must ever attempt a mmu_notifier_*
> after calling mmput for that mm. That should be enough to be
> safe. I'm fine either ways...

exit_mmap (where we call invalidate_all() and release()) is called when
mm_users == 0:

void mmput(struct mm_struct *mm)
{
	might_sleep();

	if (atomic_dec_and_test(&mm->mm_users)) {
		exit_aio(mm);
		exit_mmap(mm);
		if (!list_empty(&mm->mmlist)) {
			spin_lock(&mmlist_lock);
			list_del(&mm->mmlist);
			spin_unlock(&mmlist_lock);
		}
		put_swap_token(mm);
		mmdrop(mm);
	}
}
EXPORT_SYMBOL_GPL(mmput);

So there is only a single thread executing at the time when invalidate_all()
is called from exit_mmap(). Then we drop the pages, and the page tables.
After the page tables we call the ->release method and then remove the vmas.

So even dropping off the mmu_notifier chain in invalidate_all() could be done
without an issue and without locking. Trouble is if other callbacks attempt
the same.

Do we need to support the removal from the mmu_notifier list in
invalidate_range()?
Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

> On Wed, Jan 30, 2008 at 04:01:31PM -0800, Christoph Lameter wrote:
> > How we offload that? Before the scan of the rmaps we do not have the
> > mmstruct. So we'd need another notifier_rmap_callback.
>
> My assumption is that that "int lock" exists just because
> unmap_mapping_range_vma exists. If I'm right then my suggestion was to
> move the invalidate_range after dropping the i_mmap_lock and not to
> invoke it inside zap_page_range.

There is still no pointer to the mm_struct available there, because pages of
a mapping may belong to multiple processes. So we need to add another rmap
method?

The same issue also occurs for unmap_hugepages().

> There's no reason why KVM should take any risk of corrupting memory
> due to a single missing mmu notifier, with not taking the
> refcount. get_user_pages will take it for us, so we have to pay the
> atomic-op anyway. It sure worth doing the atomic_dec inside the mmu
> notifier, and not immediately like this:

Well, the GRU uses follow_page() instead of get_user_pages(). Performance is
a major issue for the GRU.

> get_user_pages(pages)
> __free_page(pages[0])
>
> The idea is that what works for GRU, works for KVM too. So we do a
> single invalidate_page and clustered invalidate_pages, we add that,
> and then we make sure all places are covered so GRU will not
> kernel-crash, and KVM won't risk to run oom or to generate _userland_
> corruption.

Hmmm... Could we go to a scheme where we do not have to increase the page
count? Modifications of the page struct require dirtying a cache line, and it
seems that we do not need an increased page count if we have an
invalidate_range_start() that clears all the external references and stops
the establishment of new ones, and an invalidate_range_end() that reenables
new external references.

Then we do not need the frequent invalidate_page() calls. The typical case
would be anyway that invalidate_all() is called before anything else on exit.
invalidate_all() would remove all pages and disable the creation of new
references to the memory in the mm_struct.
Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
Patch to

1. Remove the sync on notifier_release. It must be called when only a single
   process remains.

2. Add invalidate_range_start/end. This should allow safe removal of ranges
   of external ptes without having to resort to a callback for every
   individual page. This must be able to nest, so the driver needs to keep a
   refcount of range invalidates and wait if the refcount != 0.

---
 include/linux/mmu_notifier.h |   11 +--
 mm/fremap.c                  |    3 ++-
 mm/hugetlb.c                 |    3 ++-
 mm/memory.c                  |   16 ++--
 mm/mmu_notifier.c            |    9 -
 5 files changed, 27 insertions(+), 15 deletions(-)

Index: linux-2.6/mm/mmu_notifier.c
===
--- linux-2.6.orig/mm/mmu_notifier.c	2008-01-30 17:58:48.0 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-01-30 18:00:26.0 -0800
@@ -13,23 +13,22 @@
 #include
 #include
 
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
 void mmu_notifier_release(struct mm_struct *mm)
 {
 	struct mmu_notifier *mn;
 	struct hlist_node *n, *t;
 
 	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
-		down_write(&mm->mmap_sem);
-		rcu_read_lock();
 		hlist_for_each_entry_safe_rcu(mn, n, t,
 					  &mm->mmu_notifier.head, hlist) {
 			hlist_del_rcu(&mn->hlist);
 			if (mn->ops->release)
 				mn->ops->release(mn, mm);
 		}
-		rcu_read_unlock();
-		up_write(&mm->mmap_sem);
-		synchronize_rcu();
 	}
 }

Index: linux-2.6/include/linux/mmu_notifier.h
===
--- linux-2.6.orig/include/linux/mmu_notifier.h	2008-01-30 17:58:48.0 -0800
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-30 18:00:26.0 -0800
@@ -67,15 +67,22 @@ struct mmu_notifier_ops {
 			     int dummy);
 
 	/*
+	 * invalidate_range_begin() and invalidate_range_end() are paired.
+	 *
+	 * invalidate_range_begin must clear all references in the range
+	 * and stop the establishment of new references.
+	 *
+	 * invalidate_range_end() reenables the establishment of references.
+	 *
 	 * lock indicates that the function is called under spinlock.
 	 */
 	void (*invalidate_range_begin)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
+				unsigned long start, unsigned long end,
 				int lock);
 
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
-				struct mm_struct *mm,
-				unsigned long start, unsigned long end);
+				struct mm_struct *mm);
 };
 
 struct mmu_rmap_notifier_ops;

Index: linux-2.6/mm/fremap.c
===
--- linux-2.6.orig/mm/fremap.c	2008-01-30 17:58:48.0 -0800
+++ linux-2.6/mm/fremap.c	2008-01-30 18:00:26.0 -0800
@@ -212,8 +212,9 @@ asmlinkage long sys_remap_file_pages(uns
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 
+	mmu_notifier(invalidate_range_start, mm, start, start + size, 0);
 	err = populate_range(mm, vma, start, size, pgoff);
-	mmu_notifier(invalidate_range, mm, start, start + size, 0);
+	mmu_notifier(invalidate_range_end, mm);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
 			downgrade_write(&mm->mmap_sem);

Index: linux-2.6/mm/hugetlb.c
===
--- linux-2.6.orig/mm/hugetlb.c	2008-01-30 17:58:48.0 -0800
+++ linux-2.6/mm/hugetlb.c	2008-01-30 18:00:26.0 -0800
@@ -744,6 +744,7 @@ void __unmap_hugepage_range(struct vm_ar
 	BUG_ON(start & ~HPAGE_MASK);
 	BUG_ON(end & ~HPAGE_MASK);
 
+	mmu_notifier(invalidate_range_start, mm, start, end, 1);
 	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += HPAGE_SIZE) {
 		ptep = huge_pte_offset(mm, address);
@@ -764,7 +765,7 @@ void __unmap_hugepage_range(struct vm_ar
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
-	mmu_notifier(invalidate_range, mm, start, end, 1);
+	mmu_notifier(invalidate_range_end, mm);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);

Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c	2008-01-30 17:58:48.0
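A driver-side sketch (hypothetical names, not part of the patch) of the
nesting rule stated in the changelog above: keep a count of active range
invalidates and make any path that wants to establish new external references
wait until the count drops back to zero.

#include <linux/wait.h>
#include <asm/atomic.h>

static atomic_t range_invalidate_count = ATOMIC_INIT(0);
static DECLARE_WAIT_QUEUE_HEAD(range_invalidate_wait);

static void drv_invalidate_range_begin(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end,
				       int lock)
{
	atomic_inc(&range_invalidate_count);
	/* Tear down all external references for [start, end) here. */
}

static void drv_invalidate_range_end(struct mmu_notifier *mn,
				     struct mm_struct *mm)
{
	if (atomic_dec_and_test(&range_invalidate_count))
		wake_up_all(&range_invalidate_wait);
}

/* Called before establishing a new external reference. */
static void drv_wait_for_invalidates(void)
{
	wait_event(range_invalidate_wait,
		   atomic_read(&range_invalidate_count) == 0);
}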
Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
On Wed, 30 Jan 2008, Robin Holt wrote: > > Well the GRU uses follow_page() instead of get_user_pages. Performance is > > a major issue for the GRU. > > Worse, the GRU takes its TLB faults from within an interrupt so we > use follow_page to prevent going to sleep. That said, I think we > could probably use follow_page() with FOLL_GET set to accomplish the > requirements of mmu_notifier invalidate_range call. Doesn't look too > promising for hugetlb pages. There may be no need to do so with the range_start/end scheme. The driver can have its own lock to make follow_page() safe. The lock needs to serialize the follow_page handler and the range_start/end calls as well as the invalidate_page callouts. I think that avoids the need for get_user_pages(). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
On Thu, 31 Jan 2008, Andrea Arcangeli wrote: > On Wed, Jan 30, 2008 at 06:08:14PM -0800, Christoph Lameter wrote: > > hlist_for_each_entry_safe_rcu(mn, n, t, > > > > &mm->mmu_notifier.head, hlist) { > > hlist_del_rcu(&mn->hlist); > > > _rcu can go away from both, if hlist_del_rcu can be called w/o locks. True. Is hlist_del_init() ok? That would allow the driver to check whether the mmu_notifier is still linked in by using !hlist_unhashed(). The driver then needs to properly initialize the mmu_notifier's hlist node with INIT_HLIST_NODE(). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
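A minimal sketch of that driver-side arrangement, assuming the hlist_del_init() variant of ->release; the gru_ctx structure and gru_* names are made up for illustration only:

#include <linux/list.h>
#include <linux/mmu_notifier.h>

struct gru_ctx {
	struct mmu_notifier mn;
	/* ... external TLB state ... */
};

static void gru_ctx_init(struct gru_ctx *ctx)
{
	/* Make hlist_unhashed() meaningful before the notifier is registered. */
	INIT_HLIST_NODE(&ctx->mn.hlist);
}

static int gru_ctx_registered(struct gru_ctx *ctx)
{
	/* Once ->release has done hlist_del_init(), this returns false. */
	return !hlist_unhashed(&ctx->mn.hlist);
}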
Re: [kvm-devel] mmu_notifier: invalidate_range_start with lock=1
One possible way that XPmem could deal with a call of invalidate_range_start with the lock flag set: Scan through the rmaps you have for ptes. If you find one then elevate the refcount of the corresponding page and mark in the rmaps that you have done so. Also make the remote ptes readonly. The increased refcount will prevent the freeing of the page. The page will be unmapped from the process and XPmem will retain the only reference. Then some shepherding process that you have anyway with XPmem can sometime later zap the remote ptes and free the pages. That would leave stale data visible on the remote side for a while. Would that be okay? This would only be used for truncate, which uses the unmap_mapping_range() call. So we are not in reclaim or other distress. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
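A rough sketch of that deferral idea, with entirely made-up XPMEM-side structures and helpers (the real XPMEM rmap layout is not shown anywhere in this thread):

#include <linux/mm.h>
#include <linux/list.h>
#include <linux/workqueue.h>

struct xpmem_remote_pte {		/* hypothetical per-remote-pte rmap entry */
	struct list_head list;
	unsigned long vaddr;
	struct page *page;
	int deferred;			/* set once we hold an extra page ref */
};

static void xpmem_make_remote_readonly(struct xpmem_remote_pte *r);	/* hypothetical, must not sleep */

/* Called from invalidate_range_start when the lock flag is set and we
 * therefore cannot wait for remote acknowledgements. */
static void xpmem_range_start_atomic(struct list_head *rmaps,
				     unsigned long start, unsigned long end,
				     struct work_struct *zap_work)
{
	struct xpmem_remote_pte *r;

	list_for_each_entry(r, rmaps, list) {
		if (r->vaddr < start || r->vaddr >= end)
			continue;
		get_page(r->page);		/* pin: freeing is now deferred */
		r->deferred = 1;
		xpmem_make_remote_readonly(r);
	}
	/* The shepherding thread later zaps the remote ptes and drops the
	 * deferred references with put_page(). */
	schedule_work(zap_work);
}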
Re: [PATCH] slub: fix shadowed variable sparse warnings
On Wed, 30 Jan 2008, Harvey Harrison wrote: > Signed-off-by: Harvey Harrison <[EMAIL PROTECTED]> > --- > mm/slub.c | 15 ++- > 1 files changed, 6 insertions(+), 9 deletions(-) > > diff --git a/mm/slub.c b/mm/slub.c > index 5cc4b7d..f9a20bf 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -3403,19 +3403,19 @@ static int list_locations(struct kmem_cache *s, char > *buf, > flush_all(s); > > for_each_node_state(node, N_NORMAL_MEMORY) { > - struct kmem_cache_node *n = get_node(s, node); > + struct kmem_cache_node *nd = get_node(s, node); > unsigned long flags; > struct page *page; > > - if (!atomic_long_read(&n->nr_slabs)) > + if (!atomic_long_read(&nd->nr_slabs)) > continue; > > - spin_lock_irqsave(&n->list_lock, flags); > - list_for_each_entry(page, &n->partial, lru) > + spin_lock_irqsave(&nd->list_lock, flags); > + list_for_each_entry(page, &nd->partial, lru) > process_slab(&t, s, page, alloc); > - list_for_each_entry(page, &n->full, lru) > + list_for_each_entry(page, &nd->full, lru) > process_slab(&t, s, page, alloc); > - spin_unlock_irqrestore(&n->list_lock, flags); > + spin_unlock_irqrestore(&nd->list_lock, flags); > } Could you rename the outer variable instead? That is a counter. So call this count or something. The n is used throughout to refer to kmem_cache_node structs. > for (i = 0; i < t.count; i++) { > @@ -3498,7 +3498,6 @@ static unsigned long slab_objects(struct kmem_cache *s, > > for_each_possible_cpu(cpu) { > struct page *page; > - int node; > struct kmem_cache_cpu *c = get_cpu_slab(s, cpu); > > if (!c) That is okay. > @@ -3510,8 +3509,6 @@ static unsigned long slab_objects(struct kmem_cache *s, > continue; > if (page) { > if (flags & SO_CPU) { > - int x = 0; > - > if (flags & SO_OBJECTS) > x = page->inuse; > else Ok too. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: system without RAM on node0 boot fail
x86 supports booting from a node without RAM? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 3/3] mmu_notifier: invalidate_page callbacks
Callbacks to remove individual pages as done in rmap code 3 types of callbacks are used: 1. invalidate_page mmu_notifier Called from the inner loop of rmap walks to invalidate pages. 2. invalidate_page mmu_rmap_notifier Called after the Linux rmap loop under PageLock to allow a device to scan its own rmaps and remove mappings. 3. mmu_notifier_age_page Called for the determination of the page referenced status. The callbacks occur after the Linux rmaps have been walked. A device driver does not have to support type 1 and 2 callbacks. One is sufficient. If we do not care about page referenced status then callback #3 can also be omitted. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Robin Holt <[EMAIL PROTECTED]> --- mm/rmap.c | 22 +++--- 1 file changed, 19 insertions(+), 3 deletions(-) Index: linux-2.6/mm/rmap.c === --- linux-2.6.orig/mm/rmap.c2008-01-30 20:03:03.0 -0800 +++ linux-2.6/mm/rmap.c 2008-01-30 20:17:22.0 -0800 @@ -49,6 +49,7 @@ #include #include #include +#include #include @@ -284,7 +285,8 @@ static int page_referenced_one(struct pa if (!pte) goto out; - if (ptep_clear_flush_young(vma, address, pte)) + if (ptep_clear_flush_young(vma, address, pte) | + mmu_notifier_age_page(mm, address)) referenced++; /* Pretend the page is referenced if the task has the @@ -434,6 +436,7 @@ static int page_mkclean_one(struct page flush_cache_page(vma, address, pte_pfn(*pte)); entry = ptep_clear_flush(vma, address, pte); + mmu_notifier(invalidate_page, mm, address); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -473,6 +476,10 @@ int page_mkclean(struct page *page) struct address_space *mapping = page_mapping(page); if (mapping) { ret = page_mkclean_file(mapping, page); + if (unlikely(PageExternalRmap(page))) { + mmu_rmap_notifier(invalidate_page, page); + ClearPageExternalRmap(page); + } if (page_test_dirty(page)) { page_clear_dirty(page); ret = 1; @@ -677,7 +684,8 @@ static int try_to_unmap_one(struct page * skipped over this mm) then we should reactivate it. */ if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young(vma, address, pte { + (ptep_clear_flush_young(vma, address, pte) | + mmu_notifier_age_page(mm, address { ret = SWAP_FAIL; goto out_unmap; } @@ -685,6 +693,7 @@ static int try_to_unmap_one(struct page /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); pteval = ptep_clear_flush(vma, address, pte); + mmu_notifier(invalidate_page, mm, address); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -809,12 +818,14 @@ static void try_to_unmap_cluster(unsigne page = vm_normal_page(vma, address, *pte); BUG_ON(!page || PageAnon(page)); - if (ptep_clear_flush_young(vma, address, pte)) + if (ptep_clear_flush_young(vma, address, pte) | + mmu_notifier_age_page(mm, address)) continue; /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush(vma, address, pte); + mmu_notifier(invalidate_page, mm, address); /* If nonlinear, store the file page offset in the pte. 
*/ if (page->index != linear_page_index(vma, address)) @@ -971,6 +982,11 @@ int try_to_unmap(struct page *page, int else ret = try_to_unmap_file(page, migration); + if (unlikely(PageExternalRmap(page))) { + mmu_rmap_notifier(invalidate_page, page); + ClearPageExternalRmap(page); + } + if (!page_mapped(page)) ret = SWAP_SUCCESS; return ret; -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 2/3] mmu_notifier: Callbacks to invalidate address ranges
The invalidation of address ranges in a mm_struct needs to be performed when pages are removed or permissions etc change. invalidate_range_begin/end() is frequently called with only mmap_sem held. If invalidate_range_begin() is called with locks held then we pass a flag into invalidate_range() to indicate that no sleeping is possible. In two cases we use invalidate_range_begin/end to invalidate single pages because the pair allows holding off new references (idea by Robin Holt). do_wp_page(): We hold off new references while update the pte. xip_unmap: We are not taking the PageLock so we cannot use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end stands in. Comments state that mmap_sem must be held for remap_pfn_range() but various drivers do not seem to do this. Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> Signed-off-by: Robin Holt <[EMAIL PROTECTED]> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- mm/filemap_xip.c |4 mm/fremap.c |3 +++ mm/hugetlb.c |3 +++ mm/memory.c | 15 +-- mm/mmap.c|2 ++ 5 files changed, 25 insertions(+), 2 deletions(-) Index: linux-2.6/mm/fremap.c === --- linux-2.6.orig/mm/fremap.c 2008-01-30 20:03:05.0 -0800 +++ linux-2.6/mm/fremap.c 2008-01-30 20:05:39.0 -0800 @@ -15,6 +15,7 @@ #include #include #include +#include #include #include @@ -211,7 +212,9 @@ asmlinkage long sys_remap_file_pages(uns spin_unlock(&mapping->i_mmap_lock); } + mmu_notifier(invalidate_range_begin, mm, start, start + size, 0); err = populate_range(mm, vma, start, size, pgoff); + mmu_notifier(invalidate_range_end, mm, 0); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); Index: linux-2.6/mm/memory.c === --- linux-2.6.orig/mm/memory.c 2008-01-30 20:03:05.0 -0800 +++ linux-2.6/mm/memory.c 2008-01-30 20:07:27.0 -0800 @@ -50,6 +50,7 @@ #include #include #include +#include #include #include @@ -883,13 +884,16 @@ unsigned long zap_page_range(struct vm_a struct mmu_gather *tlb; unsigned long end = address + size; unsigned long nr_accounted = 0; + int atomic = details ? 
(details->i_mmap_lock != 0) : 0; lru_add_drain(); tlb = tlb_gather_mmu(mm, 0); update_hiwater_rss(mm); + mmu_notifier(invalidate_range_begin, mm, address, end, atomic); end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); if (tlb) tlb_finish_mmu(tlb, address, end); + mmu_notifier(invalidate_range_end, mm, atomic); return end; } @@ -1318,7 +1322,7 @@ int remap_pfn_range(struct vm_area_struc { pgd_t *pgd; unsigned long next; - unsigned long end = addr + PAGE_ALIGN(size); + unsigned long start = addr, end = addr + PAGE_ALIGN(size); struct mm_struct *mm = vma->vm_mm; int err; @@ -1352,6 +1356,7 @@ int remap_pfn_range(struct vm_area_struc pfn -= addr >> PAGE_SHIFT; pgd = pgd_offset(mm, addr); flush_cache_range(vma, addr, end); + mmu_notifier(invalidate_range_begin, mm, start, end, 0); do { next = pgd_addr_end(addr, end); err = remap_pud_range(mm, pgd, addr, next, @@ -1359,6 +1364,7 @@ int remap_pfn_range(struct vm_area_struc if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier(invalidate_range_end, mm, 0); return err; } EXPORT_SYMBOL(remap_pfn_range); @@ -1442,10 +1448,11 @@ int apply_to_page_range(struct mm_struct { pgd_t *pgd; unsigned long next; - unsigned long end = addr + size; + unsigned long start = addr, end = addr + size; int err; BUG_ON(addr >= end); + mmu_notifier(invalidate_range_begin, mm, start, end, 0); pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); @@ -1453,6 +1460,7 @@ int apply_to_page_range(struct mm_struct if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier(invalidate_range_end, mm, 0); return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); @@ -1630,6 +1638,8 @@ gotten: goto oom; cow_user_page(new_page, old_page, address, vma); + mmu_notifier(invalidate_range_begin, mm, address, + address + PAGE_SIZE - 1, 0); /* * Re-check the pte - we dropped the lock */ @@ -1668,6 +1678,7 @@ gotten: page_cache_release(old_page); unlo
[patch 0/3] [RFC] MMU Notifiers V4
I hope this is finally a release that covers all the requirements. Locking description is at the top of the core patch. This is a patchset implementing MMU notifier callbacks based on Andrea's earlier work. These are needed if Linux pages are referenced from something else than tracked by the rmaps of the kernel. The known immediate users are KVM (establishes a refcount to the page. External references called spte) (Refcount seems to be not necessary) GRU (simple TLB shootdown without refcount. Has its own pagetable/tlb) XPmem (uses its own reverse mappings. Remote ptes, Needs to sleep when sending messages) XPmem could defer freeing pages if a callback with atomic=1 occurs. Pending: - Feedback from users of the callbacks for KVM, RDMA, XPmem and GRU (Early tests with the GRU were successful). Known issues: - RCU quiescent periods are required on registering notifiers to guarantee visibility to other processors. Andrea's mmu_notifier #4 -> RFC V1 - Merge subsystem rmap based with Linux rmap based approach - Move Linux rmap based notifiers out of macro - Try to account for what locks are held while the notifiers are called. - Develop a patch sequence that separates out the different types of hooks so that we can review their use. - Avoid adding include to linux/mm_types.h - Integrate RCU logic suggested by Peter. V1->V2: - Improve RCU support - Use mmap_sem for mmu_notifier register / unregister - Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we already have invalidate_range() callbacks there. - Clean compile for !MMU_NOTIFIER - Isolate filemap_xip strangeness into its own diff - Pass a the flag to invalidate_range to indicate if a spinlock is held. - Add invalidate_all() V2->V3: - Further RCU fixes - Fixes from Andrea to fixup aging and move invalidate_range() in do_wp_page and sys_remap_file_pages() after the pte clearing. V3->V4: - Drop locking and synchronize_rcu() on ->release since we know on release that we are the only executing thread. This is also true for invalidate_all() so we could drop off the mmu_notifier there early. Use hlist_del_init instead of hlist_del_rcu. - Do the invalidation as begin/end pairs with the requirement that the driver holds off new references in between. - Fixup filemap_xip.c - Figure out a potential way in which XPmem can deal with locks that are held. - Robin's patches to make the mmu_notifier logic manage the PageRmapExported bit. - Strip cc list down a bit. - Drop Peters new rcu list macro - Add description to the core patch -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 1/3] mmu_notifier: Core code
Notifier functions for hardware and software that establishes external references to pages of a Linux system. The notifier calls ensure that external mappings are removed when the Linux VM removes memory ranges or individual pages from a process. These fall into two classes: 1. mmu_notifier These are callbacks registered with an mm_struct. If pages are removed from an address space then callbacks are performed. Spinlocks must be held in order to walk reverse maps. The invalidate_page() callbacks are performed with spinlocks are held. The invalidate_range_start/end callbacks can be performed in contexts where sleeping is allowed or in atomic contexts. A flag is passed to indicate an atomic context. 2. mmu_rmap_notifier Callbacks for subsystems that provide their own rmaps. These need to walk their own rmaps for a page. The invalidate_page callback is outside of locks so that we are not in a strictly atomic context (but we may be in a PF_MEMALLOC context if the notifier is called from reclaim code) and are able to sleep. Rmap notifiers need an extra page bit and are only available on 64 bit platforms. Pages must be marked dirty if dirty bits are found to be set in the external ptes. Requirements on synchronization within the driver: Multiple invalidate_range_begin/ends may be nested or called concurrently. That is legit. However, no new external references may be established as long as any invalidate_xxx is running or any invalidate_range_begin() and has not been completed through a corresponding call to invalidate_range_end(). Locking within the notifier needs to serialize events correspondingly. If all invalidate_xx notifier calls take a driver lock then it is possible to run follow_page() under the same lock. The lock can then guarantee that no page is removed and provides an additional existence guarantee of the page. invalidate_range_begin() must clear all references in the range and stop the establishment of new references. invalidate_range_end() reenables the establishment of references. atomic indicates that the function is called in an atomic context. We can sleep if atomic == 0. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> --- include/linux/mm_types.h |6 + include/linux/mmu_notifier.h | 248 +++ include/linux/page-flags.h | 11 + kernel/fork.c|2 mm/Kconfig |4 mm/Makefile |1 mm/mmap.c|3 mm/mmu_notifier.c| 118 8 files changed, 393 insertions(+) Index: linux-2.6/include/linux/mm_types.h === --- linux-2.6.orig/include/linux/mm_types.h 2008-01-30 19:49:32.0 -0800 +++ linux-2.6/include/linux/mm_types.h 2008-01-30 19:49:34.0 -0800 @@ -153,6 +153,10 @@ struct vm_area_struct { #endif }; +struct mmu_notifier_head { + struct hlist_head head; +}; + struct mm_struct { struct vm_area_struct * mmap; /* list of VMAs */ struct rb_root mm_rb; @@ -219,6 +223,8 @@ struct mm_struct { /* aio bits */ rwlock_tioctx_list_lock; struct kioctx *ioctx_list; + + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */ }; #endif /* _LINUX_MM_TYPES_H */ Index: linux-2.6/include/linux/mmu_notifier.h === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6/include/linux/mmu_notifier.h 2008-01-30 20:25:43.0 -0800 @@ -0,0 +1,248 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +/* + * MMU motifier + * + * Notifier functions for hardware and software that establishes external + * references to pages of a Linux system. 
The notifier calls ensure that + * external mappings are removed when the Linux VM removes memory ranges + * or individual pages from a process. + * + * These fall into two classes + * + * 1. mmu_notifier + * + * These are callbacks registered with an mm_struct. If mappings are + * removed from an address space then callbacks are performed. + * + * Spinlocks must be held in order to walk reverse maps. The + * invalidate_page notifications are performed with spinlocks are held. + * + * The invalidate_range_start/end callbacks can be performed in contexts + * where sleeping is allowed or in atomic contexts. A flag is passed + * to indicate an atomic context. + * + * + * 2. mmu_rmap_notifier + * + * Callbacks for subsystems that provide their own rmaps. These + * need to walk their own rmaps for a page. The invalidate_page + * callback is outside of locks so that we
Re: [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges
On Thu, 31 Jan 2008, Andrea Arcangeli wrote: > On Wed, Jan 30, 2008 at 08:57:52PM -0800, Christoph Lameter wrote: > > @@ -211,7 +212,9 @@ asmlinkage long sys_remap_file_pages(uns > > spin_unlock(&mapping->i_mmap_lock); > > } > > > > + mmu_notifier(invalidate_range_begin, mm, start, start + size, 0); > > err = populate_range(mm, vma, start, size, pgoff); > > + mmu_notifier(invalidate_range_end, mm, 0); > > if (!err && !(flags & MAP_NONBLOCK)) { > > if (unlikely(has_write_lock)) { > > downgrade_write(&mm->mmap_sem); > > This can't be enough for GRU, infact it can't work for KVM either. You > got 1) to have some invalidate_page for GRU before freeing the page, > and 2) to pass start, end to range_end (if you want kvm to use it > instead of invalidate_page). The external references are dropped when calling invalidate_range_begin. This would work both for the KVM and the GRU. Why would KVM not be able to invalidate the range before? Locking conventions is that no additional external reference can be added between invalidate_range_begin and invalidate_range_end. So KVM is fine too. > mremap still missing as a whole. mremap uses do_munmap which calls into unmap_region() that already has callbacks. So what is wrong there? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mmu notifiers #v5
On Thu, 31 Jan 2008, Andrea Arcangeli wrote: > My suggestion is to add the invalidate_range_start/end incrementally > with this, and to keep all the xpmem mmu notifiers in a separate > incremental patch (those are going to require many more changes to > perfect). They've very different things. GRU is simpler, will require > less changes and it should be taken care of sooner than XPMEM. KVM > requirements are a subset of GRU thanks to the page pin so I can > ignore KVM requirements as a whole and I only focus on GRU for the > time being. KVM requires get_user_pages. This makes them currently different. > Invalidates inside PT lock will avoid the page faults to happen in > parallel of my invalidates, no dependency on the page pin, mremap You are aware that the pt lock is split for systems with >4 CPUS? You can use the pte_lock only to serialize access to individual ptes. > pagefault against the main linux page fault, given we already have all > needed serialization out of the PT lock. XPMEM is forced to do that pt lock cannot serialize with invalidate_range since it is split. A range requires locking for a series of ptes not only individual ones. > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h > --- a/include/asm-generic/pgtable.h > +++ b/include/asm-generic/pgtable.h > @@ -46,6 +46,7 @@ > __young = ptep_test_and_clear_young(__vma, __address, __ptep); \ > if (__young)\ > flush_tlb_page(__vma, __address); \ > + __young |= mmu_notifier_age_page((__vma)->vm_mm, __address);\ > __young;\ > }) > #endif That may be okay. Have you checked all the arches that can provide their own implementation of this macro? This is only going to work on arches that use the generic implementation. > @@ -86,6 +87,7 @@ do { > \ > pte_t __pte;\ > __pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep); \ > flush_tlb_page(__vma, __address); \ > + mmu_notifier(invalidate_page, (__vma)->vm_mm, __address); \ > __pte; \ > }) > #endif This will require a callback on every(!) removal of a pte. A range invalidate does not do any good since the callbacks are performed anyways. Probably needlessly. In addition you have the same issues with arches providing their own macro here. > diff --git a/include/asm-s390/pgtable.h b/include/asm-s390/pgtable.h > --- a/include/asm-s390/pgtable.h > +++ b/include/asm-s390/pgtable.h > @@ -712,6 +712,7 @@ static inline pte_t ptep_clear_flush(str > { > pte_t pte = *ptep; > ptep_invalidate(address, ptep); > + mmu_notifier(invalidate_page, vma->vm_mm, address); > return pte; > } > Ahh you found an additional arch. How about x86 code? There is one override of these functions there as well. > + /* > + * invalidate_page[s] is called in atomic context > + * after any pte has been updated and before > + * dropping the PT lock required to update any Linux pte. > + * Once the PT lock will be released the pte will have its > + * final value to export through the secondary MMU. > + * Before this is invoked any secondary MMU is still ok > + * to read/write to the page previously pointed by the > + * Linux pte because the old page hasn't been freed yet. > + * If required set_page_dirty has to be called internally > + * to this method. > + */ > + void (*invalidate_page)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long address); > + void (*invalidate_pages)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long start, unsigned long end); What is the point of invalidate_pages? 
It cannot be serialized properly and you do the invalidate_page() calls regardless. Is this some sort of optimization? > +struct mmu_notifier_head {}; > + > +#define mmu_notifier_register(mn, mm) do {} while(0) > +#define mmu_notifier_unregister(mn, mm) do {} while (0) > +#define mmu_notifier_release(mm) do {} while (0) > +#define mmu_notifier_age_page(mm, address) ({ 0; }) > +#define mmu_notifier_head_init(mmh) do {} while (0) Macros. We want functions there to be able to validate the parameters even if !CONFIG_MMU_NOTIFIER. > + > +/* > + * Notifiers that use the parameters that they were passed so that the > + * compiler does not complain about unused variables but does proper > + * parameter checks even if !CONFIG_MMU_NOTIFIER. > + * Macros generate no code. > + */ > +#define mmu_notifier(function, mm, args...) \ >
mmu_notifier: close hole in fork
Talking to Robin and Jack we found taht we still have a hole during fork. Fork may set a pte writeprotect. At that point the remote pte are not marked readonly(!). Remote writes may occur to pages that are marked readonly locally without this patch. mmu_notifier: Provide invalidate_range on fork On fork we change ptes in cow mappings to readonly. This means we must invalidate the ptes so that they are reestablished later with proper permission. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- mm/memory.c |6 ++ 1 file changed, 6 insertions(+) Index: linux-2.6/mm/memory.c === --- linux-2.6.orig/mm/memory.c 2008-01-31 13:42:35.0 -0800 +++ linux-2.6/mm/memory.c 2008-01-31 13:47:31.0 -0800 @@ -602,6 +602,9 @@ int copy_page_range(struct mm_struct *ds if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0); + dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { @@ -612,6 +615,9 @@ int copy_page_range(struct mm_struct *ds vma, addr, next)) return -ENOMEM; } while (dst_pgd++, src_pgd++, addr = next, addr != end); + + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier(invalidate_range_end, src_mm, 0); return 0; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Patch] include/asm-generic/tlb.h: fix a missing header
On Thu, 31 Jan 2008, WANG Cong wrote: > index 6ce9f3a..4ebbe15 100644 > --- a/include/asm-generic/tlb.h > +++ b/include/asm-generic/tlb.h > @@ -16,6 +16,7 @@ > #include > #include > #include > +#include Please also remove the #include . It should have been part of a patch reversal. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
mmu_notifier: Move mmu_notifier_release up to get rid of the invalidate_all() callback
mmu_notifier: Move mmu_notifier_release up to get rid of invalidate_all() It seems that it is safe to call mmu_notifier_release() before we tear down the pages and the vmas since we are the only executing thread. mmu_notifier_release can then also tear down all its external ptes and thus we can get rid of the invalidate_all() callback. During the final teardown no mmu_notifier calls are registered anymore which will speed up exit processing. Is this okay for KVM too? Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- include/linux/mmu_notifier.h |4 mm/mmap.c|3 +-- 2 files changed, 1 insertion(+), 6 deletions(-) Index: linux-2.6/include/linux/mmu_notifier.h === --- linux-2.6.orig/include/linux/mmu_notifier.h 2008-01-31 14:17:17.0 -0800 +++ linux-2.6/include/linux/mmu_notifier.h 2008-01-31 14:17:28.0 -0800 @@ -59,10 +59,6 @@ struct mmu_notifier_ops { void (*release)(struct mmu_notifier *mn, struct mm_struct *mm); - /* Dummy needed because the mmu_notifier() macro requires it */ - void (*invalidate_all)(struct mmu_notifier *mn, struct mm_struct *mm, - int dummy); - /* * age_page is called from contexts where the pte_lock is held */ Index: linux-2.6/mm/mmap.c === --- linux-2.6.orig/mm/mmap.c2008-01-31 14:16:51.0 -0800 +++ linux-2.6/mm/mmap.c 2008-01-31 14:17:10.0 -0800 @@ -2036,7 +2036,7 @@ void exit_mmap(struct mm_struct *mm) unsigned long end; /* mm's last user has gone, and its about to be pulled down */ - mmu_notifier(invalidate_all, mm, 0); + mmu_notifier_release(mm); arch_exit_mmap(mm); lru_add_drain(); @@ -2048,7 +2048,6 @@ void exit_mmap(struct mm_struct *mm) vm_unacct_memory(nr_accounted); free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); - mmu_notifier_release(mm); /* * Walk the list again, actually closing and freeing it, -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: mmu_notifier: reduce size of mm_struct if !CONFIG_MMU_NOTIFIER
mmu_notifier: Reduce size of mm_struct if !CONFIG_MMU_NOTIFIER Andrea and Peter had a concern about this. Use an #ifdef to make the mmu_notifier_head structure empty if we have no notifier. That allows the use of the structure in inline functions (which allows parameter verification even if !CONFIG_MMU_NOTIFIER). Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- include/linux/mm_types.h |2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6/include/linux/mm_types.h === --- linux-2.6.orig/include/linux/mm_types.h 2008-01-31 14:03:23.0 -0800 +++ linux-2.6/include/linux/mm_types.h 2008-01-31 14:03:38.0 -0800 @@ -154,7 +154,9 @@ struct vm_area_struct { }; struct mmu_notifier_head { +#ifdef CONFIG_MMU_NOTIFIER struct hlist_head head; +#endif }; struct mm_struct { -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mmu notifiers #v5
On Thu, 31 Jan 2008, Christoph Lameter wrote: > > pagefault against the main linux page fault, given we already have all > > needed serialization out of the PT lock. XPMEM is forced to do that > > pt lock cannot serialize with invalidate_range since it is split. A range > requires locking for a series of ptes not only individual ones. Hmmm.. May be okay after all. I see that you are only doing it on the pte level. This means the range callbacks are taking down a max of 512 entries. So you have a callback for each pmd. A callback for 2M of memory? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
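For reference, the arithmetic behind that figure, assuming x86-64 with 4 KiB pages: PTRS_PER_PTE = 512, so one pmd's worth of ptes covers 512 * 4096 bytes = 2 MiB of virtual address space; with 1024 ptes per page table on 32-bit non-PAE it is 1024 * 4096 bytes = 4 MiB, which matches the "2M(/4M on 32bit)" figure quoted later in the thread.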
Re: [PATCH] mmu notifiers #v5
On Fri, 1 Feb 2008, Andrea Arcangeli wrote: > I appreciate the review! I hope my entirely bug free and > strightforward #v5 will strongly increase the probability of getting > this in sooner than later. If something else it shows the approach I > prefer to cover GRU/KVM 100%, leaving the overkill mutex locking > requirements only to the mmu notifier users that can't deal with the > scalar and finegrined and already-taken/trashed PT lock. Mutex locking? Could you be more specific? I hope you will continue to do reviews of the other mmu notifier patchset? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mmu notifiers #v5
On Fri, 1 Feb 2008, Andrea Arcangeli wrote: > GRU. Thanks to the PT lock this remains a totally obviously safe > design and it requires zero additional locking anywhere (nor linux VM, > nor in the mmu notifier methods, nor in the KVM/GRU page fault). Na. I would not be so sure about having caught all the issues yet... > Sure you can do invalidate_range_start/end for more than 2M(/4M on > 32bit) max virtual ranges. But my approach that averages the fixed > mmu_lock cost already over 512(/1024) ptes will make any larger > "range" improvement not strongly measurable anymore given to do that > you have to add locking as well and _surely_ decrease the GRU > scalability with tons of threads and tons of cpus potentially making > GRU a lot slower _especially_ on your numa systems. The trouble is that the invalidates are much more expensive if you have to send these to remote partitions (XPmem). And it's really great if you can simply tear down everything. Certainly this is a significant improvement over the earlier approach but you still have the invalidate_page calls in ptep_clear_flush. So they fire needlessly? Serializing access in the device driver makes sense and comes with the additional possibility of not having to increment page counts all the time. So you trade one cacheline dirtying for many that are necessary if you always increment the page count. How does KVM ensure the consistency of the shadow page tables? Atomic ops? The GRU has no page table on its own. It populates TLB entries on demand using the linux page table. There is no way it can figure out when to drop page counts again. The invalidate calls are turned directly into tlb flushes. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: mmu_notifier: close hole in fork
On Fri, 1 Feb 2008, Andrea Arcangeli wrote: > Good catch! This was missing also in my #v5 (KVM doesn't need that > because the only possible cows on sptes can be generated by ksm, but > it would have been a problem for GRU). The more I think about it, the How do you think the GRU should know when to drop the refcount? There is no page table and thus no way of tracking that a refcount was taken. Without the refcount you cannot defer the freeing of the page. So shootdown on invalidate_range_begin and lock out until invalidate_range_end seems to be the only workable solution. BTW what do you think about adding a flag parameter to the invalidate calls that allows shooting down writable ptes only? That could be useful for COW and page_mkclean. So #define MMU_ATOMIC 1 #define MMU_WRITABLE 2 instead of the atomic parameter? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
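One way such a flags parameter could be consumed on the notifier side; this is only a sketch of the suggestion above, nothing like it was posted in the thread, and the drv_* helpers are invented:

#include <linux/mm_types.h>
#include <linux/mmu_notifier.h>

#define MMU_ATOMIC	1	/* callout runs in atomic context */
#define MMU_WRITABLE	2	/* only write access is being revoked */

static void drv_wrprotect_external_pte(struct mm_struct *mm, unsigned long address);	/* hypothetical */
static void drv_drop_external_pte(struct mm_struct *mm, unsigned long address);	/* hypothetical */
static void drv_wait_for_remote_acks(struct mm_struct *mm);				/* hypothetical, may sleep */

static void drv_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
				unsigned long address, int flags)
{
	if (flags & MMU_WRITABLE)
		/* COW/page_mkclean: the external pte only loses write access */
		drv_wrprotect_external_pte(mm, address);
	else
		drv_drop_external_pte(mm, address);

	if (!(flags & MMU_ATOMIC))
		drv_wait_for_remote_acks(mm);
}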
Re: mmu_notifier: Move mmu_notifier_release up to get rid of the invalidate_all() callback
On Fri, 1 Feb 2008, Andrea Arcangeli wrote: > On Thu, Jan 31, 2008 at 02:21:58PM -0800, Christoph Lameter wrote: > > Is this okay for KVM too? > > ->release isn't implemented at all in KVM, only the list_del generates > complications. Why would the list_del generate problems? > I think current code could be already safe through the mm_count pin, > becasue KVM relies on the fact anybody pinning through mm_count like > KVM does, is forbidden to call unregister and it's forced to wait the > auto-disarming when mm_users hits zero, but I feel like something's > still wrong if I think that I'm not using call_rcu to free the > notifier (OTOH we agreed the list had to be frozen and w/o readers > (modulo _release) before _release is called, so if this initial > assumption is ok it seems I may be safe w/o call_rcu?). You could pin via mm_users? Then it would be entirely safe and no need for rcu tricks? OTOH if there are mm_count users like in KVM: Could we guarantee that they do not perform any operations with the mmu notifier list? Then we would be safe as well. > too soon ;) so let's concentrate on the rest first. I can say > hlist_del_init doesn't seem to provide any benefit given nobody could > possibly decide to call register or unregister after _release run. It is useful if a device driver has a list of data segments that contain struct mmu_notifiers. The device driver can inspect the mmu_notifier and reliably conclude that the beast is inactive. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
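The mm_users suggestion, sketched out (this only illustrates the idea raised above, it is not code from any posted patch; the drv_ctx structure and names are invented):

#include <linux/sched.h>
#include <linux/mm_types.h>

struct drv_ctx {
	struct mm_struct *mm;	/* the address space our external ptes refer to */
};

static int drv_attach_mm(struct drv_ctx *ctx)
{
	/* Pin mm_users, not just mm_count: while this pin is held exit_mmap()
	 * and therefore mmu_notifier_release() cannot run. */
	if (!atomic_inc_not_zero(&current->mm->mm_users))
		return -EINVAL;		/* mm is already on its way out */
	ctx->mm = current->mm;
	return 0;
}

static void drv_detach_mm(struct drv_ctx *ctx)
{
	/* Unregister the notifier first (under mmap_sem), then drop the pin;
	 * per the suggestion above, ->release can then never race with us. */
	mmput(ctx->mm);
}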
mmu_notifier: invalidate_range for move_page_tables
mmu_notifier: Provide invalidate_range for move_page_tables Move page tables also needs to invalidate the external references and hold new references off while moving page table entries. This is already guaranteed by holding a writelock on mmap_sem for get_user_pages() but follow_page() is not subject to the mmap_sem locking. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- mm/mremap.c |4 1 file changed, 4 insertions(+) Index: linux-2.6/mm/mremap.c === --- linux-2.6.orig/mm/mremap.c 2008-01-25 14:25:31.0 -0800 +++ linux-2.6/mm/mremap.c 2008-01-31 17:54:19.0 -0800 @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -130,6 +131,8 @@ unsigned long move_page_tables(struct vm old_end = old_addr + len; flush_cache_range(vma, old_addr, old_end); + mmu_notifier(invalidate_range_begin, vma->vm_mm, + old_addr, old_addr + len, 0); for (; old_addr < old_end; old_addr += extent, new_addr += extent) { cond_resched(); next = (old_addr + PMD_SIZE) & PMD_MASK; @@ -150,6 +153,7 @@ unsigned long move_page_tables(struct vm move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma, new_pmd, new_addr); } + mmu_notifier(invalidate_range_end, vma->vm_mm, 0); return len + old_addr - old_end;/* how much done */ } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/3] mmu_notifier: Core code
On Thu, 31 Jan 2008, Robin Holt wrote: > Jack has repeatedly pointed out needing an unregister outside the > mmap_sem. I still don't see the benefit to not having the lock in the mm. I never understood why this would be needed. ->release removes the mmu_notifier right now. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/3] mmu_notifier: Core code
On Thu, 31 Jan 2008, Jack Steiner wrote: > Christoph, is it time to post a new series of patches? I've got > as many fixup patches as I have patches in the original posting. Maybe wait another day? This is getting a bit too frequent and so far we have only minor changes. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: mmu_notifier: invalidate_range for move_page_tables
On Thu, 31 Jan 2008, Robin Holt wrote: > On Thu, Jan 31, 2008 at 05:57:25PM -0800, Christoph Lameter wrote: > > Move page tables also needs to invalidate the external references > > and hold new references off while moving page table entries. > > I must admit to not having spent any time thinking about this, but aren't > we moving the entries from one set of page tables to the other, leaving > the pte_t entries unchanged. I guess I should go look, but could you > provide a quick pointer in the proper direction as to why we need to > recall externals when the before and after look of these page tables > will have the same information for the TLBs. remap changes the address of pages in a process. The pages appear at another address. Thus the external pte will have the wrong information if not invalidated. Do a man mremap -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mmu notifiers #v5
On Thu, 31 Jan 2008, Robin Holt wrote: > > Mutex locking? Could you be more specific? > > I think he is talking about the external locking that xpmem will need > to do to ensure we are not able to refault pages inside of regions that > are undergoing recall/page table clearing. At least that has been my > understanding to this point. Right, this has to be something like an rw spinlock. It's needed for both GRU/XPmem. Not sure about KVM. Take the read lock for invalidate operations. These can occur concurrently. (Or a simpler implementation for the GRU may just use a spinlock). The write lock must be held for populate operations. The lock can be refined as needed by the notifier driver. F.e. locking could be restricted to certain ranges. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
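A bare-bones sketch of that lock-mode split, shown with a rw_semaphore for the sleepable case rather than the rw spinlock mentioned above; parameter lists are simplified relative to the actual callback signatures, nesting of invalidates is glossed over, and all drv_* helpers are invented:

#include <linux/rwsem.h>
#include <linux/mm_types.h>

static DECLARE_RWSEM(drv_ref_sem);

static void drv_zap_external_range(struct mm_struct *mm, unsigned long start, unsigned long end);	/* hypothetical */
static void drv_insert_external_pte(struct mm_struct *mm, unsigned long vaddr, unsigned long paddr);	/* hypothetical */

/* Invalidates may run concurrently with each other: read side, held from
 * range_begin to range_end. */
static void drv_range_begin(struct mm_struct *mm, unsigned long start, unsigned long end)
{
	down_read(&drv_ref_sem);
	drv_zap_external_range(mm, start, end);
}

static void drv_range_end(struct mm_struct *mm)
{
	up_read(&drv_ref_sem);
}

/* Establishing a new external reference must be held off while any
 * invalidate is in flight: write side. */
static void drv_establish_reference(struct mm_struct *mm, unsigned long vaddr, unsigned long paddr)
{
	down_write(&drv_ref_sem);
	drv_insert_external_pte(mm, vaddr, paddr);
	up_write(&drv_ref_sem);
}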
Re: [patch 1/3] mmu_notifier: Core code
On Thu, 31 Jan 2008, Robin Holt wrote: > Both xpmem and GRU have means of removing their context seperate from > process termination. XPMEMs is by closing the fd, I believe GRU is > the same. In the case of XPMEM, we are able to acquire the mmap_sem. > For GRU, I don't think it is possible, but I do not remember the exact > reason. For any action initiated from user space you will not hold mmap sem. So you can call the unregister function. Then you need to do a synchronize_rcu before freeing the structures. It is also possible to shut this down outside via f.e. a control thread. The control thread can acquire mmap_sem and then unregister the notifier. Am I missing something? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/3] mmu_notifier: Core code
On Thu, 31 Jan 2008, Jack Steiner wrote: > I currently unlink the mmu_notifier when the last GRU mapping is closed. For > example, if a user does a: > > gru_create_context(); > ... > gru_destroy_context(); > > the mmu_notifier is unlinked and all task tables allocated > by the driver are freed. Are you suggesting that I leave tables > allocated until the task terminates?? You are in user space and calling into the kernel somehow. The mmap_sem is not held at that point so its no trouble to use the unregister function. After that wait for rcu and then free your tables. > I assumed that I would need to use call_rcu() or synchronize_rcu() > before the table is actually freed. That's still on my TODO list. Right. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
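Roughly, the teardown order being agreed on here, assuming the flavour of the patchset where the caller takes mmap_sem around register/unregister; the gru_* structure and helpers are invented for illustration:

#include <linux/mm_types.h>
#include <linux/mmu_notifier.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct gru_ctx {
	struct mmu_notifier mn;
	struct mm_struct *mm;
	/* ... GRU task tables ... */
};

static void gru_free_tables(struct gru_ctx *ctx);	/* hypothetical */

static void gru_destroy_context(struct gru_ctx *ctx)
{
	struct mm_struct *mm = ctx->mm;

	/* Called from an ioctl/close path, so mmap_sem is not yet held. */
	down_write(&mm->mmap_sem);
	mmu_notifier_unregister(&ctx->mn, mm);
	up_write(&mm->mmap_sem);

	/* Wait out any CPU still walking the notifier list under RCU
	 * before freeing anything it might dereference. */
	synchronize_rcu();

	gru_free_tables(ctx);
	kfree(ctx);
}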
Re: [patch 1/3] mmu_notifier: Core code
On Thu, 31 Jan 2008, Robin Holt wrote: > > + void (*invalidate_range_end)(struct mmu_notifier *mn, > > +struct mm_struct *mm, int atomic); > > I think we need to pass in the same start-end here as well. Without it, > the first invalidate_range would have to block faulting for all addresses > and would need to remain blocked until the last invalidate_range has > completed. While this would work, (and will probably be how we implement > it for the short term), it is far from ideal. Ok. Andrea wanted the same because then he can avoid the begin callouts. The problem is that you would have to track the start-end address, right? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges
On Thu, 31 Jan 2008, Robin Holt wrote: > > Index: linux-2.6/mm/memory.c > ... > > @@ -1668,6 +1678,7 @@ gotten: > > page_cache_release(old_page); > > unlock: > > pte_unmap_unlock(page_table, ptl); > > + mmu_notifier(invalidate_range_end, mm, 0); > > I think we can get an _end call without the _begin call before it. If that would be true then also the pte would have been left locked. We always hit unlock. Maybe I just do not see it? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 0/4] [RFC] EMMU Notifiers V5
This is a patchset implementing MMU notifier callbacks based on Andrea's earlier work. These are needed if Linux pages are referenced from something else than tracked by the rmaps of the kernel (an external MMU). The known immediate users are KVM - Establishes a refcount to the page via get_user_pages(). - External references are called spte. - Has page tables to track pages whose refcount was elevated(?) but no reverse maps. GRU - Simple additional hardware TLB (possibly covering multiple instances of Linux) - Needs TLB shootdown when the VM unmaps pages. - Determines page address via follow_page (from interrupt context) but can fall back to get_user_pages(). - No page reference possible since no page status is kept.. XPmem - Allows use of a processes memory by remote instances of Linux. - Provides its own reverse mappings to track remote pte. - Established refcounts on the exported pages. - Must sleep in order to wait for remote acks of ptes that are being cleared. Known issues: - RCU quiescent periods are required on registering notifiers to guarantee visibility to other processors. Andrea's mmu_notifier #4 -> RFC V1 - Merge subsystem rmap based with Linux rmap based approach - Move Linux rmap based notifiers out of macro - Try to account for what locks are held while the notifiers are called. - Develop a patch sequence that separates out the different types of hooks so that we can review their use. - Avoid adding include to linux/mm_types.h - Integrate RCU logic suggested by Peter. V1->V2: - Improve RCU support - Use mmap_sem for mmu_notifier register / unregister - Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we already have invalidate_range() callbacks there. - Clean compile for !MMU_NOTIFIER - Isolate filemap_xip strangeness into its own diff - Pass a the flag to invalidate_range to indicate if a spinlock is held. - Add invalidate_all() V2->V3: - Further RCU fixes - Fixes from Andrea to fixup aging and move invalidate_range() in do_wp_page and sys_remap_file_pages() after the pte clearing. V3->V4: - Drop locking and synchronize_rcu() on ->release since we know on release that we are the only executing thread. This is also true for invalidate_all() so we could drop off the mmu_notifier there early. Use hlist_del_init instead of hlist_del_rcu. - Do the invalidation as begin/end pairs with the requirement that the driver holds off new references in between. - Fixup filemap_xip.c - Figure out a potential way in which XPmem can deal with locks that are held. - Robin's patches to make the mmu_notifier logic manage the PageRmapExported bit. - Strip cc list down a bit. - Drop Peters new rcu list macro - Add description to the core patch V4->V5: - Provide missing callouts for mremap. - Provide missing callouts for copy_page_range. - Reduce mm_struct space to zero if !MMU_NOTIFIER by #ifdeffing out structure contents. - Get rid of the invalidate_all() callback by moving ->release in place of invalidate_all. - Require holding mmap_sem on register/unregister instead of acquiring it ourselves. In some contexts where we want to register/unregister we are already holding mmap_sem. - Split out the rmap support patch so that there is no need to apply all patches for KVM and GRU. -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 1/4] mmu_notifier: Core code
Notifier functions for hardware and software that establishes external references to pages of a Linux system. The notifier calls ensure that external mappings are removed when the Linux VM removes memory ranges or individual pages from a process. This first portion is fitting for external mmu's that do not have their own rmap or need the ability to sleep before removing individual pages. Two categories of external mmus are possible: 1. KVM style external mmus that have their own page table. These are capable of tracking pages in their page tables and can therefore increase the refcount on pages. An increased refcount guarantees page existence regardless of the vms unmapping actions until the logic in the notifier call decides to drop a page. 2. GRU style external mmus that rely on the Linux page table for TLB lookups. These cannot track pages that are externally references. TLB entries can only be evicted as necessary. Callbacks are registered with an mm_struct from a device drivers using mmu_notifier_register. When the VM removes pages (or restricts permissions on pages) then callbacks are triggered The VM holds spinlocks in order to walk reverse maps in rmap.c. The single page callback invalidate_page() is therefore always run with spinlocks held (which limits what can be done in the callbacks). The invalidate_range_start/end callbacks can be run in atomic as well as sleepable contexts. A flag is passed to indicate an atomic context. The notifier may decide to defer actions if the context is atomic. Pages must be marked dirty if dirty bits are found to be set in the external ptes. Requirements on synchronization within the driver: Multiple invalidate_range_begin/ends may be nested or called concurrently. That is legit. However, no new external references may be established as long as any invalidate_xxx is running or as long as any invalidate_range_begin() and has not been completed through a corresponding call to invalidate_range_end(). Locking within the notifier callbacks needs to serialize events correspondingly. One simple implementation would be the use of a spinlock that needs to be acquired for access to the page table or tlb managed by the driver. A rw lock could be used to allow multiplel concurrent invalidates to run but then the driver needs to have additional internal synchronization for access to hardware resources. If all invalidate_xx notifier calls take the driver lock then it is possible to run follow_page() under the same lock. The lock can then guarantee that no page is removed and provides an additional existence guarantee of the page independent of the page count. invalidate_range_begin() must clear all references in the range and stop the establishment of new references. invalidate_range_end() reenables the establishment of references. The atomic paramater passed to invalidatge_range_xx indicates that the function is called in an atomic context. We can sleep if atomic == 0. 
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> --- include/linux/mm_types.h |8 + include/linux/mmu_notifier.h | 179 +++ kernel/fork.c|2 mm/Kconfig |4 mm/Makefile |1 mm/mmap.c|2 mm/mmu_notifier.c| 76 ++ 7 files changed, 272 insertions(+) Index: linux-2.6/include/linux/mm_types.h === --- linux-2.6.orig/include/linux/mm_types.h 2008-01-31 19:55:46.0 -0800 +++ linux-2.6/include/linux/mm_types.h 2008-01-31 19:59:51.0 -0800 @@ -153,6 +153,12 @@ struct vm_area_struct { #endif }; +struct mmu_notifier_head { +#ifdef CONFIG_MMU_NOTIFIER + struct hlist_head head; +#endif +}; + struct mm_struct { struct vm_area_struct * mmap; /* list of VMAs */ struct rb_root mm_rb; @@ -219,6 +225,8 @@ struct mm_struct { /* aio bits */ rwlock_tioctx_list_lock; struct kioctx *ioctx_list; + + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */ }; #endif /* _LINUX_MM_TYPES_H */ Index: linux-2.6/include/linux/mmu_notifier.h === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6/include/linux/mmu_notifier.h 2008-01-31 20:56:03.0 -0800 @@ -0,0 +1,179 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +/* + * MMU motifier + * + * Notifier functions for hardware and software that establishes external + * references to pages of a Linux system. The notifier calls ensure that + * external mappings are removed when the Linux VM removes memory ranges + * or indivi
[patch 3/4] mmu_notifier: invalidate_page callbacks
Two callbacks to remove individual pages as done in rmap code invalidate_page() Called from the inner loop of rmap walks to invalidate pages. age_page() Called for the determination of the page referenced status. If we do not care about page referenced status then an age_page callback may be be omitted. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Robin Holt <[EMAIL PROTECTED]> --- mm/rmap.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) Index: linux-2.6/mm/rmap.c === --- linux-2.6.orig/mm/rmap.c2008-01-31 19:55:45.0 -0800 +++ linux-2.6/mm/rmap.c 2008-01-31 20:28:35.0 -0800 @@ -49,6 +49,7 @@ #include #include #include +#include #include @@ -284,7 +285,8 @@ static int page_referenced_one(struct pa if (!pte) goto out; - if (ptep_clear_flush_young(vma, address, pte)) + if (ptep_clear_flush_young(vma, address, pte) | + mmu_notifier_age_page(mm, address)) referenced++; /* Pretend the page is referenced if the task has the @@ -434,6 +436,7 @@ static int page_mkclean_one(struct page flush_cache_page(vma, address, pte_pfn(*pte)); entry = ptep_clear_flush(vma, address, pte); + mmu_notifier(invalidate_page, mm, address); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -677,7 +680,8 @@ static int try_to_unmap_one(struct page * skipped over this mm) then we should reactivate it. */ if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young(vma, address, pte { + (ptep_clear_flush_young(vma, address, pte) | + mmu_notifier_age_page(mm, address { ret = SWAP_FAIL; goto out_unmap; } @@ -685,6 +689,7 @@ static int try_to_unmap_one(struct page /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); pteval = ptep_clear_flush(vma, address, pte); + mmu_notifier(invalidate_page, mm, address); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -809,12 +814,14 @@ static void try_to_unmap_cluster(unsigne page = vm_normal_page(vma, address, *pte); BUG_ON(!page || PageAnon(page)); - if (ptep_clear_flush_young(vma, address, pte)) + if (ptep_clear_flush_young(vma, address, pte) | + mmu_notifier_age_page(mm, address)) continue; /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush(vma, address, pte); + mmu_notifier(invalidate_page, mm, address); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 2/4] mmu_notifier: Callbacks to invalidate address ranges
The invalidation of address ranges in a mm_struct needs to be performed when pages are removed or permissions etc change. invalidate_range_begin/end() is frequently called with only mmap_sem held. If invalidate_range_begin() is called with locks held then we pass a flag into invalidate_range() to indicate that no sleeping is possible. In two cases we use invalidate_range_begin/end to invalidate single pages because the pair allows holding off new references (idea by Robin Holt). do_wp_page(): We hold off new references while update the pte. xip_unmap: We are not taking the PageLock so we cannot use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end stands in. Comments state that mmap_sem must be held for remap_pfn_range() but various drivers do not seem to do this. Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> Signed-off-by: Robin Holt <[EMAIL PROTECTED]> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- mm/filemap_xip.c |5 + mm/fremap.c |3 +++ mm/hugetlb.c |3 +++ mm/memory.c | 24 ++-- mm/mmap.c|2 ++ mm/mremap.c |7 ++- 6 files changed, 41 insertions(+), 3 deletions(-) Index: linux-2.6/mm/fremap.c === --- linux-2.6.orig/mm/fremap.c 2008-01-31 20:56:03.0 -0800 +++ linux-2.6/mm/fremap.c 2008-01-31 20:59:14.0 -0800 @@ -15,6 +15,7 @@ #include #include #include +#include #include #include @@ -211,7 +212,9 @@ asmlinkage long sys_remap_file_pages(uns spin_unlock(&mapping->i_mmap_lock); } + mmu_notifier(invalidate_range_begin, mm, start, start + size, 0); err = populate_range(mm, vma, start, size, pgoff); + mmu_notifier(invalidate_range_end, mm, start, start + size, 0); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); Index: linux-2.6/mm/memory.c === --- linux-2.6.orig/mm/memory.c 2008-01-31 20:56:03.0 -0800 +++ linux-2.6/mm/memory.c 2008-01-31 20:59:14.0 -0800 @@ -50,6 +50,7 @@ #include #include #include +#include #include #include @@ -601,6 +602,9 @@ int copy_page_range(struct mm_struct *ds if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0); + dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { @@ -611,6 +615,11 @@ int copy_page_range(struct mm_struct *ds vma, addr, next)) return -ENOMEM; } while (dst_pgd++, src_pgd++, addr = next, addr != end); + + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier(invalidate_range_end, src_mm, + vma->vm_start, end, 0); + return 0; } @@ -883,13 +892,16 @@ unsigned long zap_page_range(struct vm_a struct mmu_gather *tlb; unsigned long end = address + size; unsigned long nr_accounted = 0; + int atomic = details ? 
(details->i_mmap_lock != 0) : 0; lru_add_drain(); tlb = tlb_gather_mmu(mm, 0); update_hiwater_rss(mm); + mmu_notifier(invalidate_range_begin, mm, address, end, atomic); end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); if (tlb) tlb_finish_mmu(tlb, address, end); + mmu_notifier(invalidate_range_end, mm, address, end, atomic); return end; } @@ -1318,7 +1330,7 @@ int remap_pfn_range(struct vm_area_struc { pgd_t *pgd; unsigned long next; - unsigned long end = addr + PAGE_ALIGN(size); + unsigned long start = addr, end = addr + PAGE_ALIGN(size); struct mm_struct *mm = vma->vm_mm; int err; @@ -1352,6 +1364,7 @@ int remap_pfn_range(struct vm_area_struc pfn -= addr >> PAGE_SHIFT; pgd = pgd_offset(mm, addr); flush_cache_range(vma, addr, end); + mmu_notifier(invalidate_range_begin, mm, start, end, 0); do { next = pgd_addr_end(addr, end); err = remap_pud_range(mm, pgd, addr, next, @@ -1359,6 +1372,7 @@ int remap_pfn_range(struct vm_area_struc if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier(invalidate_range_end, mm, start, end, 0); return err; } EXPORT_SYMBOL(remap_pfn_range); @@ -1442,10 +1456,11 @@ int apply_to_page_range(struct mm_struct { pgd_t *pgd; unsigned long next; - unsigned long end = addr + size; + unsigned lon
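The calling convention this patch introduces can be summarized in a few lines: every range update is bracketed by a begin/end pair, and the atomic flag tells the notifier user whether it may sleep in its callback. The following is a hedged user-space sketch of that caller-side pattern with made-up function names, not the kernel code.

#include <stdio.h>

static void invalidate_range_begin(unsigned long start, unsigned long end,
				   int atomic)
{
	printf("begin %lx-%lx (atomic=%d): hold off new external refs\n",
	       start, end, atomic);
}

static void invalidate_range_end(unsigned long start, unsigned long end,
				 int atomic)
{
	printf("end   %lx-%lx (atomic=%d): external refs may be rebuilt\n",
	       start, end, atomic);
}

/* Stand-in for a path like sys_remap_file_pages() or zap_page_range() */
static void update_range(unsigned long start, unsigned long end, int atomic)
{
	invalidate_range_begin(start, end, atomic);
	/* ... modify the ptes for [start, end) here ... */
	invalidate_range_end(start, end, atomic);
}

int main(void)
{
	update_range(0x1000, 0x3000, 0);  /* caller only holds mmap_sem */
	update_range(0x4000, 0x5000, 1);  /* caller holds a spinlock */
	return 0;
}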
[patch 4/4] mmu_notifier: Support for drivers with reverse maps (f.e. for XPmem)
Support for an additional 3rd class of users of mmu_notifier. These special additional callbacks are required because XPmem does use its own rmap (multiple processes on a serires of remote Linux instances may be accessing the memory of a process). XPmem may have to send out notifications to remote Linux instances and receive confirmation before a page can be freed. So we handle this like an additional Linux reverse map that is walked after the existing rmaps have been walked. We leave the walking to the driver that is then able to use something else than a spinlock to walk its reverse maps. So we can actually call the driver without holding spinlocks. However, we cannot determine the mm_struct that a page belongs to. That will have to be determined by the device driver. Therefore we need to have a global list of reverse map callbacks. We add another pageflag (PageExternalRmap) that is set if a page has been remotely mapped (f.e. by a process from another Linux instance). We can then only perform the callbacks for pages that are actually in remote use. Rmap notifiers need an extra page bit and are only available on 64 bit platforms. This functionality is not available on 32 bit! A notifier that uses the reverse maps callbacks does not need to provide the invalidate_page() methods that are called when locks are held. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- include/linux/mmu_notifier.h | 70 +-- include/linux/page-flags.h | 11 ++ mm/mmu_notifier.c| 36 +- mm/rmap.c|9 + 4 files changed, 123 insertions(+), 3 deletions(-) Index: linux-2.6/include/linux/page-flags.h === --- linux-2.6.orig/include/linux/page-flags.h 2008-01-31 20:56:03.0 -0800 +++ linux-2.6/include/linux/page-flags.h2008-01-31 21:00:40.0 -0800 @@ -105,6 +105,7 @@ * 64 bit | FIELDS | ?? FLAGS | * 6332 0 */ +#define PG_external_rmap 30 /* Page has external rmap */ #define PG_uncached31 /* Page has been mapped as uncached */ #endif @@ -260,6 +261,16 @@ static inline void __ClearPageTail(struc #define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags) #define ClearPageUncached(page)clear_bit(PG_uncached, &(page)->flags) +#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT) +#define PageExternalRmap(page) test_bit(PG_external_rmap, &(page)->flags) +#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags) +#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \ + &(page)->flags) +#else +#define ClearPageExternalRmap(page) do {} while (0) +#define PageExternalRmap(page) 0 +#endif + struct page; /* forward declaration */ extern void cancel_dirty_page(struct page *page, unsigned int account_size); Index: linux-2.6/include/linux/mmu_notifier.h === --- linux-2.6.orig/include/linux/mmu_notifier.h 2008-01-31 20:58:05.0 -0800 +++ linux-2.6/include/linux/mmu_notifier.h 2008-01-31 21:00:40.0 -0800 @@ -23,6 +23,18 @@ * where sleeping is allowed or in atomic contexts. A flag is passed * to indicate an atomic context. * + * + * 2. mmu_rmap_notifier + * + * Callbacks for subsystems that provide their own rmaps. These + * need to walk their own rmaps for a page. The invalidate_page + * callback is outside of locks so that we are not in a strictly + * atomic context (but we may be in a PF_MEMALLOC context if the + * notifier is called from reclaim code) and are able to sleep. + * + * Rmap notifiers need an extra page bit and are only available + * on 64 bit platforms. + * * Pages must be marked dirty if dirty bits are found to be set in * the external ptes. 
*/ @@ -89,8 +101,26 @@ struct mmu_notifier_ops { int atomic); void (*invalidate_range_end)(struct mmu_notifier *mn, -unsigned long stat, unsigned long end, -struct mm_struct *mm, int atomic); +struct mm_struct *mm, +unsigned long start, unsigned long end, +int atomic); +}; + +struct mmu_rmap_notifier_ops; + +struct mmu_rmap_notifier { + struct hlist_node hlist; + const struct mmu_rmap_notifier_ops *ops; +}; + +struct mmu_rmap_notifier_ops { + /* +* Called with the page lock held after ptes are modified or removed +* so that a subsystem with its own rmap's can remove remote ptes +* mapping a pa
Re: [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges
On Fri, 1 Feb 2008, Robin Holt wrote: > Maybe I haven't looked closely enough, but let's start with some common > assumptions. Looking at do_wp_page from 2.6.24 (I believe that is what > my work area is based upon). On line 1559, the function begins being > declared. Aah I looked at the wrong file. > On lines 1614 and 1630, we do "goto unlock" where the _end callout is > soon made. The _begin callout does not come until after those branches > have been taken (occurs on line 1648). There are actually two cases... --- mm/memory.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) Index: linux-2.6/mm/memory.c === --- linux-2.6.orig/mm/memory.c 2008-02-01 11:04:21.0 -0800 +++ linux-2.6/mm/memory.c 2008-02-01 11:12:12.0 -0800 @@ -1611,8 +1611,10 @@ static int do_wp_page(struct mm_struct * page_table = pte_offset_map_lock(mm, pmd, address, &ptl); page_cache_release(old_page); - if (!pte_same(*page_table, orig_pte)) - goto unlock; + if (!pte_same(*page_table, orig_pte)) { + pte_unmap_unlock(page_table, ptl); + goto check_dirty; + } page_mkwrite = 1; } @@ -1628,7 +1630,8 @@ static int do_wp_page(struct mm_struct * if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, entry); ret |= VM_FAULT_WRITE; - goto unlock; + pte_unmap_unlock(page_table, ptl); + goto check_dirty; } /* @@ -1684,10 +1687,10 @@ gotten: page_cache_release(new_page); if (old_page) page_cache_release(old_page); -unlock: pte_unmap_unlock(page_table, ptl); mmu_notifier(invalidate_range_end, mm, address, address + PAGE_SIZE, 0); +check_dirty: if (dirty_page) { if (vma->vm_file) file_update_time(vma->vm_file); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/4] mmu_notifier: Core code
On Fri, 1 Feb 2008, Robin Holt wrote: > OK. Now that release has been moved, I think I agree with you that the > down_write(mmap_sem) can be used as our lock again and still work for > Jack. I would like a ruling from Jack as well. Talked to Jack last night and he said it's okay. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
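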
Re: [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges
Argh. Did not see this soon enougn. Maybe this one is better since it avoids the additional unlocks? On Fri, 1 Feb 2008, Robin Holt wrote: > do_wp_page can reach the _end callout without passing the _begin > callout. This prevents making the _end unles the _begin has also > been made. > > Index: mmu_notifiers-cl-v5/mm/memory.c > === > --- mmu_notifiers-cl-v5.orig/mm/memory.c 2008-02-01 04:44:03.0 > -0600 > +++ mmu_notifiers-cl-v5/mm/memory.c 2008-02-01 04:46:18.0 -0600 > @@ -1564,7 +1564,7 @@ static int do_wp_page(struct mm_struct * > { > struct page *old_page, *new_page; > pte_t entry; > - int reuse = 0, ret = 0; > + int reuse = 0, ret = 0, invalidate_started = 0; > int page_mkwrite = 0; > struct page *dirty_page = NULL; > > @@ -1649,6 +1649,8 @@ gotten: > > mmu_notifier(invalidate_range_begin, mm, address, > address + PAGE_SIZE, 0); > + invalidate_started = 1; > + > /* >* Re-check the pte - we dropped the lock >*/ > @@ -1687,7 +1689,8 @@ gotten: > page_cache_release(old_page); > unlock: > pte_unmap_unlock(page_table, ptl); > - mmu_notifier(invalidate_range_end, mm, > + if (invalidate_started) > + mmu_notifier(invalidate_range_end, mm, > address, address + PAGE_SIZE, 0); > if (dirty_page) { > if (vma->vm_file) > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Extending mmu_notifiers to handle __xip_unmap in a sleepable context?
On Fri, 1 Feb 2008, Robin Holt wrote: > Currently, it is calling mmu_notifier _begin and _end under the > i_mmap_lock. I _THINK_ the following will make it so we could support > __xip_unmap (although I don't recall ever seeing that done on ia64 and > don't even know what the circumstances are for its use). It's called under lock, yes. The problem with this fix is that we currently require the page lock to be held across the rmap invalidate_all call, and that is not the case here. So I used _begin/_end to skirt the issue. If you do not need the page lock to be held (it holds off modifications on the page!) then we are fine. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mmu notifiers #v5
On Fri, 1 Feb 2008, Andrea Arcangeli wrote: > Note that my #v5 doesn't require to increase the page count all the > time, so GRU will work fine with #v5. But that comes with the cost of firing invalidate_page for every page being evicted. In order to make your single invalidate_range work without it you need to hold a refcount on the page. > invalidate_page[s] is always called before the page is freed. This > will require modifications to the tlb flushing code logic to take > advantage of _pages in certain places. For now it's just safe. Yes, so your invalidate_range is still some sort of dysfunctional optimization? Gazillions of invalidate_page's will have to be executed when tearing down large memory areas. > > How does KVM insure the consistency of the shadow page tables? Atomic ops? > > A per-VM mmu_lock spinlock is taken to serialize the access, plus > atomic ops for the cpu. And that would not be enough to hold off new references? With small tweaks this should work with a common scheme. We could also redefine the role of _start and _end slightly to just require that the refs are removed when _end completes. That would allow the KVM page count ref to work as is now and would avoid the individual invalidate_page() callouts. > > The GRU has no page table on its own. It populates TLB entries on demand > > using the linux page table. There is no way it can figure out when to > > drop page counts again. The invalidate calls are turned directly into tlb > > flushes. > > Yes, this is why it can't serialize follow_page with only the PT lock > with your patch. KVM may do it once you add start,end to range_end > only thanks to the additional pin on the page. Right, but that pin requires taking a refcount which we cannot do. Frankly, this looks like a solution that would work only for KVM. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
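To make the cost argument above concrete: tearing down a large mapping with one callout per page means one external shootdown per page, while a ranged callout needs exactly one. A back-of-the-envelope user-space toy (hypothetical names, the counts are the point):

#include <stdio.h>

#define PAGE_SIZE 4096UL

static unsigned long callouts;

static void invalidate_page(unsigned long addr)
{
	(void)addr;
	callouts++;			/* one external shootdown per page */
}

static void invalidate_range(unsigned long start, unsigned long end)
{
	(void)start; (void)end;
	callouts++;			/* one shootdown for the whole range */
}

int main(void)
{
	unsigned long size = 1UL << 30;	/* 1 GB mapping being torn down */
	unsigned long addr;

	callouts = 0;
	for (addr = 0; addr < size; addr += PAGE_SIZE)
		invalidate_page(addr);
	printf("per-page callouts: %lu\n", callouts);	/* 262144 */

	callouts = 0;
	invalidate_range(0, size);
	printf("ranged callouts:   %lu\n", callouts);	/* 1 */
	return 0;
}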
Re: [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges
On Fri, 1 Feb 2008, Robin Holt wrote: > We are getting this callout when we transition the pte from a read-only > to read-write. Jack and I can not see a reason we would need that > callout. It is causing problems for xpmem in that a write fault goes > to get_user_pages which gets back to do_wp_page that does the callout. Right. You placed it there in the first place. So we can drop the code from do_wp_page? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges
On Fri, 1 Feb 2008, Robin Holt wrote: > On Fri, Feb 01, 2008 at 03:19:32PM -0800, Christoph Lameter wrote: > > On Fri, 1 Feb 2008, Robin Holt wrote: > > > > > We are getting this callout when we transition the pte from a read-only > > > to read-write. Jack and I can not see a reason we would need that > > > callout. It is causing problems for xpmem in that a write fault goes > > > to get_user_pages which gets back to do_wp_page that does the callout. > > > > Right. You placed it there in the first place. So we can drop the code > > from do_wp_page? > > No, we need a callout when we are becoming more restrictive, but not > when becoming more permissive. I would have to guess that is the case > for any of these callouts. It is for both GRU and XPMEM. I would > expect the same is true for KVM, but would like a ruling from Andrea on > that. do_wp_page is entered when the pte shows that the page is not writeable and it makes the page writable in some situations. Then we do not invalidate the remote reference. However, when we do COW then a *new* page is put in place of the existing readonly page. At that point we need to remove the remote pte that is readonly. Then we install a new pte pointing to a *different* page that is writable. Are you saying that you get the callback when transitioning from a read only to a read write pte on the *same* page? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
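The distinction drawn above can be condensed into a small sketch. Helper names here are hypothetical; the logic is the point: only when do_wp_page() installs a *different* page (the COW case) does the external mapping need to be torn down, not when the same page merely becomes writable.

#include <stdbool.h>
#include <stdio.h>

struct page { int id; };

static void remote_invalidate(unsigned long addr)
{
	printf("remove external pte for %lx\n", addr);
}

static void wp_fault(unsigned long addr, struct page *old_page, bool can_reuse)
{
	if (can_reuse) {
		/* Same page, pte just gains write permission: the external
		 * read-only mapping stays valid, no callout needed. */
		printf("reuse page %d, make pte writable at %lx\n",
		       old_page->id, addr);
		return;
	}

	/* COW: a new page replaces old_page, so the external pte that
	 * still points at old_page must be removed first. */
	struct page new_page = { .id = old_page->id + 1 };

	remote_invalidate(addr);
	printf("install new page %d writable at %lx\n", new_page.id, addr);
}

int main(void)
{
	struct page p = { .id = 1 };

	wp_fault(0x1000, &p, true);	/* e.g. single mapper, reuse */
	wp_fault(0x2000, &p, false);	/* shared page, COW */
	return 0;
}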
Re: [patch 22/27] quicklist: Set tlb->need_flush if pages are remaining in quicklist 0
NO! Wrong fix. Was dropped from mainline. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 22/27] quicklist: Set tlb->need_flush if pages are remaining in quicklist 0
On Fri, 1 Feb 2008, Justin M. Forbes wrote: > > On Fri, 2008-02-01 at 16:39 -0800, Christoph Lameter wrote: > > NO! Wrong fix. Was dropped from mainline. > > What is the right fix for the OOM issues with 2.6.22? Perhaps > http://marc.info/?l=linux-mm&m=119973653803451&w=2 should be added to > the queue in its place? The OOM issue in 2.6.22 is real, and should be > addressed. Indeed that is the right fix. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/3] mmu_notifier: Core code
On Sun, 3 Feb 2008, Andrea Arcangeli wrote: > On Thu, Jan 31, 2008 at 07:58:40PM -0800, Christoph Lameter wrote: > > Ok. Andrea wanted the same because then he can void the begin callouts. > > Exactly. I hope the page-pin will avoid me having to serialize the KVM > page fault against the start/end critical section. > > BTW, I wonder if the start/end critical section API is intended to > forbid scheduling inside it. In short I wonder if GRU can is allowed > to take a spinlock in _range_start as last thing before returning, and > to release that same spinlock in _range_end as first thing, and not to > be forced to use a mutex. _begin/end encloses code that may sleep and _begin/_end itself may sleep. So a semaphore may work but not a spinlock. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
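A minimal sketch of what the answer above permits: because the _begin/_end callouts may sleep, a notifier user can take a sleeping lock in its range_begin handler and drop it in range_end, which would be illegal with a spinlock held across a schedule. This is a user-space model with pthreads and made-up names, not a driver implementation.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t external_map_lock = PTHREAD_MUTEX_INITIALIZER;

static void range_begin(unsigned long start, unsigned long end)
{
	pthread_mutex_lock(&external_map_lock);	/* may sleep: fine here */
	printf("external faults on %lx-%lx now blocked\n", start, end);
}

static void range_end(unsigned long start, unsigned long end)
{
	printf("range %lx-%lx invalidated, re-enable faults\n", start, end);
	pthread_mutex_unlock(&external_map_lock);
}

/* A driver fault handler simply serializes against the critical section */
static void external_fault(unsigned long addr)
{
	pthread_mutex_lock(&external_map_lock);
	printf("re-establish external pte for %lx\n", addr);
	pthread_mutex_unlock(&external_map_lock);
}

int main(void)
{
	range_begin(0x1000, 0x2000);
	/* ... VM tears down the ptes, possibly sleeping ... */
	range_end(0x1000, 0x2000);
	external_fault(0x1800);
	return 0;
}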
Re: [PATCH] mmu notifiers #v5
On Sun, 3 Feb 2008, Andrea Arcangeli wrote: > > Right but that pin requires taking a refcount which we cannot do. > > GRU can use my patch without the pin. XPMEM obviously can't use my > patch as my invalidate_page[s] are under the PT lock (a feature to fit > GRU/KVM in the simplest way), this is why an incremental patch adding > invalidate_range_start/end would be required to support XPMEM too. Doesn't the kernel in some situations release the page before releasing the pte lock? Then there will be an external pte pointing to a page that may now have a different use. It's really bad if that pte still allows writes. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
On Sat, 2 Feb 2008, Andi Kleen wrote: > To be honest I've never tried seriously to make 32bit NUMA policy > (with highmem) work well; just kept it at a "should not break" > level. That is because with highmem the kernel's choices at > placing memory are seriously limited anyways so I doubt 32bit > NUMA will ever work very well. Memory policies do not work reliably with config highmem (I have never seen such usage because large memory systems are typically 64 bit, which have no highmem, but there are some 32-bit NUMA uses of HIGHMEM). Memory policies are only applied to the highest zone. So if a system has highmem on some nodes but not on others then policies will only be applied if allocations happen to occur on the highmem nodes. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
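For context, this is roughly what numactl --interleave=all boils down to for the process: a set_mempolicy() call followed by allocations. A minimal illustration (requires the libnuma headers, link with -lnuma; the node mask is an assumption and must match the system). Whether the interleave then actually reaches a given node depends on the highest-zone handling described above.

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	unsigned long nodemask = 0x3;	/* nodes 0 and 1; adjust to the system */
	void *p;

	if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
			  sizeof(nodemask) * 8 + 1))
		perror("set_mempolicy");

	/* Allocations after this point should be interleaved across the
	 * nodes in the mask -- but since the policy is only consulted for
	 * the highest zone, a node without highmem may receive none of the
	 * interleaved pages on a 32-bit highmem kernel. */
	p = malloc(1 << 20);
	memset(p, 0, 1 << 20);	/* touch so pages are actually allocated */
	free(p);
	return 0;
}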
[git pull] SLUB updates for 2.6.25
Updates for slub are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm.git slub-linus Christoph Lameter (5): SLUB: Fix sysfs refcounting Move count_partial before kmem_cache_shrink SLUB: rename defrag to remote_node_defrag_ratio Add parameter to add_partial to avoid having two functions Explain kmem_cache_cpu fields Harvey Harrison (1): slub: fix shadowed variable sparse warnings Pekka Enberg (1): SLUB: Fix coding style violations root (1): SLUB: Do not upset lockdep include/linux/slub_def.h | 15 ++-- mm/slub.c| 182 +- 2 files changed, 108 insertions(+), 89 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [git pull] SLUB updates for 2.6.25
Hope to have the slub-mm repository set up tonight, which will simplify things in the future. Hope you still remember. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [git pull] SLUB updates for 2.6.25
On Tue, 5 Feb 2008, Nick Piggin wrote: > I'm sure it could have an effect. But why is the common case in SLUB > for the cacheline to be bouncing? What's the benchmark? What does SLAB > do in that benchmark, is it faster than SLUB there? What does the > non-atomic bit unlock do to Willy's database workload? I saw this in tbench and the test was done on a recent quad core Intel cpu. SLAB is 1 - 4% faster than SLUB on the 2 x Quad setup that I am using here to test. Not sure if what I think it is is really the issue. I added some statistics to SLUB to figure out what exactly is going on and it seems that the remote handoff may not be the issue: Name ObjectsAlloc Free %Fast skbuff_fclone_cache 33 111953835 111953835 99 99 :192 2666 5283688 5281047 99 99 :0001024 849 5247230 5246389 83 83 vm_area_struct1349 119642 118355 91 22 :0004096156675366751 98 98 :064 20672529723383 98 78 dentry 102592863518464 91 45 :080 1100418950 8089 98 98 :096 17031235810784 99 98 :128 76210582 9875 94 18 :512 184 9807 9647 95 81 :0002048 479 9669 9195 83 65 anon_vma 777 9461 9002 99 71 kmalloc-8 6492 9981 5624 99 97 :768 258 7174 6931 58 15 slabinfo -a | grep 000192 :192 <- xfs_btree_cur filp kmalloc-192 uid_cache tw_sock_TCP request_sock_TCPv6 tw_sock_TCPv6 skbuff_head_cache xfs_ili Likely skbuff_head_cache. slabinfo skbuff_fclone_cache Slab Perf Counter Alloc Free %Al %Fr -- Fastpath 111953360 111946981 99 99 Slowpath 1044 7423 0 0 Page Alloc272 264 0 0 Add partial25 325 0 0 Remove partial 86 264 0 0 RemoteObj/SlabFrozen 350 4832 0 0 Total111954404 111954404 Flushes 49 Refill0 Deactivate Full=325(92%) Empty=0(0%) ToHead=24(6%) ToTail=1(0%) There is only minimal handoff here. skbuff_head_cache: Slab Perf Counter Alloc Free %Al %Fr -- Fastpath 5297262 5259882 99 99 Slowpath 447739586 0 0 Page Alloc937 824 0 0 Add partial 0 2515 0 0 Remove partial 1691 824 0 0 RemoteObj/SlabFrozen 2621 9684 0 0 Total 5301739 5299468 Deactivate Full=2620(100%) Empty=0(0%) ToHead=0(0%) ToTail=0(0%) Some more handoff but still basically the same. Need to dig into this some more. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [git pull] SLUB updates for 2.6.25
On Tue, 5 Feb 2008, Nick Piggin wrote: > > erk, sorry, I misremembered. I was about to merge all the patches we > > weren't going to merge. oops. > > While you're there, can you drop the patch(es?) I commented on > and didn't get an answer to. Like the ones that open code their > own locking primitives and do risky looking things with barriers > to boot... That patch will be moved to a special archive for microbenchmarks. It shows the same issues as the __unlock patch. > Also, WRT this one: > slub-use-non-atomic-bit-unlock.patch > > This is strange that it is unwanted. Avoiding atomic operations > is a pretty good idea. The fact that it appears to be slower on > some microbenchmark on some architecture IMO either means that > their __clear_bit_unlock or the CPU isn't implemented so well... It's slower on x86_64, and that is a pretty important arch. So I am going to defer this until we have analyzed the situation some more. Could there be some effect of atomic ops on the speed with which a cacheline is released? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [git pull] SLUB updates for 2.6.25
On Tue, 5 Feb 2008, Nick Piggin wrote: > Ok. But the approach is just not so good. If you _really_ need something > like that and it is a win over the regular non-atomic unlock, then you > just have to implement it as a generic locking / atomic operation and > allow all architectures to implement the optimal (and correct) memory > barriers. Assuming this really gives a benefit on several benchmarks, we need to think about how to do this some more. It's a rather strange form of locking. Basically you lock the page with a single atomic operation that sets PageLocked and retrieves the page flags. Then we shovel the page state around a couple of functions in a register and finally store the page state back, which at the same time unlocks the page. So two memory references, one of them atomic, with none in between. We have nothing that can do something like that right now. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
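A rough user-space model of that "strange form of locking", using gcc atomic builtins: one atomic read-modify-write acquires the lock bit and hands back the flags word, the flags are then shuffled around in a register, and a single release store both publishes the new flags and drops the lock. This is a sketch of the idea only, with made-up flag names, not the slub code.

#include <stdio.h>

#define PG_LOCKED	(1UL << 0)
#define PG_FROZEN	(1UL << 1)

static unsigned long page_flags = PG_FROZEN;

static unsigned long lock_and_fetch_flags(void)
{
	unsigned long old;

	/* Spin until we are the one who sets PG_LOCKED; the same RMW also
	 * hands us the current flags word. */
	do {
		old = __atomic_fetch_or(&page_flags, PG_LOCKED,
					__ATOMIC_ACQUIRE);
	} while (old & PG_LOCKED);

	return old;		/* flags as they were; lock is now held */
}

static void store_and_unlock(unsigned long flags)
{
	/* One store publishes the modified flags and clears the lock bit. */
	__atomic_store_n(&page_flags, flags & ~PG_LOCKED, __ATOMIC_RELEASE);
}

int main(void)
{
	unsigned long flags = lock_and_fetch_flags();

	flags &= ~PG_FROZEN;	/* state manipulated in a register */
	store_and_unlock(flags);

	printf("flags now %#lx\n", page_flags);
	return 0;
}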
Re: [git pull] SLUB updates for 2.6.25
On Tue, 5 Feb 2008, Nick Piggin wrote: > Anyway, not saying the operations are useless, but they should be > made available to core kernel and implemented per-arch. (if they are > found to be useful) The problem is establishing the usefulness. These measures may bring 1-2% in a pretty unstable operation mode, assuming that the system is doing repetitive work. The micro-optimizations often seem to be drowned out by other small changes to the system. There is the danger that a gain is seen that is not due to the patch itself but to code being moved around, since patches change execution paths. Plus they may only be possible on a specific architecture. I know that our IA64 hardware has special measures ensuring certain behavior of atomic ops etc.; I guess Intel has similar tricks up their sleeve. At 8p there are likely increasing problems with lock starvation, where your ticketlock helps. That is why I thought we had better defer the stuff until there is some more evidence that these are useful. I got particularly nervous about these changes after I saw small performance drops due to the __unlock patch on the dual quad. That should have been a consistent gain. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
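For reference, the fairness property alluded to above can be shown with a minimal ticket lock in user-space C (gcc builtins): waiters are served strictly in arrival order, so no CPU can be starved the way it can with a plain test-and-set lock. Illustrative only, not the kernel implementation.

#include <stdio.h>

struct ticketlock {
	unsigned int next;	/* next ticket to hand out */
	unsigned int owner;	/* ticket currently being served */
};

static void ticket_lock(struct ticketlock *l)
{
	unsigned int ticket = __atomic_fetch_add(&l->next, 1,
						 __ATOMIC_RELAXED);

	/* Spin until our number comes up; arrival order == service order. */
	while (__atomic_load_n(&l->owner, __ATOMIC_ACQUIRE) != ticket)
		;
}

static void ticket_unlock(struct ticketlock *l)
{
	/* Only the current owner advances the owner field. */
	__atomic_store_n(&l->owner, l->owner + 1, __ATOMIC_RELEASE);
}

int main(void)
{
	struct ticketlock l = { 0, 0 };

	ticket_lock(&l);
	printf("holding ticket 0\n");
	ticket_unlock(&l);
	return 0;
}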
Re: [PATCH] mmu notifiers #v5
On Tue, 5 Feb 2008, Andrea Arcangeli wrote: > On Mon, Feb 04, 2008 at 11:09:01AM -0800, Christoph Lameter wrote: > > On Sun, 3 Feb 2008, Andrea Arcangeli wrote: > > > > > > Right but that pin requires taking a refcount which we cannot do. > > > > > > GRU can use my patch without the pin. XPMEM obviously can't use my > > > patch as my invalidate_page[s] are under the PT lock (a feature to fit > > > GRU/KVM in the simplest way), this is why an incremental patch adding > > > invalidate_range_start/end would be required to support XPMEM too. > > > > Doesnt the kernel in some situations release the page before releasing the > > pte lock? Then there will be an external pte pointing to a page that may > > now have a different use. Its really bad if that pte does allow writes. > > Sure the kernel does that most of the time, which is for example why I > had to use invalidate_page instead of invalidate_pages inside > zap_pte_range. Zero problems with that (this is also the exact reason > why I mentioned the tlb flushing code would need changes to convert > some page in pages). Zero problems only if you find having a single callout for every page acceptable. So the invalidate_range in your patch is only working sometimes. And even if it works then it has to be used on 2M range. Seems to be a bit fragile and needlessly complex. "conversion of some page in pages"? A proposal to defer the freeing of the pages until after the pte_unlock? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
SLUB: Support for statistics to help analyze allocator behavior
The statistics provided here allow the monitoring of allocator behavior at the cost of some (minimal) loss of performance. Counters are placed in SLUB's per cpu data structure that is already written to by other code. The per cpu structure may be extended by the statistics to be more than one cacheline which will increase the cache footprint of SLUB. That is why there is a compile option to enable/disable the inclusion of the statistics module. The slabinfo tool is enhanced to support these statistics via two options: -D Switches the line of information displayed for a slab from size mode to activity mode. -A Sorts the slabs displayed by activity. This allows the display of the slabs most important to the performance of a certain load. -r Report option will report detailed statistics on Example (tbench load): slabinfo -AD->Shows the most active slabs Name ObjectsAlloc Free %Fast skbuff_fclone_cache 33 111953835 111953835 99 99 :192 2666 5283688 5281047 99 99 :0001024 849 5247230 5246389 83 83 vm_area_struct1349 119642 118355 91 22 :0004096156675366751 98 98 :064 20672529723383 98 78 dentry 102592863518464 91 45 :080 1100418950 8089 98 98 :096 17031235810784 99 98 :128 76210582 9875 94 18 :512 184 9807 9647 95 81 :0002048 479 9669 9195 83 65 anon_vma 777 9461 9002 99 71 kmalloc-8 6492 9981 5624 99 97 :768 258 7174 6931 58 15 So the skbuff_fclone_cache is of highest importance for the tbench load. Pretty high load on the 192 sized slab. Look for the aliases slabinfo -a | grep 000192 :192 <- xfs_btree_cur filp kmalloc-192 uid_cache tw_sock_TCP request_sock_TCPv6 tw_sock_TCPv6 skbuff_head_cache xfs_ili Likely skbuff_head_cache. Looking into the statistics of the skbuff_fclone_cache is possible through slabinfo skbuff_fclone_cache->-r option implied if cache name is mentioned Usual output ... Slab Perf Counter Alloc Free %Al %Fr -- Fastpath 111953360 111946981 99 99 Slowpath 1044 7423 0 0 Page Alloc272 264 0 0 Add partial25 325 0 0 Remove partial 86 264 0 0 RemoteObj/SlabFrozen 350 4832 0 0 Total111954404 111954404 Flushes 49 Refill0 Deactivate Full=325(92%) Empty=0(0%) ToHead=24(6%) ToTail=1(0%) Looks good because the fastpath is overwhelmingly taken. skbuff_head_cache: Slab Perf Counter Alloc Free %Al %Fr -- Fastpath 5297262 5259882 99 99 Slowpath 447739586 0 0 Page Alloc937 824 0 0 Add partial 0 2515 0 0 Remove partial 1691 824 0 0 RemoteObj/SlabFrozen 2621 9684 0 0 Total 5301739 5299468 Deactivate Full=2620(100%) Empty=0(0%) ToHead=0(0%) ToTail=0(0%) Less good because the proportion of slowpath is a bit higher here. Descriptions of the output: Total: The total number of allocation and frees that occurred for a slab Fastpath: The number of allocations/frees that used the fastpath. Slowpath: Other allocations Page Alloc: Number of calls to the page allocator as a result of slowpath processing Add Partial:Number of slabs added to the partial list through free or alloc (occurs during cpuslab flushes) Remove Partial: Number of slabs removed from the partial list as a result of allocations retrieving a partial slab or by a free freeing the last object of a slab. RemoteObj/Froz: How many times were remotely freed object encountered when a slab was about to be deactivated. Frozen: How many times was free able to skip list processing because the slab was in use as the cpuslab of another processor. 
Flushes:        Number of times the cpuslab was flushed on request
                (kmem_cache_shrink, may result from races in __slab_alloc)

Refill:         Number of times we were able to refill the cpuslab from
                remotely freed objects for the same slab.

Deactivate:     Statistics on how slabs were deactivated. Shows how they
                were put onto the partial list.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 Documentation/vm/slabinfo.c | 149 +++
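A toy model of how such counters work and how slabinfo arrives at the %Fast columns: one counter array per cpu, bumped on the hot paths, summed for reporting. Names and numbers here are illustrative, not the actual slub implementation.

#include <stdio.h>

enum stat_item { ALLOC_FASTPATH, ALLOC_SLOWPATH, FREE_FASTPATH,
		 FREE_SLOWPATH, NR_STATS };

#define NR_CPUS 2

static unsigned long stats[NR_CPUS][NR_STATS];

static void stat_inc(int cpu, enum stat_item item)
{
	stats[cpu][item]++;	/* cheap: hits a per-cpu cacheline */
}

int main(void)
{
	unsigned long total[NR_STATS] = { 0 };
	int cpu, i;

	/* Pretend cpu 0 mostly hits the fastpath, cpu 1 less so */
	for (i = 0; i < 990; i++)
		stat_inc(0, ALLOC_FASTPATH);
	for (i = 0; i < 10; i++)
		stat_inc(0, ALLOC_SLOWPATH);
	for (i = 0; i < 700; i++)
		stat_inc(1, ALLOC_FASTPATH);
	for (i = 0; i < 300; i++)
		stat_inc(1, ALLOC_SLOWPATH);

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		for (i = 0; i < NR_STATS; i++)
			total[i] += stats[cpu][i];

	printf("alloc fastpath %%: %lu\n",
	       100 * total[ALLOC_FASTPATH] /
	       (total[ALLOC_FASTPATH] + total[ALLOC_SLOWPATH]));
	return 0;
}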
Re: SLUB: Support for statistics to help analyze allocator behavior
On Tue, 5 Feb 2008, Pekka J Enberg wrote: > Hi Christoph, > > On Mon, 4 Feb 2008, Christoph Lameter wrote: > > The statistics provided here allow the monitoring of allocator behavior > > at the cost of some (minimal) loss of performance. Counters are placed in > > SLUB's per cpu data structure that is already written to by other code. > > Looks good but I am wondering if we want to make the statistics per-CPU so > that we can see the kmalloc/kfree ping-pong of, for example, hackbench We could do that. Any idea how to display that kind of information in a meaningful way? Parameter conventions for slabinfo? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SLUB: Support for statistics to help analyze allocator behavior
On Tue, 5 Feb 2008, Pekka J Enberg wrote: > Heh, sure, but it's not exported to userspace which is required for > slabinfo to display the statistics. Well, we could do the same as for the numa stats: output the global count and then add c=count? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
Could we focus on the problem instead of discussing new patches under development? Can we confirm that what Kosaki sees is a bug? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mmu notifiers #v5
On Tue, 5 Feb 2008, Andrea Arcangeli wrote: > given I never allow a coherency-loss between two threads that will > read/write to two different physical pages for the same virtual > adddress in remap_file_pages). The other approach will not have any remote ptes at that point. Why would there be a coherency issue? > In performance terms with your patch before GRU can run follow_page it > has to take a mm-wide global mutex where each thread in all cpus will > have to take it. That will trash on >4-way when the tlb misses start No. It only has to lock the affected range. Remote page faults can occur while another part of the address space is being invalidated. The complexity of locking is up to the user of the mmu notifier. A simple implementation is satisfactory for the GRU right now. Should it become a problem then the lock granularity can be refined without changing the API. > > "conversion of some page in pages"? A proposal to defer the freeing of the > > pages until after the pte_unlock? > > There can be many tricks to optimize page in pages, but again munmap > and do_exit aren't the interesting path to optimzie, nor for GRU nor > for KVM so it doesn't matter right now. Still not sure what we are talking about here. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
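A sketch of the point being made above: the notifier user, not the core VM, picks the lock granularity. Here a hypothetical driver hashes the faulting address onto a small array of mutexes, so invalidating one range does not block external faults on unrelated ranges the way a single mm-wide mutex would. User-space model only; the chunk size and lock count are arbitrary.

#include <pthread.h>
#include <stdio.h>

#define RANGE_SHIFT	22		/* lock in 4 MB chunks */
#define NR_LOCKS	16

static pthread_mutex_t range_locks[NR_LOCKS];

static pthread_mutex_t *lock_for(unsigned long addr)
{
	return &range_locks[(addr >> RANGE_SHIFT) % NR_LOCKS];
}

static void driver_fault(unsigned long addr)
{
	pthread_mutex_lock(lock_for(addr));
	printf("populate external tlb entry for %lx\n", addr);
	pthread_mutex_unlock(lock_for(addr));
}

static void driver_invalidate(unsigned long addr)
{
	pthread_mutex_lock(lock_for(addr));
	printf("drop external tlb entries around %lx\n", addr);
	pthread_mutex_unlock(lock_for(addr));
}

int main(void)
{
	int i;

	for (i = 0; i < NR_LOCKS; i++)
		pthread_mutex_init(&range_locks[i], NULL);

	driver_fault(0x00400000);	/* uses a different lock than ... */
	driver_invalidate(0x10000000);	/* ... this invalidation */
	return 0;
}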