* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> Measurements on IA64 slub w/per cpu vs slub w/per cpu/cmpxchg_local
> emulation. Results are not good:
>
Hi Christoph,
I tried to come up with a patch set implementing the basics of a new
critical section: local_enter(flags) and local_exit(flags)
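
For reference, a minimal sketch of what such primitives could look like (the
actual patch body is not quoted here; the config symbol and the preempt-based
variant are assumptions, not the posted code):

	#ifdef CONFIG_HAVE_CMPXCHG_LOCAL	/* assumed symbol */
	/* a fast cmpxchg_local exists: pinning the task to its CPU is enough */
	#define local_enter(flags)	do { (void)(flags); preempt_disable(); } while (0)
	#define local_exit(flags)	do { (void)(flags); preempt_enable(); } while (0)
	#else
	/* no cheap local atomic: fall back to masking local interrupts */
	#define local_enter(flags)	local_irq_save(flags)
	#define local_exit(flags)	local_irq_restore(flags)
	#endif
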
On Tue, 2007-08-28 at 12:36 -0700, Christoph Lameter wrote:
> On Tue, 28 Aug 2007, Peter Zijlstra wrote:
>
> > On Mon, 2007-08-27 at 15:15 -0700, Christoph Lameter wrote:
> > > H. One wild idea would be to use a priority futex for the slab lock?
> > > That would make the slow paths interrupt
On Tue, 28 Aug 2007, Mathieu Desnoyers wrote:
> Ok, I just had a look at ia64 instruction set, and I fear that cmpxchg
> must always come with the acquire or release semantic. Is there any
> cmpxchg equivalent on ia64 that would be acquire and release semantic
> free ? This implicit memory orderin
On Tue, 28 Aug 2007, Peter Zijlstra wrote:
> On Mon, 2007-08-27 at 15:15 -0700, Christoph Lameter wrote:
> > H. One wild idea would be to use a priority futex for the slab lock?
> > That would make the slow paths interrupt safe without requiring interrupt
> > disable? Does a futex fit into t
Ok, I just had a look at ia64 instruction set, and I fear that cmpxchg
must always come with the acquire or release semantic. Is there any
cmpxchg equivalent on ia64 that is free of acquire and release semantics?
This implicit memory ordering in the instruction seems to be
responsible for the sl
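
For context, the "cmpxchg_local emulation" benchmarked at the top of the
thread amounts to a compare-and-exchange that is protected only against
local interrupts; roughly (a sketch, not the exact kernel code):

	static inline unsigned long cmpxchg_local_emul(unsigned long *ptr,
			unsigned long old, unsigned long new)
	{
		unsigned long flags, prev;

		local_irq_save(flags);		/* only local interrupts matter */
		prev = *ptr;
		if (prev == old)
			*ptr = new;
		local_irq_restore(flags);
		return prev;			/* success iff prev == old */
	}
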
On Mon, 2007-08-27 at 15:15 -0700, Christoph Lameter wrote:
> H. One wild idea would be to use a priority futex for the slab lock?
> That would make the slow paths interrupt safe without requiring interrupt
> disable? Does a futex fit into the page struct?
Very much puzzled at what you propo
Measurements on IA64 slub w/per cpu vs slub w/per cpu/cmpxchg_local
emulation. Results are not good:
slub/per cpu
1 times kmalloc(8)/kfree -> 105 cycles
1 times kmalloc(16)/kfree -> 104 cycles
1 times kmalloc(32)/kfree -> 105 cycles
1 times kmalloc(64)/kfree -> 104 cycles
1 ti
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
> Hrm, I just want to verify one thing: a lot of code paths seem to go
> to the slow path without requiring cmpxchg_local to execute at all. So
> is the slow path more likely to be triggered by the (!object),
> (!node_match) tests or by these same te
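
For readers following along, a reconstructed sketch of the fast-path tests
being asked about (helper and field names follow the slub.c of that period,
but this is a simplification, not the posted patch). Only when both tests
pass does cmpxchg_local() run; an empty cpu freelist or a node mismatch
diverts to __slab_alloc() without any cmpxchg at all:

	static __always_inline void *slab_alloc_sketch(struct kmem_cache *s,
			gfp_t gfpflags, int node, void *addr)
	{
		struct kmem_cache_cpu *c;
		void **object;

		preempt_disable();
		c = get_cpu_slab(s, smp_processor_id());
	redo:
		object = c->freelist;
		if (unlikely(!object || !node_match(c, node))) {
			/* slow path: no cmpxchg_local at all on this branch */
			object = __slab_alloc(s, gfpflags, node, addr, c);
		} else if (cmpxchg_local(&c->freelist, object,
					 object[c->offset]) != object) {
			/* an interrupt changed the freelist: retry */
			goto redo;
		}
		preempt_enable();
		return object;
	}
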
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
>
> > > The slow path would require disable preemption and two interrupt disables.
> > If the slow path has to call new_slab, then yes. But it seems that not
> > every slow path must call it, so for the
H. One wild idea would be to use a priority futex for the slab lock?
That would make the slow paths interrupt safe without requiring interrupt
disable? Does a futex fit into the page struct?
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
> > The slow path would require disable preemption and two interrupt disables.
> If the slow path has to call new_slab, then yes. But it seems that not
> every slow path must call it, so for the other slow paths, only one
> interrupt disable would be
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
>
> > > a clean solution source code wise. It also minimizes the interrupt
> > > holdoff
> > > for the non-cmpxchg_local arches. However, it means that we will have to
> > > disable interrupts twice f
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
> > a clean solution source code wise. It also minimizes the interrupt holdoff
> > for the non-cmpxchg_local arches. However, it means that we will have to
> > disable interrupts twice for the slow path. If that is too expensive then
> > we need a d
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> I think the simplest solution may be to leave slub as done in the patch
> that we developed last week. The arch must provide a cmpxchg_local that is
> performance wise the fastest possible. On x86 this is going to be the
> cmpxchg_local on others
I think the simplest solution may be to leave slub as done in the patch
that we developed last week. The arch must provide a cmpxchg_local that is
performance wise the fastest possible. On x86 this is going to be the
cmpxchg_local on others where cmpxchg is slower than interrupt
disable/enable
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
>
> > * Christoph Lameter ([EMAIL PROTECTED]) wrote:
> > > On Mon, 27 Aug 2007, Peter Zijlstra wrote:
> > >
> > > > So, if the fast path can be done with a preempt off, it might be doable
> > > > to suf
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
> * Christoph Lameter ([EMAIL PROTECTED]) wrote:
> > On Mon, 27 Aug 2007, Peter Zijlstra wrote:
> >
> > > So, if the fast path can be done with a preempt off, it might be doable
> > > to suffer the slow path with a per cpu lock like that.
> >
> > Sad
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Mon, 27 Aug 2007, Peter Zijlstra wrote:
>
> > So, if the fast path can be done with a preempt off, it might be doable
> > to suffer the slow path with a per cpu lock like that.
>
> Sadly the cmpxchg_local requires local per cpu data access. Isn't
On Mon, 27 Aug 2007, Peter Zijlstra wrote:
> So, if the fast path can be done with a preempt off, it might be doable
> to suffer the slow path with a per cpu lock like that.
Sadly the cmpxchg_local requires local per cpu data access. Isn't there
some way to make this less expensive on RT? Acessin
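
One reading of the per-cpu-lock idea above, purely as a sketch (the lock and
its placement are assumptions, not code from any posted patch): keep the
cmpxchg_local fast path under preempt-off, and let the slow path serialize
on a lock tied to the per-cpu structure instead of disabling interrupts:

	/* hypothetical: one lock per kmem_cache_cpu, taken only by the slow path */
	static void slow_path_under_local_lock(struct kmem_cache_cpu *c,
					       spinlock_t *lock)
	{
		spin_lock(lock);	/* on -rt this lock may sleep, so no
					 * interrupt-off section is required */
		/* refill or flush c->freelist here */
		spin_unlock(lock);
	}
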
On Tue, 2007-08-21 at 16:14 -0700, Christoph Lameter wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
>
> > - Changed smp_rmb() for barrier(). We are not interested in read order
> > across cpus, what we want is to be ordered wrt local interrupts only.
> > barrier() is much cheaper than
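
A minimal illustration of that argument (a generic example, not slub code):
against an interrupt handler on the same CPU only compiler reordering
matters, so a plain barrier() suffices where smp_rmb() would additionally,
and needlessly, order the loads against other CPUs.

	static int data, ready;

	/* runs in an interrupt handler on this CPU */
	static void producer(void)
	{
		data = 42;
		barrier();	/* keep the compiler from reordering the stores */
		ready = 1;
	}

	/* runs in task context on the same CPU */
	static int consumer(void)
	{
		if (ready) {
			barrier();	/* keep the compiler from hoisting the load;
					 * same-CPU ordering needs nothing stronger */
			return data;
		}
		return -1;
	}
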
Ok so we need this.
Fix up preempt checks.
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
mm/slub.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Wed, 22 Aug 2007, Mathieu Desnoyers wrote:
>
> > * Christoph Lameter ([EMAIL PROTECTED]) wrote:
> > > void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> > > @@ -1577,7 +1590,10 @@ static void __slab_free(struct kmem_cach
> > > {
> >
On Wed, 22 Aug 2007, Mathieu Desnoyers wrote:
> > Then the thread could be preempted and rescheduled on a different cpu
> > between put_cpu and local_irq_save() which means that we lose the
> > state information of the kmem_cache_cpu structure.
> >
>
> Maybe I am misunderstanding something, bu
On Wed, 22 Aug 2007, Mathieu Desnoyers wrote:
> * Christoph Lameter ([EMAIL PROTECTED]) wrote:
> > void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> > @@ -1577,7 +1590,10 @@ static void __slab_free(struct kmem_cach
> > {
> > void *prior;
> > void **object = (void *)x;
> > +
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> @@ -1577,7 +1590,10 @@ static void __slab_free(struct kmem_cach
> {
> void *prior;
> void **object = (void *)x;
> + unsigned long flags;
>
> + local_irq_save(flags
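
The hunk quoted above is cut off; the pattern it introduces is roughly the
following (a sketch of the shape only, not the full diff):

	static void __slab_free(struct kmem_cache *s, struct page *page,
				void *x, void *addr)
	{
		void *prior;
		void **object = (void *)x;
		unsigned long flags;

		local_irq_save(flags);		/* added: the caller no longer
						 * disables interrupts for us */
		slab_lock(page);
		/* ... original free slow path using prior/object ... */
		slab_unlock(page);
		local_irq_restore(flags);
	}
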
Here is the current cmpxchg_local version that I used for testing.
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
include/linux/slub_def.h | 10 +++---
mm/slub.c                | 74 ---
2 files changed, 56 insertions(+), 28 deletions(-)
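
The diff body is not reproduced here; for orientation, the per-cpu structure
this patch operates on looked roughly like the following in
include/linux/slub_def.h at the time (a sketch; field order and additional
fields may differ):

	struct kmem_cache_cpu {
		void **freelist;	/* what the cmpxchg_local fast path works on */
		struct page *page;	/* slab the freelist belongs to */
		int node;		/* node of the current slab page */
		unsigned int offset;	/* free-pointer offset inside each object,
					 * in word units */
		unsigned int objsize;	/* object size without metadata */
	};
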
I can confirm Mathieu's measurement now:
Athlon64:
regular NUMA/discontig
1. Kmalloc: Repeatedly allocate then free test
1 times kmalloc(8) -> 79 cycles kfree -> 92 cycles
1 times kmalloc(16) -> 79 cycles kfree -> 93 cycles
1 times kmalloc(32) -> 88 cycles kfree -> 95 cycles
1 ti
Measurements on a AMD64 2.0 GHz dual-core
In this test, we seem to remove about 10 cycles from the kmalloc fast path.
On small allocations, that gives a 14% performance increase. The kfree fast
path also seems to show a 10-cycle improvement.
1. Kmalloc: Repeatedly allocate then free test
* cmpxchg_local sl
On Wed, Aug 22, 2007 at 09:45:33AM -0400, Mathieu Desnoyers wrote:
> Measurements on a AMD64 2.0 GHz dual-core
>
> In this test, we seem to remove 10 cycles from the kmalloc fast path.
> On small allocations, it gives a 14% performance increase. kfree fast
> path also seems to have a 10 cycles imp
On Tue, Aug 21, 2007 at 06:06:19PM -0700, Christoph Lameter wrote:
> Ok. Measurements vs. simple cmpxchg on a Intel(R) Pentium(R) 4 CPU 3.20GHz
Note the P4 is an extreme case in that "unusual" instructions are
quite slow (basically anything that falls out of the trace cache). Core2
tends to be mu
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
>
> > As I am going back through the initial cmpxchg_local implementation, it
> > seems like it was executing __slab_alloc() with preemption disabled,
> > which is wrong. new_slab() is not designed for t
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> Ok. Measurements vs. simple cmpxchg on a Intel(R) Pentium(R) 4 CPU 3.20GHz
> (hyperthreading enabled). Test run with your module show only minor
> performance improvements and lots of regressions. So we must have
> cmpxchg_local to see any improve
Ok. Measurements vs. simple cmpxchg on a Intel(R) Pentium(R) 4 CPU 3.20GHz
(hyperthreading enabled). Test run with your module show only minor
performance improvements and lots of regressions. So we must have
cmpxchg_local to see any improvements? Some kind of a recent optimization
of cmpxchg p
* Andi Kleen ([EMAIL PROTECTED]) wrote:
> Mathieu Desnoyers <[EMAIL PROTECTED]> writes:
> >
> > The measurements I get (in cycles):
> >                enable interrupts (STI)  disable interrupts (CLI)  local CMPXCHG
> > IA32 (P4)      112                      82                        26
> >
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> As I am going back through the initial cmpxchg_local implementation, it
> seems like it was executing __slab_alloc() with preemption disabled,
> which is wrong. new_slab() is not designed for that.
The version I sent you did not use preemption.
We
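
The point about new_slab() is that it goes to the page allocator and may
sleep; a sketch of the consequence for the slow path (an assumption about
the intended shape, not a quoted hunk):

	static struct page *grow_slab_sketch(struct kmem_cache *s,
					     gfp_t gfpflags, int node)
	{
		struct page *page;

		preempt_enable();			/* or local_irq_restore()   */
		page = new_slab(s, gfpflags, node);	/* may sleep for __GFP_WAIT */
		preempt_disable();			/* re-enter the atomic section;
							 * the per-cpu state must be
							 * re-checked afterwards    */
		return page;
	}
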
Mathieu Desnoyers <[EMAIL PROTECTED]> writes:
>
> The measurements I get (in cycles):
>                enable interrupts (STI)  disable interrupts (CLI)  local CMPXCHG
> IA32 (P4)      112                      82                        26
> x86_64 AMD64   125                      102
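
Such per-primitive numbers can be taken with a loop like the following (a
sketch only; the loop count is an assumption, and get_cycles_sync() is the
better choice on x86_64 as noted elsewhere in the thread):

	#define LOOPS	10000

	static void time_primitives(void)
	{
		unsigned long flags, var = 0;
		cycles_t t0, t1;
		int i;

		local_irq_save(flags);

		t0 = get_cycles();
		for (i = 0; i < LOOPS; i++)
			local_irq_enable();			/* STI */
		t1 = get_cycles();
		printk(KERN_INFO "STI: %llu cycles\n",
		       (unsigned long long)(t1 - t0) / LOOPS);

		t0 = get_cycles();
		for (i = 0; i < LOOPS; i++)
			local_irq_disable();			/* CLI */
		t1 = get_cycles();
		printk(KERN_INFO "CLI: %llu cycles\n",
		       (unsigned long long)(t1 - t0) / LOOPS);

		t0 = get_cycles();
		for (i = 0; i < LOOPS; i++)
			cmpxchg_local(&var, 0, 0);		/* local CMPXCHG */
		t1 = get_cycles();
		printk(KERN_INFO "cmpxchg_local: %llu cycles\n",
		       (unsigned long long)(t1 - t0) / LOOPS);

		local_irq_restore(flags);
	}
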
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
>
> > - Rounding error.. you seem to round at 0.1ms, but I keep the values in
> > cycles. The times that you get (1.1ms) seem strangely higher than
> > mine, which are under 1000 cycles on a 3GHz sy
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> - Rounding error.. you seem to round at 0.1ms, but I keep the values in
> cycles. The times that you get (1.1ms) seem strangely higher than
> mine, which are under 1000 cycles on a 3GHz system (less than 333ns).
> I guess there is both a ms -
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
>
> > Are you running a UP or SMP kernel ? If you run a UP kernel, the
> > cmpxchg_local and cmpxchg are identical.
>
> UP.
>
> > Oh, and if you run your tests at boot time, the alternatives code may
>
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> Are you running a UP or SMP kernel ? If you run a UP kernel, the
> cmpxchg_local and cmpxchg are identical.
UP.
> Oh, and if you run your tests at boot time, the alternatives code may
> have removed the lock prefix, therefore making cmpxchg and cmp
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
>
> > Using cmpxchg_local vs cmpxchg has a clear impact on the fast paths, as
> > shown below: it saves about 60 to 70 cycles for kmalloc and 200 cycles
> > for the kmalloc/kfree pair (test 2).
>
> Hmmm
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
> * Christoph Lameter ([EMAIL PROTECTED]) wrote:
> > On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> >
> > > - Changed smp_rmb() for barrier(). We are not interested in read order
> > > across cpus, what we want is to be ordered wrt local interrupts
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> Using cmpxchg_local vs cmpxchg has a clear impact on the fast paths, as
> shown below: it saves about 60 to 70 cycles for kmalloc and 200 cycles
> for the kmalloc/kfree pair (test 2).
H.. I wonder if the AMD processors simply do the same in eith
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> kmalloc(8)/kfree = 112 cycles
> kmalloc(16)/kfree = 103 cycles
> kmalloc(32)/kfree = 103 cycles
> kmalloc(64)/kfree = 103 cycles
> kmalloc(128)/kfree = 112 cycles
> kmalloc(256)/kfree = 111 cycles
> kmalloc(512)/kfree = 111 cycles
> kmalloc(1024)/kfr
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
>
> > SLUB Use cmpxchg() everywhere.
> >
> > It applies to "SLUB: Single atomic instruction alloc/free using
> > cmpxchg".
>
> > +++ slab/mm/slub.c 2007-08-20 18:42:28.0 -0400
> > @@ -1682,7 +
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> * cmpxchg_local Slub test
> kmalloc(8) = 83 cycles      kfree = 363 cycles
> kmalloc(16) = 85 cycles     kfree = 372 cycles
> kmalloc(32) = 92 cycles     kfree = 377 cycles
> kmalloc(64) = 115 cycles    kfree = 397 cycles
> kmal
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
>
> > - Changed smp_rmb() for barrier(). We are not interested in read order
> > across cpus, what we want is to be ordered wrt local interrupts only.
> > barrier() is much cheaper than a rmb().
>
>
Reformatting...
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
> Hi Christoph,
>
> If you are interested in the raw numbers:
>
> The (very basic) test module follows. Make sure you change get_cycles()
> for get_cycles_sync() if you plan to run this on x86_64.
>
> (tests taken on a 3GHz Pentium
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> SLUB Use cmpxchg() everywhere.
>
> It applies to "SLUB: Single atomic instruction alloc/free using
> cmpxchg".
> +++ slab/mm/slub.c	2007-08-20 18:42:28.0 -0400
> @@ -1682,7 +1682,7 @@ redo:
>
> object[c->offset] = freelist;
>
>
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> - Changed smp_rmb() for barrier(). We are not interested in read order
> across cpus, what we want is to be ordered wrt local interrupts only.
> barrier() is much cheaper than a rmb().
But this means a preempt disable is required. RT users do no
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
>
> > - Fixed an erroneous test in slab_free() (logic was flipped from the
> > original code when testing for slow path. It explains the wrong
> > numbers you have with big free).
>
> If you look
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> Therefore, in the test where we have separate passes for slub allocation
> and free, we mostly hit the slow path. Any particular reason for that?
Maybe on SMP you are scheduled to run on a different processor? Note that
I ran my tests at early boot
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> If you are interested in the raw numbers:
>
> The (very basic) test module follows. Make sure you change get_cycles()
> for get_cycles_sync() if you plan to run this on x86_64.
Which test is which? Would you be able to format this in a way that we
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> - Fixed an erroneous test in slab_free() (logic was flipped from the
> original code when testing for slow path. It explains the wrong
> numbers you have with big free).
If you look at the numbers that I posted earlier then you will see that
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
> Ok, I played with your patch a bit, and the results are quite
> interesting:
>
...
> Summary:
>
> (tests repeated 1 times on a 3GHz Pentium 4)
> (kernel DEBUG menuconfig options are turned off)
> results are in cycles per iteration
> I did 2 ru
Hi Christoph,
If you are interested in the raw numbers:
The (very basic) test module follows. Make sure you change get_cycles()
for get_cycles_sync() if you plan to run this on x86_64.
(tests taken on a 3GHz Pentium 4)
* slub HEAD, test 1
[ 99.774699] SLUB Performance testing
[ 99.785431]
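
The module itself is not reproduced in this digest; in outline it does
something like the following (sizes, loop count and names are assumptions,
the real module covers more sizes and test variants):

	#include <linux/module.h>
	#include <linux/slab.h>
	#include <linux/timex.h>	/* cycles_t, get_cycles() */

	#define TEST_COUNT	10000

	static void *objs[TEST_COUNT];

	static int __init slub_test_init(void)
	{
		cycles_t t0, t1;
		int i;

		printk(KERN_INFO "SLUB Performance testing\n");

		t0 = get_cycles();	/* use get_cycles_sync() on x86_64 */
		for (i = 0; i < TEST_COUNT; i++)
			objs[i] = kmalloc(8, GFP_KERNEL);
		t1 = get_cycles();
		printk(KERN_INFO "%d times kmalloc(8) -> %llu cycles\n",
		       TEST_COUNT, (unsigned long long)(t1 - t0) / TEST_COUNT);

		t0 = get_cycles();
		for (i = 0; i < TEST_COUNT; i++)
			kfree(objs[i]);
		t1 = get_cycles();
		printk(KERN_INFO "%d times kfree -> %llu cycles\n",
		       TEST_COUNT, (unsigned long long)(t1 - t0) / TEST_COUNT);

		return 0;
	}
	module_init(slub_test_init);
	MODULE_LICENSE("GPL");
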
Ok, I played with your patch a bit, and the results are quite
interesting:
SLUB use cmpxchg_local
my changes:
- Fixed an erroneous test in slab_free() (logic was flipped from the
original code when testing for slow path. It explains the wrong
numbers you have with big free).
- Use cmpxchg_l
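
To make the "flipped test" concrete, a sketch of the slab_free() fast/slow
split in the cmpxchg_local variant (simplified; names follow the thread and
the exact condition is an assumption). With the condition inverted, every
kfree() drops into __slab_free(), which is consistent with the inflated
free numbers quoted earlier:

	static __always_inline void slab_free_sketch(struct kmem_cache *s,
			struct page *page, void *x, void *addr)
	{
		void **object = (void *)x;
		struct kmem_cache_cpu *c;
		void **freelist;

		preempt_disable();
		c = get_cpu_slab(s, smp_processor_id());
		if (likely(page == c->page && c->node >= 0)) {	/* fast path test */
			do {
				freelist = c->freelist;
				object[c->offset] = freelist;	/* link object in */
			} while (cmpxchg_local(&c->freelist, freelist, object)
					!= freelist);
		} else {
			__slab_free(s, page, x, addr);		/* slow path */
		}
		preempt_enable();
	}
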