Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance

Mathieu Desnoyers Tue, 10 Jul 2007 01:28:44 -0700

* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> Ok here is a replacement patch for the cmpxchg patch. Problems
> 
> 1. cmpxchg_local is not available on all arches. If we wanted to do
>    this then it needs to be universally available.
>


cmpxchg_local is not available on all archs, but local_cmpxchg is. It
expects a local_t type, which is nothing more than a long. When a local
atomic operation is not more efficient, or is not implemented, on a given
architecture, asm-generic/local.h falls back on atomic_long_t. If you
want, you could work on the local_t type and cast its long value to a
pointer where needed, since the two sizes are, AFAIK, always the same
(and some VM code even assumes this is always the case).
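
To illustrate the size argument, here is a rough userspace sketch. It is
not the kernel local_t API: word_t and word_cmpxchg_ptr() are hypothetical
names, and the GCC/Clang builtin __sync_val_compare_and_swap stands in for
local_cmpxchg (the builtin is a fully atomic, LOCKed operation on x86, so
it only shows the long <-> pointer casting, not the performance).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for local_t: one long-sized word. */
typedef struct { long v; } word_t;

/* The cast below is only valid if a pointer fits exactly in a long. */
_Static_assert(sizeof(long) == sizeof(void *),
               "long and pointer sizes differ on this platform");

/*
 * Compare-and-exchange a pointer stored in a long-sized word.
 * Returns the previous content; the swap happened only if the
 * return value equals 'old'.
 */
static void *word_cmpxchg_ptr(word_t *w, void *old, void *new)
{
        long prev = __sync_val_compare_and_swap(&w->v, (long)old, (long)new);

        return (void *)prev;
}
```

A caller checks the return value against the expected old pointer to see
whether the exchange took place.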

> 2. cmpxchg_local does generate the "lock" prefix. It should not do that.
>    Without fixes to cmpxchg_local we cannot expect maximum performance.
> 

Yup, see the patch I just posted for this.

> 3. The approach is x86 centric. It relies on a cmpxchg that does not
>    synchronize with memory used by other cpus and therefore is more
>    lightweight. As far as I know the IA64 cmpxchg cannot do that.
>    Neither several other processors. I am not sure how cmpxchgless
>    platforms would use that. We need a detailed comparison of
>    interrupt enable /disable vs. cmpxchg cycle counts for cachelines in
>    the cpu cache to evaluate the impact that such a change would have.
> 
>    The cmpxchg (or its emulation) does not need any barriers since the
>    accesses can only come from a single processor. 
> 

Yes, the expected improvements are as follows:
x86, x86_64: much faster due to non-LOCKed cmpxchg
alpha: should be faster due to memory barrier removal
mips: memory barriers removed
powerpc 32/64: memory barriers removed

On other architectures, either there is no better implementation than
the standard atomic cmpxchg or it just has not been implemented.

I guess a test series would tell us whether this is an interesting way
to go. It should show how much improvement is seen on the optimized
architectures (local cmpxchg vs. interrupt enable/disable), and how the
standard cmpxchg compares to interrupt disable/enable on the
architectures where we can't do better than the standard cmpxchg. I
would be happy to run these tests, but I don't have the hardware handy.
I provide a test module in this email to collect these numbers on
various architectures.

> Mathieu measured a significant performance benefit coming from not using
> interrupt enable / disable.
> 
> Some rough processor cycle counts (anyone have better numbers?)
> 
>       STI     CLI     CMPXCHG
> IA32  36      26      1 (assume XCHG == CMPXCHG, sti/cli also need stack
>                       pushes/pulls)
> IA64  12      12      1 (but ar.ccv needs 11 cycles to set comparator,
>                       need register moves to preserve processors flags)
> 

The measurements I get (in cycles):

               enable interrupts (STI)   disable interrupts (CLI)   local CMPXCHG
IA32 (P4)      112                       82                         26
x86_64 AMD64   125                       102                        19
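
As a worked reading of these numbers, the sketch below adds up one
interrupt disable plus one enable and compares the sum with a single local
cmpxchg. It assumes a fast path pays exactly one CLI and one STI versus
exactly one cmpxchg, which is a simplification; the struct and function
names are made up for the illustration.

```c
#include <assert.h>

/* Cycle counts taken from the table above. */
struct cycles {
        unsigned int sti;            /* enable interrupts */
        unsigned int cli;            /* disable interrupts */
        unsigned int cmpxchg_local;  /* non-LOCKed cmpxchg */
};

static const struct cycles p4    = { 112,  82, 26 };
static const struct cycles amd64 = { 125, 102, 19 };

/* Cost of one disable/enable pair around a critical section. */
static unsigned int irq_pair_cost(const struct cycles *c)
{
        return c->sti + c->cli;
}

/* Cycles saved per operation if the pair is replaced by one cmpxchg. */
static unsigned int cycles_saved(const struct cycles *c)
{
        return irq_pair_cost(c) - c->cmpxchg_local;
}
```

Under that assumption the P4 goes from 194 cycles to 26 (168 saved) and
the AMD64 from 227 cycles to 19 (208 saved) per operation.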

> Looks like STI/CLI is pretty expensive and it seems that we may be able to
> optimize the alloc / free hotpath quite a bit if we could drop the 
> interrupt enable / disable. But we need some measurements.
> 
> 
> Draft of a new patch:
> 
> SLUB: Single atomic instruction alloc/free using cmpxchg_local
> 
> A cmpxchg allows us to avoid disabling and enabling interrupts. The cmpxchg
> is optimal to allow operations on per cpu freelist. We can stay on one
> processor by disabling preemption() and allowing concurrent interrupts
> thus avoiding the overhead of disabling and enabling interrupts.
> 
> Pro:
>       - No need to disable interrupts.
>       - Preempt disable /enable vanishes on non preempt kernels
> Con:
>       - Slightly more complex handling.
>       - Updates to atomic instructions needed
> 
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
> 
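
For readers who want to see the shape of the idea, here is a minimal
userspace sketch of a freelist whose alloc and free each use one
compare-and-exchange in a retry loop. It is not the SLUB patch: the struct
and function names are hypothetical, and __sync_val_compare_and_swap (a
GCC/Clang builtin, fully atomic and LOCKed on x86) stands in for
cmpxchg_local. The kernel version would run with preemption disabled so
that only interrupts on the same cpu can race with it. ABA hazards are
ignored here.

```c
#include <assert.h>
#include <stddef.h>

/* A free object stores the link to the next free object in itself. */
struct object {
        struct object *next;
};

/* Hypothetical per-cpu freelist: just a head pointer. */
struct cpu_freelist {
        struct object *head;
};

/* Pop the first free object, or return NULL when the list is empty. */
static struct object *freelist_alloc(struct cpu_freelist *fl)
{
        struct object *old, *next;

        do {
                old = fl->head;
                if (!old)
                        return NULL;    /* caller would take a slow path */
                next = old->next;
                /* Retry if the head changed between the read and the swap. */
        } while (__sync_val_compare_and_swap(&fl->head, old, next) != old);

        return old;
}

/* Push an object back onto the freelist. */
static void freelist_free(struct cpu_freelist *fl, struct object *obj)
{
        struct object *old;

        do {
                old = fl->head;
                obj->next = old;
        } while (__sync_val_compare_and_swap(&fl->head, old, obj) != old);
}
```

The retry loop is what replaces the interrupt disable/enable pair: if an
interrupt handler touches the list in the middle, the exchange fails and
the operation simply starts over.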

Test local cmpxchg vs int disable/enable. Please run on a 2.6.22 kernel
(or recent 2.6.21-rcX-mmX) (with my cmpxchg local fix patch for x86_64).
Make sure the TSC reads (get_cycles()) are reliable on your platform.

Mathieu

/* test-cmpxchg-nolock.c
 *
 * Compare local cmpxchg with irq disable / enable.
 */

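#include <linux/kernel.h>       /* printk() */
#include <linux/jiffies.h>
#include <linux/compiler.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/preempt.h>      /* preempt_disable(), preempt_enable() */
#include <linux/calc64.h>
#include <asm/timex.h>
#include <asm/system.h>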

#define NR_LOOPS 20000

static int test_val;    /* target of the cmpxchg; always stays 0 */

static void do_test_cmpxchg(void)
{
        int ret;
        unsigned long flags;    /* local_irq_save() expects unsigned long */
        unsigned int i;
        cycles_t time1, time2, time;
        long rem;

        local_irq_save(flags);
        preempt_disable();
        time1 = get_cycles();
        for (i = 0; i < NR_LOOPS; i++) {
                ret = cmpxchg_local(&test_val, 0, 0);
        }
        time2 = get_cycles();
        local_irq_restore(flags);
        preempt_enable();
        time = time2 - time1;

        printk(KERN_ALERT "test results: time for non locked cmpxchg\n");
        printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
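        /* cycles_t differs between architectures; cast to match %llu. */
        printk(KERN_ALERT "total time: %llu\n", (unsigned long long)time);
        time = div_long_long_rem(time, NR_LOOPS, &rem);
        printk(KERN_ALERT "-> non locked cmpxchg takes %llu cycles\n",
                                        (unsigned long long)time);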
        printk(KERN_ALERT "test end\n");
}

/*
 * This test will have a higher standard deviation due to incoming interrupts.
 */
static void do_test_enable_int(void)
{
        unsigned long flags;    /* local_irq_save() expects unsigned long */
        unsigned int i;
        cycles_t time1, time2, time;
        long rem;

        local_irq_save(flags);
        preempt_disable();
        time1 = get_cycles();
        for (i = 0; i < NR_LOOPS; i++) {
                local_irq_restore(flags);
        }
        time2 = get_cycles();
        local_irq_restore(flags);
        preempt_enable();
        time = time2 - time1;

        printk(KERN_ALERT "test results: time for enabling interrupts (STI)\n");
        printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
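        /* cycles_t differs between architectures; cast to match %llu. */
        printk(KERN_ALERT "total time: %llu\n", (unsigned long long)time);
        time = div_long_long_rem(time, NR_LOOPS, &rem);
        printk(KERN_ALERT "-> enabling interrupts (STI) takes %llu cycles\n",
                                        (unsigned long long)time);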
        printk(KERN_ALERT "test end\n");
}

static void do_test_disable_int(void)
{
        unsigned long flags, flags2;
        unsigned int i;
        cycles_t time1, time2, time;
        long rem;

        local_irq_save(flags);
        preempt_disable();
        time1 = get_cycles();
        for (i = 0; i < NR_LOOPS; i++) {
                local_irq_save(flags2);
        }
        time2 = get_cycles();
        local_irq_restore(flags);
        preempt_enable();
        time = time2 - time1;

        printk(KERN_ALERT "test results: time for disabling interrupts (CLI)\n");
        printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
        /* cycles_t differs between architectures; cast to match %llu. */
        printk(KERN_ALERT "total time: %llu\n", (unsigned long long)time);
        time = div_long_long_rem(time, NR_LOOPS, &rem);
        printk(KERN_ALERT "-> disabling interrupts (CLI) takes %llu cycles\n",
                                (unsigned long long)time);
        printk(KERN_ALERT "test end\n");
}



static int ltt_test_init(void)
{
        printk(KERN_ALERT "test init\n");
        
        do_test_cmpxchg();
        do_test_enable_int();
        do_test_disable_int();
        return -EAGAIN; /* Returning an error unloads the module right away */
}

static void ltt_test_exit(void)
{
        printk(KERN_ALERT "test exit\n");
}

module_init(ltt_test_init);
module_exit(ltt_test_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Mathieu Desnoyers");
MODULE_DESCRIPTION("Cmpxchg local test");

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68