* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> Ok here is a replacement patch for the cmpxchg patch. Problems
>
> 1. cmpxchg_local is not available on all arches. If we wanted to do
>    this then it needs to be universally available.
>

cmpxchg_local is not available on all archs, but local_cmpxchg is. It
expects a local_t type, which is nothing more than a long. When the local
atomic operation is not more efficient or not implemented on a given
architecture, asm-generic/local.h falls back on atomic_long_t. If you
want, you could work on the local_t type, which you could cast from a
long to a pointer when needed, since their sizes are, AFAIK, always the
same (and some VM code even assumes this is always the case).

> 2. cmpxchg_local does generate the "lock" prefix. It should not do that.
>    Without fixes to cmpxchg_local we cannot expect maximum performance.
>

Yup, see the patch I just posted for this.

> 3. The approach is x86 centric. It relies on a cmpxchg that does not
>    synchronize with memory used by other cpus and therefore is more
>    lightweight. As far as I know the IA64 cmpxchg cannot do that.
>    Neither several other processors. I am not sure how cmpxchgless
>    platforms would use that. We need a detailed comparison of
>    interrupt enable /disable vs. cmpxchg cycle counts for cachelines in
>    the cpu cache to evaluate the impact that such a change would have.
>
>    The cmpxchg (or its emulation) does not need any barriers since the
>    accesses can only come from a single processor.
>

Yes, the expected improvements are as follows:

x86, x86_64:    much faster due to non-LOCKed cmpxchg
alpha:          should be faster due to memory barrier removal
mips:           memory barriers removed
powerpc 32/64:  memory barriers removed

On other architectures, either there is no better implementation than the
standard atomic cmpxchg or it just has not been implemented.

I guess a test series would tell us whether this is an interesting way to
go. It should show how much improvement is seen on the optimized
architectures (local cmpxchg vs. interrupt enable/disable), and how the
standard cmpxchg compares to interrupt disable/enable on the
architectures where we can't do better than the standard cmpxchg.
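To make the local_t suggestion above a bit more concrete, here is a rough
userspace sketch, not kernel code and not the SLUB patch. It keeps a
freelist head in a long-sized slot (standing in for local_t), casts it to
and from a pointer, and updates it with a compare-and-exchange using
relaxed ordering, since only one CPU is assumed to touch the slot. The
names free_obj, local_slot_t, freelist_push and freelist_pop are made up
for illustration. C11 atomics are used only as a stand-in: on x86 the
compiler still emits a LOCK-prefixed cmpxchg here, so this shows the
retry logic and the cast, not the performance gain.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumption taken from the discussion: a long can hold a pointer. */
static_assert(sizeof(long) == sizeof(void *), "long must be pointer-sized");

/* Hypothetical free object: its first word links to the next free object. */
struct free_obj {
	struct free_obj *next;
};

/* Userspace stand-in for local_t: a long-sized slot holding the list head. */
typedef atomic_long local_slot_t;

/* Push obj on the list; retry if the head changed under us. */
static void freelist_push(local_slot_t *head, struct free_obj *obj)
{
	long old = atomic_load_explicit(head, memory_order_relaxed);

	do {
		obj->next = (struct free_obj *)old;
	} while (!atomic_compare_exchange_weak_explicit(head, &old, (long)obj,
			memory_order_relaxed, memory_order_relaxed));
}

/*
 * Pop the first object, or return NULL if the list is empty.
 * This sketch makes no attempt to handle an object being reused
 * between the read of the head and the exchange.
 */
static struct free_obj *freelist_pop(local_slot_t *head)
{
	long old = atomic_load_explicit(head, memory_order_relaxed);
	struct free_obj *obj;

	do {
		obj = (struct free_obj *)old;
		if (!obj)
			return NULL;
	} while (!atomic_compare_exchange_weak_explicit(head, &old,
			(long)obj->next,
			memory_order_relaxed, memory_order_relaxed));

	return obj;
}
```

In the kernel, the equivalent operation on a local_t would be
local_cmpxchg, with preemption disabled around it so that the code stays
on one processor.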
I would be happy to do these tests, but I don't have the hardware handy.
I am including a test module in this email to collect these numbers on
various architectures.

> Mathieu measured a significant performance benefit coming from not using
> interrupt enable / disable.
>
> Some rough processor cycle counts (anyone have better numbers?)
>
>         STI     CLI     CMPXCHG
> IA32    36      26      1 (assume XCHG == CMPXCHG, sti/cli also need stack
>                           pushes/pulls)
> IA64    12      12      1 (but ar.ccv needs 11 cycles to set comparator,
>                           need register moves to preserve processors flags)
>

The measurements I get (in cycles):

                enable interrupts   disable interrupts   local CMPXCHG
                (STI)               (CLI)
IA32 (P4)       112                 82                   26
x86_64 AMD64    125                 102                  19

> Looks like STI/CLI is pretty expensive and it seems that we may be able to
> optimize the alloc / free hotpath quite a bit if we could drop the
> interrupt enable / disable. But we need some measurements.
>
>
> Draft of a new patch:
>
> SLUB: Single atomic instruction alloc/free using cmpxchg_local
>
> A cmpxchg allows us to avoid disabling and enabling interrupts. The cmpxchg
> is optimal to allow operations on per cpu freelist. We can stay on one
> processor by disabling preemption() and allowing concurrent interrupts
> thus avoiding the overhead of disabling and enabling interrupts.
>
> Pro:
>       - No need to disable interrupts.
>       - Preempt disable /enable vanishes on non preempt kernels
> Con:
>       - Slightly complexer handling.
>       - Updates to atomic instructions needed
>
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
>

Test local cmpxchg vs int disable/enable. Please run on a 2.6.22 kernel
(or recent 2.6.21-rcX-mmX) (with my cmpxchg local fix patch for x86_64).
Make sure the TSC reads (get_cycles()) are reliable on your platform.

Mathieu

/* test-cmpxchg-nolock.c
 *
 * Compare local cmpxchg with irq disable / enable.
 */

#include <linux/jiffies.h>
#include <linux/compiler.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/calc64.h>
#include <asm/timex.h>
#include <asm/system.h>

#define NR_LOOPS 20000

int test_val = 0;

/* Time NR_LOOPS cmpxchg_local() calls with interrupts and preemption off. */
static void do_test_cmpxchg(void)
{
	int ret;
	unsigned long flags;	/* local_irq_save() expects unsigned long */
	unsigned int i;
	cycles_t time1, time2, time;
	long rem;

	local_irq_save(flags);
	preempt_disable();
	time1 = get_cycles();
	for (i = 0; i < NR_LOOPS; i++) {
		ret = cmpxchg_local(&test_val, 0, 0);
	}
	time2 = get_cycles();
	local_irq_restore(flags);
	preempt_enable();
	time = time2 - time1;

	/* cycles_t differs between architectures, so cast it for %llu. */
	printk(KERN_ALERT "test results: time for non locked cmpxchg\n");
	printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
	printk(KERN_ALERT "total time: %llu\n", (unsigned long long)time);
	time = div_long_long_rem(time, NR_LOOPS, &rem);
	printk(KERN_ALERT "-> non locked cmpxchg takes %llu cycles\n",
		(unsigned long long)time);
	printk(KERN_ALERT "test end\n");
}

/*
 * This test will have a higher standard deviation due to incoming interrupts.
 */
static void do_test_enable_int(void)
{
	unsigned long flags;
	unsigned int i;
	cycles_t time1, time2, time;
	long rem;

	local_irq_save(flags);
	preempt_disable();
	time1 = get_cycles();
	for (i = 0; i < NR_LOOPS; i++) {
		local_irq_restore(flags);
	}
	time2 = get_cycles();
	local_irq_restore(flags);
	preempt_enable();
	time = time2 - time1;

	printk(KERN_ALERT "test results: time for enabling interrupts (STI)\n");
	printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
	printk(KERN_ALERT "total time: %llu\n", (unsigned long long)time);
	time = div_long_long_rem(time, NR_LOOPS, &rem);
	printk(KERN_ALERT "-> enabling interrupts (STI) takes %llu cycles\n",
		(unsigned long long)time);
	printk(KERN_ALERT "test end\n");
}

/* Interrupts are already off inside the loop; this times the save/disable. */
static void do_test_disable_int(void)
{
	unsigned long flags, flags2;
	unsigned int i;
	cycles_t time1, time2, time;
	long rem;

	local_irq_save(flags);
	preempt_disable();
	time1 = get_cycles();
	for (i = 0; i < NR_LOOPS; i++) {
		local_irq_save(flags2);
	}
	time2 = get_cycles();
	local_irq_restore(flags);
	preempt_enable();
	time = time2 - time1;

	printk(KERN_ALERT "test results: time for disabling interrupts (CLI)\n");
	printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
	printk(KERN_ALERT "total time: %llu\n", (unsigned long long)time);
	time = div_long_long_rem(time, NR_LOOPS, &rem);
	printk(KERN_ALERT "-> disabling interrupts (CLI) takes %llu cycles\n",
		(unsigned long long)time);
	printk(KERN_ALERT "test end\n");
}

static int ltt_test_init(void)
{
	printk(KERN_ALERT "test init\n");

	do_test_cmpxchg();
	do_test_enable_int();
	do_test_disable_int();
	return -EAGAIN; /* Fail will directly unload the module */
}

static void ltt_test_exit(void)
{
	printk(KERN_ALERT "test exit\n");
}

module_init(ltt_test_init);
module_exit(ltt_test_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Mathieu Desnoyers");
MODULE_DESCRIPTION("Cmpxchg local test");

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140  715F BA06 3F25 A8FE 3BAE 9A68