On Thu, Jun 12, 2014 at 06:22:11AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2014-06-11 at 14:37 -0500, Christoph Lameter wrote:
> > Looking at arch/powerpc/include/asm/percpu.h I see that the per cpu offset
> > comes from a local_paca field and local_paca is in r13. That means that
> > for all percpu operations we first have to determine the address through a
> > memory access.
> >
> > Would it be possible to put the paca at the beginning of the percpu data
> > area and then have r31 point to the percpu area?
> >
> > power has these nice instructions that fetch from an offset relative to a
> > base register which could be used throughout for percpu operations in the
> > kernel (similar to x86 segment registers).
> >
> > With that we may also be able to use the atomic ops for fast percpu access
> > so that we can avoid the irq enable/disable sequence that is now required
> > for percpu atomics. Would result in fast and reliable percpu
> > counters for powerpc.
>
> So.... this is complicated :) And it's something I did want to tackle
> for a while but haven't had a chance.
>
> The issues off the top of my head are:
>
>  - The PACA must be accessible in real mode, which means that when
> running under a hypervisor, it must be allocated in the "RMA" which is
> the low part of memory up to a limit that depends on the hypervisor, but
> can be as low as 128M on some older machines.
>
>  - However, we use percpu more than paca in normal kernel C code, the
> PACA is mostly used during exception entry/exit, KVM, and for interrupt
> soft-enable/disable. So it might make sense to change things so that r13
> contains the per-cpu offset instead. However, doing that change and
> updating the asm to cope isn't a trivial undertaking.
>
>  - Direct offset from r13 in asm ... works as long as the offset is
> within the signed 32k range. Otherwise we need at least one more addis
> instruction. Anton mentioned the linker may have some smarts however for
> removing that addis if the high part of the offset happens to be 0.
>
>  - For atomics, the jury is still out as to whether it would be faster
> or not. The atomic ops (lwarx/stwcx.) are expensive. They flush the
> value out of the L1 (to L2) among others. On the other hand we have
> interrupts soft-disable so masking interrupts isn't very expensive.
> Unmasking, while cheap, is currently out of line however. I have been
> wondering if we could move some of the soft-irq state instead to a CR
> field and mark that -ffixed with gcc so we can make irq
> soft-disable/enable even faster and more in-line.

Actually, from gcc/config/rs6000.h:

/* 1 for registers that have pervasive standard uses
   and are not available for the register allocator.

   On RS/6000, r1 is used for the stack.  On Darwin, r2 is available
   as a local register; for all other OS's r2 is the TOC pointer.

   cr5 is not supposed to be used.

   On System V implementations, r13 is fixed and not available for use.  */

#define FIXED_REGISTERS  \
  {0, 1, FIXED_R2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, FIXED_R13, 0, 0, \
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
   0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,          \
   /* AltiVec registers.  */                       \
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
   1, 1                                            \
   , 1, 1, 1, 1, 1, 1                              \
}

So cr5, which is number 73, is never used by gcc. Disassembling a few
kernels seems to confirm this.

This gives you 4 booleans to play with...

	Gabriel