On Mon, 2015-08-17 at 19:16 +0800, Kevin Hao wrote:
> On Fri, Aug 14, 2015 at 09:44:28PM -0500, Scott Wood wrote:
> > I tried a couple different benchmarks and didn't find a significant
> > difference, relative to the variability of the results running on the
> > same kernel.  A patch that claims to "optimize a bit" as its main
> > purpose ought to show some results. :-)
> 
> I tried to compare the execution time of these two code sequences with
> the following test module:
> 
> #include <linux/module.h>
> #include <linux/kernel.h>
> #include <linux/printk.h>
> 
> static void test1(void)
> {
> 	int i;
> 	unsigned char lock, c;
> 	unsigned short cpu, s;
> 
> 	for (i = 0; i < 100000; i++) {
> 		lock = 0;
> 		cpu = 1;
> 
> 		asm volatile (
> "1:		lbarx	%0,0,%2\n\
> 		lhz	%1,0(%3)\n\
> 		cmpdi	%0,0\n\
> 		cmpdi	cr1,%1,1\n\
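(For reference, the crclr variant that comes up below would replace that
final compare. This is a sketch only: the rest of Kevin's loop is trimmed
from the quote, and the operands simply mirror the snippet above.

1:	lbarx	%0,0,%2		/* load-and-reserve the lock byte */
	lhz	%1,0(%3)	/* load the owner cpu */
	cmpdi	%0,0		/* lock already taken? */
	crclr	cr1*4+eq	/* unconditionally clear cr1.eq */

In the non-recursive path the second cmpdi presumably exists only to
leave cr1.eq clear, and crclr states that directly rather than encoding
it as a compare whose outcome is already known.)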
That last cmpdi should be either "cmpdi cr1,%0,1" or crclr, not that it
made much difference.

The test seemed rather sensitive to additional instructions inserted at
the beginning of the asm statement (especially isync), so the initial
instructions before the loop are probably pairing with something outside
the asm.

That said, it looks like this patch at least doesn't make things worse,
and it does convert a cmpdi to a more readable crclr, so I guess I'll
apply it even though it doesn't show any measurable benefit when timing
entire TLB misses (much less actual applications).

I suspect the point where I misunderstood the core manual was where it
lists lbarx as having a repeat rate of 3 cycles. I probably assumed that
was due to the presync, and that a subsequent unrelated load could
therefore execute partially in parallel, but it looks like the repeat
rate describes how long it is until the execution unit can accept any
other instruction.

-Scott
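For anyone who wants to reproduce this kind of measurement, the test
module is only partially quoted above. Below is a self-contained sketch
of what the complete thing might look like; the stbcx. retry tail, the
asm constraints, the get_tb() timing, and the module boilerplate
(including the test_lbarx names) are assumptions filled in around the
quoted fragment, not Kevin's actual code.

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/printk.h>
#include <asm/time.h>		/* get_tb(); not part of the quoted fragment */

static void test1(void)
{
	int i;
	unsigned char lock, c;
	unsigned short cpu, s;

	for (i = 0; i < 100000; i++) {
		lock = 0;
		cpu = 1;

		/* Miniature of the tcd lock sequence under discussion:
		 * grab a reservation on the lock byte, check that it is
		 * free, then try to store our cpu id into it. */
		asm volatile (
"1:		lbarx	%0,0,%2\n\
		lhz	%1,0(%3)\n\
		cmpdi	%0,0\n\
		crclr	cr1*4+eq\n\
		bne	2f\n\
		stbcx.	%1,0,%2\n\
		bne	1b\n\
2:"
			: "=&r" (c), "=&r" (s)
			: "r" (&lock), "r" (&cpu)
			: "cr0", "cr1", "memory");
	}
}

static int __init test_lbarx_init(void)
{
	u64 tb0, tb1;

	tb0 = get_tb();
	test1();
	tb1 = get_tb();
	pr_info("test1: %llu timebase ticks for 100000 iterations\n",
		tb1 - tb0);

	return 0;
}

static void __exit test_lbarx_exit(void)
{
}

module_init(test_lbarx_init);
module_exit(test_lbarx_exit);
MODULE_LICENSE("GPL");

Loading the module prints the timebase delta for the whole loop;
comparing the two sequences is then just a matter of swapping the crclr
line back to "cmpdi cr1,%1,1" and reloading.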