On Tue, May 26, 2015 at 04:37:46PM +0930, Alan Modra wrote:
> On powerpc64le, modifying the way combine treats function parameters
> and call arguments results in some regressions.
>
> For instance, this testcase from varasm.c
>
> extern int foo3 (void *, ...);
> extern void foo4 (void *, const char *);
> int
> emit_tls_common (void *decl,
>                  const char *name,
>                  unsigned long size)
> {
>   foo3 (0, "\t%s\t", "..");
>   foo4 (0, name);
>   foo3 (0, ",%lu,%u\n", size, ((unsigned int *)decl)[88] / 8);
>   return 1;
> }
>
> at -O2 produces for the prologue and first call
>
>   old                     new
>   mflr 0                  mflr 0
>   std 29,-24(1)           std 29,-24(1)
>   std 30,-16(1)           std 30,-16(1)
>   mr 29,4                 addis 9,2,.LC0@toc@ha
>   std 31,-8(1)            std 31,-8(1)
>   addis 4,2,.LC1@toc@ha   addis 10,2,.LC1@toc@ha
>   mr 31,5                 addi 9,9,.LC0@toc@l
>   addis 5,2,.LC0@toc@ha   addi 10,10,.LC1@toc@l
>   mr 30,3                 mr 30,3
>   addi 5,5,.LC0@toc@l     mr 29,4
>   addi 4,4,.LC1@toc@l     mr 31,5
>   li 3,0                  mr 4,10
>   std 0,16(1)             mr 5,9
>   stdu 1,-128(1)          std 0,16(1)
>   bl foo3                 stdu 1,-128(1)
>   nop                     li 3,0
>                           bl foo3
>                           nop
>
> As you can see, we have some extra register shuffling from keeping a
> pseudo for arg setup insns.  I guess the pseudos allow sched more
> freedom to mess around..
... and then RA isn't able to move things back.  I see this happening
with all three changes (return value, incoming args, outgoing args);
the changes to combine give sched1 and RA more freedom, but those then
end up generating lots of unnecessary register moves.

> On the positive side, I saw cases where keeping parameter pseudos
> allowed shrink-wrap to occur.  varasm.c:decode_reg_name_and_count is
> one of them.

More shrink-wrapping is a big win.

> Here's a case where changes at the return result in poorer code
>
> int
> decl_readonly_section_1 (int category)
> {
>   switch (category)
>     {
>     case 1:
>     case 2:
>     case 3:
>     case 4:
>     case 5:
>       return 1;
>     default:
>       return 0;
>     }
> }
>
>   old                 new
>   addi 9,3,-6         addi 9,3,-6
>   neg 3,3             neg 3,3
>   and 3,9,3           and 3,9,3
>   rldicl 3,3,33,63    srwi 3,3,31
>   blr                 rldicl 3,3,0,32
>                       blr
>
> Previously this:
>
> (insn 35 34 36 2 (set (reg:SI 161)
>         (lshiftrt:SI (reg:SI 164)
>             (const_int 31 [0x1f]))) {lshrsi3})
> (insn 36 35 23 2 (set (reg:DI 155 [ D.2441 ])
>         (zero_extend:DI (reg:SI 161))) {zero_extendsidi2})
> (insn 23 36 24 2 (set (reg/i:DI 3 3)
>         (reg:DI 155 [ D.2441 ])) {*movdi_internal64})
>
> is first combined to
>
> (insn 35 34 36 2 (set (reg:SI 161)
>         (lshiftrt:SI (reg:SI 164)
>             (const_int 31 [0x1f]))) {lshrsi3})
> (insn 23 35 24 2 (set (reg/i:DI 3 3)
>         (and:DI (subreg:DI (reg:SI 161) 0)
>             (const_int 1 [0x1]))))
>
> which is somewhat surprising, but from my previous forays into
> combine.c I'd say happens due to known zero bits.  (Just looking at
> dumps here, rather than in gdb.)
>
> Then the above is further combined to
>
> (insn 23 34 24 2 (set (reg/i:DI 3 3)
>         (zero_extract:DI (subreg:DI (reg:SI 164) 0)
>             (const_int 1 [0x1])
>             (const_int 32 [0x20]))))
>
> Looks to me like a missed optimization opportunity that insns 35 and
> 36 aren't combined without first going through the intermediate step.

The rs6000 backend doesn't have zero_extend variants of many of its
patterns, only some.  Well-known problem, long-term project.
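To make the "known zero bits" reasoning concrete: after the lshiftrt:SI
by 31, at most bit 0 of the result can be nonzero, so zero-extending
that result to DImode, and'ing a DImode view of it with 1, and
extracting bit 31 of the shift operand itself all compute the same
value (rs6000 has BITS_BIG_ENDIAN, so zero_extract position 32 in
DImode is bit 31 counting from the LSB).  A minimal standalone C check
of that equivalence, just illustrating the arithmetic:

#include <assert.h>
#include <stdint.h>

int
main (void)
{
  for (int i = 0; i < 32; i++)
    {
      uint32_t x = UINT32_C (1) << i;
      /* insns 35+36: zero_extend:DI (lshiftrt:SI x 31).  */
      uint64_t ext = (uint64_t) (x >> 31);
      /* first combination: and:DI (subreg:DI ...) 1; only bit 0 of
         the subreg survives the masking.  */
      uint64_t anded = ((uint64_t) (x >> 31)) & 1;
      /* final form: zero_extract:DI x 1 32, i.e. bit 31 from the
         LSB of x.  */
      uint64_t extracted = ((uint64_t) x >> 31) & 1;
      assert (ext == anded && anded == extracted);
    }
  return 0;
}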
> Anyway, here's the rewritten patch.  I've left in some knobs I used
> when testing in case you want to see for yourself what happens with
> various options.  Bootstrapped etc. powerpc64le-linux and
> x86_64-linux.
>
> +#define DONT_COMBINE_PARAMS 1
> +#define DONT_COMBINE_CALL_ARGS 1

I tested with all combinations of those knob settings, building Linux
kernels (mostly defconfigs); these are the resulting text sizes:

  master    alan00    alan10    alan01    alan11
 5432728   5432728   5433848   5435472   5436080  alpha
 3851131   3851391   3852495   3852567   3853755  arm
 2190716   2190716   2190716   2190708   2190708  blackfin
 2191439   2191503   2191983   2192335   2192751  c6x
 2213186   2213250   2213154   2213482   2213546  cris
 3322420   3322420   3322500   3322564   3322692  frv
10898664  10898664  10898664  10898664  10898664  i386
 3253459   3253539   3253599   3255235   3255331  m32r
 4708528   4708532   4709772   4708660   4709656  microblaze
 3949689   3949745   3950269   3952401   3952857  mips
 5693823   5693975   5694443   5697971   5698227  mips64
 2374664   2374708   2375126   2375485   2375841  mn10300
 7488219   7488419   7492299   7489743   7493431  parisc
 6195727   6195935   6196367   6221647   6221687  parisc64
 8688975   8689111   8691931   8692619   8695279  powerpc
20252077  20252337  20255185  20266897  20269477  powerpc64
11423734  11421858  11423946  11422726  11424838  s390
 6488342   6488406   6488534   6489110   6489302  sh64
 1545652   1545652   1545776   1545812   1545944  shnommu
 3737005   3736973   3737357   3737585   3737945  sparc
 6075342   6074426   6074762   6075570   6075834  sparc64
12301607  12301607  12301543  12301787  12301723  x86_64
 1954029   1954061   1954441   1948841   1949305  xtensa

As you see, almost everything regresses code size.  s390 and sparc and
xtensa like some of it; everything else generates more register moves.
A typical one for powerpc, with the "00" setting:

  before              after
  rlwinm 3,9,N,N,N    rlwinm 9,9,N,N,N
                      mr 3,9

(and then blr etc., no further uses of r3 or r9; nothing special about
rlwinm here, there also are "and" and "addi" cases, etc.)

I think we should at least ameliorate this regression before we can
apply any variant of this :-(


Segher