On Thu, Apr 18, 2013 at 6:22 AM, Jeff Law wrote:
> On 04/17/2013 03:52 PM, Steven Bosscher wrote:
>>
>> First of all: What is still important to handle?
>>
>> It's clear that the expectations in reorg.c are "anything goes" but
>> modern RISCs (everything since the PA-8000, say) probably have some
>> limitations on what is helpful to have, or not have, in a delay slot.
>> According to the comments in pa.h about MASK_JUMP_IN_DELAY, having
>> jumps in delay slots of other jumps is one such thing: They don't
>> bring benefit to the PA-8000 and they don't work with DWARF2 CFI. As
>> far as I know, SPARC and MIPS don't allow jumps in delay slots, SH
>> looks like it doesn't allow it either, and CRIS can do it for short
>> branches but doesn't do so because the trade-off between benefit and
>> machine description complexity comes out negative.
>
> Note that sparc and/or mips might use the adjust-the-return-pointer trick.
> I know it wasn't my idea when I added it to the PA.

If they do that trick, it's not documented how that should work. It
doesn't look like they do, though. Test case:

---- 8< ----
void f1 (void);
void f2 (void);
void f3 (void);

void
foo (long a)
{
  if (a != 0)
    {
      f1 ();
      goto skip_some;
    }
  else
    f2 ();
skip_some:
  f3 ();
}
---- 8< ----

sparc64 assembly (with -O2 -fno-reorder-blocks):

---- 8< ----
foo:
        save    %sp, -176, %sp
        brz,pt  %i0, .L2
        nop
        call    f1, 0
        nop
        ba,pt   %xcc, .L3
        nop
.L2:
        call    f2, 0
        nop
.L3:
        call    f3, 0
        restore
---- 8< ----

sparc32 is identical except for the frame size.

mipsisa64 assembly (also with -O2 -fno-reorder-blocks):

---- 8< ----
foo:
        .frame  $sp,8,$31       # vars= 0, regs= 1/0, args= 0, gp= 0
        .mask   0x80000000,0
        .fmask  0x00000000,0
        .set    noreorder
        .set    nomacro
        daddiu  $sp,$sp,-8
        beq     $4,$0,$L2
        sd      $31,0($sp)
        jal     f1
        nop
        j       $L6
        ld      $31,0($sp)
        .align  3
$L2:
        jal     f2
        nop
$L3 = .
        ld      $31,0($sp)
$L6:
        j       f3
        daddiu  $sp,$sp,8
---- 8< ----

>> On the scheduler
>> implementation side: Branches as delayed insns in delay slots of other
>> branches are impossible to express in the CFG (at least in GCC, but I
>> think in general it can't be done cleanly). Therefore I want to drop
>> support for branches in delay slots. What do you think about this?
>
> Certainly no need to support it in the generic case. The only question is
> whether or not it's worth supporting the adjust the return pointer in the
> delay slot stuff. Given a target without call/ret predictor stack, it can
> be a significant advantage. Such things might exist in the embedded space.

This shouldn't be very difficult to support if the target models this
as a jump in the delay slot of calls only. I can let the delay slot
filler allow jumps in delay slots of calls but not in delay slots of
other jumps. But for the moment I'm going to ignore this case unless
someone knows a target in the FSF tree that would benefit from it.

>> What about multiple delay slots?
>> It looks like reorg.c has code to
>> handle insns with multiple delay slots, but there currently are no GCC
>> targets in the FSF tree that have insns with multiple delay slots and
>> that use define_delay.
>
> Ping Hans, I think he was the last person who tried to deal with reorg and
> multiple delay slots (c4x?). I certainly wouldn't lose any sleep if we
> killed the limited support for multiple delay slots.

Right, c4x has 3 delay slots. There are also out-of-tree ports for
targets like SHARC. But most such DSP-like targets have some form of
support for predication, so Bernd's c6x scheme would be a better fit.
(And c4x is too old to care about anyway :-)

>> Another thing I noticed about targets with delay slots that can be
>> nullified, is that at least some of the ifcvt.c transformations could
>> be applied to fill more delay slots (obviously if_case_1 and
>> if_case_2). In reorg.c, optimize_skip does some kind of if-conversion.
>> Has anyone looked at whether optimize_skip still does something, and
>> derived a test case for that?
>
> I doubt anyone has looked at it recently. It pre-dates our if-conversion
> code by a decade or more.

So I collected some stats myself, for a small number (31) of files of
gcc itself, mostly from libcpp and various generator files, compiled
at -O2 for sparc64:

        pass 1                  pass 2
total   simple  eager   skip    simple  eager   skip
insns   9743    3488    22      1297    525     0
filled  5918    2980    22      21      0       0
hit%    61%     31%     0%      0%      0%      0%

total   pass 1  pass 2
insns   9743    1297
filled  8920    21
hit%    92%     2%

So the first fill_simple_delay_slots pass fills ~60% of the slots, and
the first fill_eager_delay_slots fills another ~30%. The second pass
is not very effective.

The "skip" column is for a separate counter for optimize_skip. It
triggers only 22 times in the first pass. If I disable if-conversion
(-fno-if-conversion{,2}) this number goes up to 31, and with
if-conversion *and* bb-reorder disabled (-fno-reorder-blocks) it hits
63 times.
Even at -Os, without if-conversion, without bb-reorder, optimize_skip
only hits 77 times (compared to filled simple:7216; filled eager:3563).

The effect of if-conversion is interesting on its own. With
if-conversion enabled there are 9743 insns needing delay slots.
Without if-conversion there are 12565 insns needing delay slots, but
they are apparently only a little more difficult to fill, because the
second pass fills more slots:

total   pass 1  pass 2
insns   12565   1434
filled  11185   66
hit%    89%     5%

In pass 1, simple fills 7420 slots (59%) and eager fills 3765 slots
(30%), which appears to be the typical ratio for sparc, but also for
the mips targets I've played with.

Ciao!
Steven
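
For anyone who wants to re-check the hit% figures in the totals tables:
they look like filled/insns as a whole percentage. A minimal sketch of
that arithmetic follows. Rounding to nearest is an assumption made
here, and hit_percent is a hypothetical helper for illustration, not
something that exists in reorg.c.

```c
#include <assert.h>

/* Hit rate as a whole percentage, rounded to nearest (assumed).
   FILLED is the number of delay slots that were filled, TOTAL the
   number of insns needing a delay slot.  Hypothetical helper, not
   part of reorg.c.  */
unsigned
hit_percent (unsigned filled, unsigned total)
{
  assert (total > 0);
  assert (filled <= total);
  /* Adding total / 2 before dividing rounds to nearest.  */
  return (filled * 100 + total / 2) / total;
}
```

With the counts from the tables, hit_percent (8920, 9743) gives 92 and
hit_percent (11185, 12565) gives 89, and the pass 1 counters add up
as expected (5918 + 2980 + 22 = 8920).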