On Wed, Dec 2, 2015 at 5:11 AM, Steve Ellcey <sell...@imgtec.com> wrote:
>
> I have a question involving ivopts and PR 48814, which was a fix for
> the post increment operation.  Prior to the fix for PR 48814, MIPS
> would generate this loop for strcmp (C code from glibc):
>
> $L4:
>         lbu     $3,0($4)
>         lbu     $2,0($5)
>         addiu   $4,$4,1
>         beq     $3,$0,$L7
>         addiu   $5,$5,1         # This is a branch delay slot
>         beq     $3,$2,$L4
>         subu    $2,$3,$2        # This is a branch delay slot (only used after loop)
>
>
> With the current top-of-tree we now generate:
>
>         addiu   $4,$4,1
> $L8:
>         lbu     $3,-1($4)
>         addiu   $5,$5,1
>         beq     $3,$0,$L7
>         lbu     $2,-1($5)       # This is a branch delay slot
>         beq     $3,$2,$L8
>         addiu   $4,$4,1         # This is a branch delay slot
>
>         subu    $2,$3,$2        # Done only once now after exiting loop.
>
> The main problem with the new loop is that the beq comparing $2 and $3
> is right before the load of $2 so there can be a delay due to the time
> that the load takes.  The ideal code would probably be:
>
>         addiu   $4,$4,1
> $L8:
>         lbu     $3,-1($4)
>         lbu     $2,0($5)        # This is a branch delay slot
>         beq     $3,$0,$L7
>         addiu   $5,$5,1
>         beq     $3,$2,$L8
>         addiu   $4,$4,1         # This is a branch delay slot
>
>         subu    $2,$3,$2        # Done only once now after exiting loop.
>
> Where we load $2 earlier (using a 0 offset instead of a -1 offset) and
> then do the increment of $5 after using it in the load.  The problem
> is that this isn't something that can just be done in the instruction
> scheduler because we are changing one of the instructions (to modify the
> offset) in addition to rearranging them and I don't think the instruction
> scheduler supports that.

Hmm, I think Bernd introduced sched_flag !DONT_BREAK_DEPENDENCIES to
resolve dependences by modifying the address expression.  I think this
is the same problem; what's needed is to model the dependence using
that framework.  Maybe the delay slot is special here?

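To make the two addressing choices concrete, here is a small C sketch of a byte-wise strcmp written both ways. This is only an illustration, not the glibc source and not what ivopts literally emits: the function names strcmp_postinc and strcmp_preinc are made up for this mail. The first variant loads at offset 0 and then bumps the pointer; the second bumps the pointer first and loads at offset -1, which is roughly the shape of the top-of-tree loop quoted above.

```c
/* Hypothetical sketch: two equivalent byte-wise string compares that
   differ only in where the pointer increment sits relative to the load. */

/* Post-increment form: load through the current pointer (offset 0),
   then advance it. */
int strcmp_postinc(const char *p1, const char *p2)
{
    const unsigned char *s1 = (const unsigned char *) p1;
    const unsigned char *s2 = (const unsigned char *) p2;
    unsigned char c1, c2;

    do {
        c1 = *s1++;             /* load at offset 0, then increment */
        c2 = *s2++;
        if (c1 == '\0')
            return c1 - c2;
    } while (c1 == c2);

    return c1 - c2;
}

/* Pre-increment form: advance the pointer first, then load the byte
   just behind it (offset -1). */
int strcmp_preinc(const char *p1, const char *p2)
{
    const unsigned char *s1 = (const unsigned char *) p1;
    const unsigned char *s2 = (const unsigned char *) p2;
    unsigned char c1, c2;

    do {
        ++s1;                   /* increment first ...            */
        ++s2;
        c1 = s1[-1];            /* ... then load at offset -1     */
        c2 = s2[-1];
        if (c1 == '\0')
            break;
    } while (c1 == c2);

    return c1 - c2;
}
```

Both functions return the same result for any pair of strings; the only difference is whether each load depends on the increment that precedes it. In the second form the load cannot be hoisted above the increment without also rewriting the offset, which is the kind of dependence the scheduler would have to break.
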
>
> It looks like is the ivopts code that decided to increment the registers
> first and use the -1 offsets in the loads after instead of using 0 offsets
> and then incrementing the offsets after the loads but I can't figure out
> how or why ivopts made that decision.
>
> Does anyone have any ideas on how I could 'fix' GCC to make it generate
> the ideal code?  Is there some way to do it in the instruction scheduler?
> Is there some way to modify ivopts to fix this by modifying the cost

It's likely IVO just picks the first candidate when it runs into a tie.
Could you please post the preprocessed source code so that I can have a
look?  I am not familiar with glibc.

Thanks.

> analysis somehow?  Could I (partially) undo the fix for PR 48814?
> According to the final comment in that bugzilla report the change is
> really only needed for C11 and that the change does degrade the optimizer
> so could we go back to the old behaviour for C89/C99?  The code in ivopts

I saw that this change caused a code size regression on ARM embedded
processors.

Thanks,
bin

> has changed enough since the patch was applied I couldn't immediately see
> how to do that in the ToT sources.
>
> Steve Ellcey
> sell...@imgtec.com