On 12/01/2015 02:11 PM, Steve Ellcey wrote:
With the current top-of-tree we now generate:
addiu $4,$4,1
$L8:
lbu $3,-1($4)
addiu $5,$5,1
beq $3,$0,$L7
lbu $2,-1($5) # This is a branch delay slot
beq $3,$2,$L8
addiu $4,$4,1 # This is a branch delay slot
subu $2,$3,$2 # Done only once now after exiting loop.
The main problem with the new loop is that the beq comparing $2 and $3
comes right after the load of $2, so the branch can stall waiting for
the load to complete. The ideal code would probably be:
I'd start by looking at the code prior to reorg/delay slot scheduling.
It may be the case that you're running into the well-known issue that
reorg knows nothing about latency/scheduling issues and happily
picks whatever insn can safely fill the delay slot. In doing so, reorg
may muck up the schedule badly.
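
To make the concern concrete, here is a rough Python sketch of a toy
in-order, single-issue pipeline that counts load-use stalls for one
iteration of the loop quoted above. Everything in it is an assumption
made for illustration: the 2-cycle load latency, the 1-cycle latency for
everything else, the absence of any branch penalty, and the names
LOADS, count_stalls and generated_loop. It does not model a real MIPS
core or anything reorg actually does.

```python
# Toy model only: assumed latencies, no branch penalty, single issue.
LOADS = {"lbu"}          # opcodes treated as loads (assumption)

def count_stalls(insns, load_latency=2):
    """Count stall cycles for a straight-line list of instructions.

    insns is a list of (opcode, dest, sources) tuples given in
    execution order, so a delay-slot insn follows its branch.
    """
    ready = {}           # register -> first cycle its value is usable
    cycle = 0            # cycle at which the next insn could issue
    stalls = 0
    for opcode, dest, sources in insns:
        # An insn issues once all of its source registers are ready.
        start = max([cycle] + [ready.get(reg, 0) for reg in sources])
        stalls += start - cycle
        latency = load_latency if opcode in LOADS else 1
        if dest is not None:
            ready[dest] = start + latency
        cycle = start + 1
    return stalls

# One iteration of the generated loop, in execution order.
generated_loop = [
    ("lbu",   "$3", ["$4"]),         # lbu   $3,-1($4)
    ("addiu", "$5", ["$5"]),         # addiu $5,$5,1
    ("beq",   None, ["$3", "$0"]),   # beq   $3,$0,$L7
    ("lbu",   "$2", ["$5"]),         # lbu   $2,-1($5)  (delay slot)
    ("beq",   None, ["$3", "$2"]),   # beq   $3,$2,$L8
    ("addiu", "$4", ["$4"]),         # addiu $4,$4,1    (delay slot)
]

stalls_with_slow_load = count_stalls(generated_loop, load_latency=2)
stalls_with_fast_load = count_stalls(generated_loop, load_latency=1)
```

Under these assumptions the only stall comes from the second beq
waiting on the lbu that sits in the previous branch's delay slot, and
it disappears when the load latency is set to 1.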
If that's the case you might test disallowing operations with > 1 cycle
latency in delay slots and see how that affects a wider range of benchmarks.
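
As a sketch of what that experiment amounts to, the filter below
rejects any delay-slot candidate whose latency exceeds one cycle and
falls back to a nop. The latency table and the names ASSUMED_LATENCY
and pick_delay_slot_insn are hypothetical and exist only for this
example; the real change would be made in the port's delay-slot
eligibility rules, not in code like this.

```python
# Hypothetical latencies in cycles; not taken from any real pipeline description.
ASSUMED_LATENCY = {"lbu": 2, "addiu": 1, "subu": 1}

def pick_delay_slot_insn(candidates, max_latency=1, latency=ASSUMED_LATENCY):
    """Return the first candidate opcode allowed in a delay slot.

    Candidates are assumed to be already known safe to move into the
    slot; this only adds the latency restriction. Opcodes missing from
    the table are conservatively treated as too slow.
    """
    for opcode in candidates:
        if latency.get(opcode, max_latency + 1) <= max_latency:
            return opcode
    return "nop"         # nothing cheap enough: leave the slot empty

# With the restriction, the load is skipped in favour of the addiu.
restricted = pick_delay_slot_insn(["lbu", "addiu"])
# Raising the limit to 2 cycles allows the load again.
unrestricted = pick_delay_slot_insn(["lbu", "addiu"], max_latency=2)
```

Whether the extra nops cost more than the stalls they avoid is exactly
what the wider benchmark run would have to show.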
Jeff