On 09/18/2015 05:10 AM, Simon Dardis wrote:
Are you trying to say that you have the option as to what kind of
branch to use?  ie, "ordinary", presumably without a delay slot or one
with a delay slot?

Is the "ordinary" actually just a nullified delay slot or some form of
likely/not likely static hint?

Specifically for MIPSR6: the ISA possesses traditional delay slot branches and
a normal branch (no delay slots, not annulling, no hints, subtle static hazard),
aka "compact branch" in MIPS terminology. They could be described as
nullify-on-taken delay slot branches, but we saw little to no value in that.

Matthew Fortune provided a writeup with their handling in GCC:

https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01892.html
Thanks. I never looked at that message, almost certainly because it was MIPS specific. I'm trying hard to stay out of backends that have good active maintainers, and MIPS certainly qualifies on that point.



But what is the compact form at the micro-architectural level?  My
mips-fu has diminished greatly, but my recollection is the bubble is
always there.   Is that not the case?

The pipeline bubble will exist but the performance impact varies across
R6 cores. High-end OoO cores won't be impacted as much, but lower
end cores will. microMIPSR6 removes delay slot branches altogether which
pushes the simplest micro-architectures to optimize away the cost of a
pipeline bubble.
[ ... snip more micro-architecture stuff ... ]
Thanks. That helps a lot. I didn't realize the bubble was being squashed to varying degrees. And FWIW, I wouldn't be surprised if you reach a point on the OoO cores where you'll just want to move away from delay slots totally and rely on your compact branches as much as possible. It may give your hardware guys a degree of freedom that helps them in the common case (compact branches) at the expense of slowing down code with old fashioned delay slots.

Compact branches do have a strange restriction in that they cannot be followed by a
CTI. This is to simplify branch predictors apparently but this may be lifted in
future ISA releases.
Come on! :-) There are some really neat things you can do when you allow branches in delay slots. The PA was particularly fun in that regard. My recollection is HP had some hand-written assembly code in their libraries which exploited the out-of-line execution you could get in this case. We never tried to exploit it in GCC simply because the opportunities didn't seem all that common or profitable.




If it is able to find insns from the commonly executed path that don't
have a long latency, then the fill is usually profitable (since the
pipeline bubble always exists).  However, pulling a long latency
instruction (say anything that might cache miss or an fdiv/fsqrt) off
the slow path and conditionally nullifying it can be *awful*.
Everything else is in-between.
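
The tradeoff described above can be illustrated with a toy expected-cost model. This is only a sketch under stated assumptions: the struct, its fields, and the function name are hypothetical and are not part of GCC's reorg pass, and the cycle counts are illustrative inputs rather than measurements of any MIPS core.

```c
#include <assert.h>

/* Hypothetical description of an insn considered for a speculative fill. */
struct fill_candidate {
    int wasted_cycles_if_annulled; /* assumed stall cost when the insn's work is thrown away */
    double prob_useful;            /* assumed probability that the insn's path is executed */
};

/* Expected cycles gained (positive) or lost (negative) by filling the slot
   speculatively instead of accepting a bubble of bubble_cycles. */
double expected_fill_gain(struct fill_candidate c, int bubble_cycles)
{
    assert(c.prob_useful >= 0.0 && c.prob_useful <= 1.0);
    double saved = c.prob_useful * bubble_cycles;
    double wasted = (1.0 - c.prob_useful) * c.wasted_cycles_if_annulled;
    return saved - wasted;
}
```

With a one-cycle bubble and a 75% useful path, a one-cycle insn comes out ahead in this model, while an insn assumed to cost 20 cycles when annulled (a cache miss or an fdiv, say) comes out well behind.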

I agree. The variability in profit/loss is a concern and I see two ways to deal
with it:

A) modify the delay slot filler so that it chooses speculative instructions of
less than some $cost and avoids instruction duplication when the eager filler
picks an instruction from a block with multiple predecessors. Making such
changes would be invasive and require more target specific hooks.
The cost side here should be handled by existing mechanisms. You just never allow anything other than simple arith, logicals & copies.

You'd need a hook to avoid this when copying was needed.

You'd probably also need some kind of target hook to indicate the level of prediction where this is profitable since the cost varies across your micro-architectures.

And you'd also have to worry about the special code which triggers when there's a well predicted branch, but a resource conflict. In that case reorg can fill the slot from the predicted path and insert compensation code on the non-predicted path.
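
The restrictions discussed for option A (simple insns only, no copying, and a per-target prediction threshold) could be combined roughly as follows. This is a sketch only: the instruction classes, the structs, and `may_fill_speculatively` are hypothetical stand-ins, since reorg works on RTL insns and would use real target hooks. It also does not model the compensation-code case.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical instruction classes, for illustration only. */
enum insn_class {
    INSN_ARITH, INSN_LOGICAL, INSN_COPY,
    INSN_LOAD, INSN_STORE, INSN_FDIV, INSN_FSQRT
};

struct fill_request {
    enum insn_class cls;
    bool needs_copy;         /* insn would have to be duplicated to fill the slot */
    int branch_prob_percent; /* estimated probability of the predicted path, 0..100 */
};

/* Hypothetical per-target tuning value standing in for a target hook. */
struct target_tuning {
    int min_prob_percent;    /* prediction level at which speculation pays off */
};

/* Only simple arithmetic, logicals and copies are considered cheap. */
bool is_simple_insn(enum insn_class cls)
{
    return cls == INSN_ARITH || cls == INSN_LOGICAL || cls == INSN_COPY;
}

bool may_fill_speculatively(const struct fill_request *r,
                            const struct target_tuning *t)
{
    assert(r && t);
    if (!is_simple_insn(r->cls))
        return false;        /* anything that may be long latency is rejected */
    if (r->needs_copy)
        return false;        /* avoid instruction duplication */
    return r->branch_prob_percent >= t->min_prob_percent;
}
```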




B) Use compact branches instead of speculative delay slot execution and forsake
variable performance for a consistent pipeline bubble by not using the
speculative delay filler altogether.

Between these two choices, B seems to be the better option due to its sheer simplicity.
Choosing neither gives speculative instruction execution when there could be a
small consistent penalty instead.
B is certainly easier.

The main objection I had was that, given my outdated knowledge of the MIPS processors, it seemed like you were taking a step backwards. You've cleared that up and if you're comfortable with the tradeoff, then I won't object to the target hook to disable eager filling.
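
A target hook of that kind might look roughly like the sketch below. The hook table, function names, and enum are hypothetical and chosen for illustration; they are not GCC's `targetm` interface or the contents of the patch under discussion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical hook table with a single hook. */
struct target_hooks {
    bool (*allow_eager_delay_fill)(void);
};

/* Default: targets keep today's behaviour and allow speculative fills. */
bool default_allow_eager_delay_fill(void)
{
    return true;
}

/* A target that prefers compact branches turns speculative fills off. */
bool compact_branch_target_allow_eager_delay_fill(void)
{
    return false;
}

enum branch_form { BRANCH_DELAY_SLOT_FILLED, BRANCH_COMPACT };

/* have_safe_fill: a non-speculative insn is available for the slot. */
enum branch_form choose_branch_form(const struct target_hooks *hooks,
                                    bool have_safe_fill)
{
    assert(hooks && hooks->allow_eager_delay_fill);
    if (have_safe_fill)
        return BRANCH_DELAY_SLOT_FILLED; /* safe fill, always worthwhile */
    if (hooks->allow_eager_delay_fill())
        return BRANCH_DELAY_SLOT_FILLED; /* speculative (eager) fill */
    return BRANCH_COMPACT;               /* accept the consistent bubble */
}
```

In this sketch, disabling the hook only affects the speculative case; a slot that can be filled safely is still filled.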

Can you repost that patch? Given I was the last one to do major work on reorg (~20 years ago mind you) it probably makes the most sense for me to own the review.

jeff

