On 6/12/24 12:47 AM, Richard Biener wrote:
> One of the points I wanted to make is that sched1 can make quite a
> difference as to the relative distance of the store and load and
> we have the instruction window the pass considers when scanning
> (possibly driven by target uarch details). So doing the rewriting
> before sched1 might not be ideal (but I don't know how much cleanup
> work the pass leaves behind - there's nothing between sched1 and RA).
ACK. I guess I'm just skeptical about how much separation we can get
in practice from scheduling.
As far as cleanup opportunity goes, it likely comes down to how clean
the initial codegen is for the bitfield insertion step.
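
For anyone following along, here is a rough C-level sketch of the kind of
sequence being discussed: a narrow store followed by a wider overlapping
load, next to an equivalent form that loads first and then inserts the
stored byte into the loaded value. The function names are made up for
illustration, and this is only a source-level analogy, not the RTL the
pass actually emits.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical example: a 1-byte store followed by a 4-byte load that
   overlaps it.  The wide load cannot be satisfied from the narrow
   store alone, which is the case where forwarding may fail. */
uint32_t store_then_load(unsigned char *mem, uint8_t b)
{
    uint32_t v;
    mem[0] = b;              /* narrow store */
    memcpy(&v, mem, 4);      /* wider load overlapping the store */
    return v;
}

/* Hypothetical equivalent: do the wide load first, keep the store, and
   patch the stored byte into the loaded value.  On a register this
   patch is a bit-field insertion. */
uint32_t load_then_insert(unsigned char *mem, uint8_t b)
{
    uint32_t v;
    memcpy(&v, mem, 4);      /* wide load of the old contents */
    mem[0] = b;              /* the store still happens */
    memcpy(&v, &b, 1);       /* insert b at the byte matching mem[0] */
    return v;
}
```

Both functions return the same value and leave memory in the same state.
How much cleanup the second form needs depends on how tightly the
insertion is generated.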
> On the hardware side I always wondered whether a failed store-to-load
> forward results in the load uop stalling (because the hardware actually
> _did_ see the conflict with an in-flight store) or whether this gets
> caught later as the hardware speculates a load from L1 (with the
> wrong value) but has to roll back because of the conflict. I would
> imagine the latter is cheaper to implement but worse in case of
> conflict.
I wouldn't be surprised to see both approaches being used and I suspect
it really depends on how speculative your uarch is. At some point
there's enough speculation going on that you can't detect the violation
early enough and you have to implement a replay/rollback scheme.
jeff