> I have added a target hook for this in v4 of this patch. The hook
> receives all the information about the stores, the load, the estimated
> sequence cost and whether we expect to eliminate the load. With this
> information the target should be able to make an informed decision.
> 
> What you mention is also true for AArch64: some microbenchmarking I
> did shows that some cores efficiently handle 32bit->64bit store
> forwarding while others do not, so creating a target hook is necessary
> for such cases.

Perhaps the 32->64 case could have a simple generic target flag. I presume
it will be common.

On x86 there are lots of other cases too, and the details vary based on
the microarchitecture. I wonder if there is an efficient way to encode
that in a table.
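
To make the table idea concrete, here is a rough sketch, assuming the
penalty depends only on the store width and the load width. Everything in
it (sf_table, sf_penalty, the example cores and the cycle counts) is made
up for illustration and is not an existing GCC interface or measured data.

```c
/* Hypothetical per-core store-forwarding table; not an existing GCC API.  */
enum sf_width { SF_8, SF_16, SF_32, SF_64, SF_NUM_WIDTHS };

/* penalty[store][load] is the estimated extra cycles when a load of the
   given width reads bytes written by an earlier store of the given width.
   Zero means the core forwards that combination efficiently.  */
struct sf_table
{
  unsigned char penalty[SF_NUM_WIDTHS][SF_NUM_WIDTHS];
};

/* Map a width in bits to a table index, or -1 if it is not covered.  */
static int
sf_width_index (unsigned bits)
{
  switch (bits)
    {
    case 8:  return SF_8;
    case 16: return SF_16;
    case 32: return SF_32;
    case 64: return SF_64;
    default: return -1;
    }
}

/* Return the estimated penalty in cycles, or -1 if either width is
   outside the table.  */
int
sf_penalty (const struct sf_table *t, unsigned store_bits, unsigned load_bits)
{
  int s = sf_width_index (store_bits);
  int l = sf_width_index (load_bits);
  if (s < 0 || l < 0)
    return -1;
  return t->penalty[s][l];
}

/* Invented numbers: core A forwards 32->64 for free, core B does not.
   Entries not listed default to zero.  */
static const struct sf_table example_core_a = {
  .penalty = { [SF_8][SF_64] = 12, [SF_16][SF_64] = 12 }
};

static const struct sf_table example_core_b = {
  .penalty = { [SF_8][SF_64] = 12, [SF_16][SF_64] = 12, [SF_32][SF_64] = 10 }
};
```

A target hook could then look up sf_penalty (table, 32, 64) for the
current tuning and compare the result against the estimated sequence cost.
Real cores likely also depend on alignment and on the offset of the load
within the stored bytes, which a two-dimensional table like this does not
capture.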

> This is still hard to tell. In some cases I have observed either
> improvements or regressions in benchmarks, which are highly susceptible
> to costing and the specific store-forwarding penalties of the CPU.
> I have seen cases where the store-forwarding instance is profitable to
> avoid but we get bad code generation due to other reasons (usually
> store_bit_field lowering not being good enough) and hence a
> regression.

I wonder if there could be some heuristic to skip the transformation in
those cases.

> So I believe more time and testing is needed to really evaluate the
> speedups that can be achieved.

So for now it would be off by default?

-Andi
