> I have added a target hook for this in v4 of this patch. The hook
> receives all the information about the stores, the load, the estimated
> sequence cost and whether we expect to eliminate the load. With this
> information the target should be able to make an informed decision.
>
> What you mention is also true for AArch64: some microbenchmarking I
> did shows that some cores efficiently handle 32bit->64bit store
> forwarding while others not, so creating a target hook is necessary
> for such cases.

Perhaps for the 32->64 case have a generic simple target flag. I presume
it will be common. On x86 there are lots of other cases too and the
details vary based on the micro architecture. I wonder if there is an
efficient way to encode that in a table.

> This is still hard to tell. In some cases I have observed either
> improvement or regressions in benchmarks, which are highly susceptible
> to costing and the specific store-forwarding penalties of the CPU.
> I have seen cases where the store-forwarding instance is profitable to
> avoid but we get bad code generation due to other reasons (usually
> store_bit_field lowering not being good enough) and hence a
> regression.

I wonder if there could be some heuristic to avoid it for those cases.

> So I believe more time and testing is needed to really evaluate the
> speedups that can be achieved.

So for now it would be off by default?

-Andi