On Mon, Jul 8, 2024 at 6:41 PM Andi Kleen <a...@linux.intel.com> wrote: > > > I have added a target hook for this in v4 of this patch. The hook > > receives all the information about the stores, the load, the estimated > > sequence cost and whether we expect to eliminate the load. With this > > information the target should be able to make an informed decision. > > > > What you mention is also true for AArch64: some microbenchmarking I > > did shows that some cores efficiently handle 32bit->64bit store > > forwarding while others not, so creating a target hook is necessary > > for such cases. > > Perhaps for the 32->64 case have a generic simple target flag. I presume it > will be common. > Yes, common cases should be easily enabled without custom hooks. In the AArch64 implementation I added some predefined tune flags to achieve that, but you're correct in that this should be available in a target independent way. Maybe one way would be to have two target hooks, one for choosing predefined behaviors and the other to decide in a fully custom way.
> On x86 there are lots of other cases too and the details vary based on > the micro architecture. I wonder if there is an efficient way to encode > that in a table. > Interesting, I'll think about whether a table can be used to help represent which cases should be avoided. Of course the combinations are too many when multiple stores per load are considered, but it all depends on what is common across most CPUs... which needs a lot of data to decide. > > This is still hard to tell. In some cases I have observed either > > improvement or regressions in benchmarks, which are highly susceptible > > to costing and the specific store-forwarding penalties of the CPU. > > I have seen cases where the store-forwarding instance is profitable to > > avoid but we get bad code generation due to other reasons (usually > > store_bit_field lowering not being good enough) and hence a > > regression. > > I wonder if there could be some heuristic to avoid it for those cases. > Well there is a sequence cost that is computed based on the expanded bit-insert instructions but it's still not too easy to decide. For AArch64 there is also the problem of not knowing in advance which loads/stores will form pair instructions. Apart from trying to have a reasonable costing scheme, I'm also observing where we get suboptimal codegen and try to improve that in the specific target. > > So I believe more time and testing is needed to really evaluate the > > speedups that can be achieved. > > So for now it would be off by default? > Yes, in all versions I submitted it's off by default, with the expectation that it can be enabled on a per-target basis, when there are performance benefits. Manolis > -Andi