On Thu, Jun 13, 2024 at 7:18 PM Andi Kleen <a...@linux.intel.com> wrote:
>
> Manolis Tsamis <manolis.tsa...@vrull.eu> writes:
> >
> > Assembly like this can appear with bitfields or type punning / unions.
> > On stress-ng when running the cpu-union microbenchmark the following
> > speedups have been observed.
> >
> >   Neoverse-N1:      +29.4%
> >   Intel Coffeelake: +13.1%
> >   AMD 5950X:        +17.5%
>
> It seems this should have some kind of target hook so that the target
> can configure what forwards should be avoided. At least in x86 land
> there is a trend to the hardware handling more and more cases with each
> generation.
>
Hi Andi,

I have added a target hook for this in v4 of this patch. The hook
receives all the information about the stores, the load, the estimated
sequence cost and whether we expect to eliminate the load. With this
information the target should be able to make an informed decision.
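
As a rough illustration of the kind of decision a target could make
from those inputs, here is a minimal sketch. It is not the hook from
the patch: every name below (store_desc, fwd_candidate, core_tuning,
should_avoid_forwarding) and the cost policy are hypothetical and only
mirror the information listed above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one store that overlaps the load. */
struct store_desc {
    unsigned offset_bytes;   /* offset of the store inside the loaded region */
    unsigned size_bytes;     /* width of the store */
};

/* Hypothetical bundle of the information the text says the hook receives. */
struct fwd_candidate {
    const struct store_desc *stores;
    size_t n_stores;
    unsigned load_size_bytes;
    int sequence_cost;       /* estimated cost of the replacement sequence */
    bool load_eliminated;    /* whether the load is expected to go away */
};

/* Hypothetical per-core tuning knobs. */
struct core_tuning {
    bool forwards_32_to_64;  /* core forwards 32-bit stores to a 64-bit load */
    int max_sequence_cost;   /* cost budget for the replacement sequence */
};

static bool all_stores_are_32bit(const struct fwd_candidate *c)
{
    for (size_t i = 0; i < c->n_stores; i++)
        if (c->stores[i].size_bytes != 4)
            return false;
    return true;
}

/* Return true if the transformation looks worthwhile on this core. */
bool should_avoid_forwarding(const struct fwd_candidate *c,
                             const struct core_tuning *t)
{
    if (c->n_stores == 0)
        return false;

    /* The hardware already handles this case well: leave the code alone. */
    if (t->forwards_32_to_64 && c->load_size_bytes == 8
        && all_stores_are_32bit(c))
        return false;

    /* Be stricter when the load stays: halve the budget (arbitrary policy). */
    int budget = c->load_eliminated ? t->max_sequence_cost
                                    : t->max_sequence_cost / 2;
    return c->sequence_cost <= budget;
}
```

The point of the sketch is only that the same candidate can be accepted
on one core and rejected on another, which is why the decision is left
to the target.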

What you mention is also true for AArch64: some microbenchmarking I
did shows that some cores handle 32bit->64bit store forwarding
efficiently while others do not, so a target hook is necessary for
such cases.
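
For reference, the 32bit->64bit case looks roughly like the following
at the source level. This is a minimal sketch, not code from the patch
or from stress-ng; the union and function names are made up for
illustration. Note that an optimizing compiler may keep such a small
example entirely in registers, so whether the store/load pattern
actually reaches memory depends on the surrounding code.

```c
#include <stdint.h>

/* Hypothetical example type: two 32-bit halves overlaid on one 64-bit word. */
union pun64 {
    uint64_t whole;
    uint32_t half[2];
};

/* Two narrow stores followed by one wide load of the same bytes.
   If this goes through memory, the 8-byte load overlaps two separate
   4-byte stores, which some cores cannot forward efficiently. */
uint64_t combine_via_union(uint32_t a, uint32_t b)
{
    union pun64 u;
    u.half[0] = a;    /* 4-byte store */
    u.half[1] = b;    /* 4-byte store */
    return u.whole;   /* 8-byte load overlapping both stores */
}

/* The same kind of value built in registers with a shift and an OR,
   so no load depends on an in-flight narrower store. The bit layout
   matches combine_via_union only on little-endian targets. */
uint64_t combine_via_shifts(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}
```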

> Also is there any data what this does to code size? Perhaps it should be
> only done on hot blocks?
>
I haven't seen any large code size increases in general. In large
benchmarks it's usually some tens or a few hundred instructions in
total. But in any case, for v4 I gate the pass on
optimize_insn_for_speed_p, so it is disabled when not optimizing for
speed, since we do expect a small size increase.

> And did you see speedups on real applications?

This is still hard to tell. I have observed both improvements and
regressions in benchmarks; the results are highly sensitive to costing
and to the specific store-forwarding penalties of the CPU.
I have also seen cases where the store-forwarding instance is
profitable to avoid but we get bad code generation for other reasons
(usually store_bit_field lowering not being good enough), which results
in a regression.
So I believe more time and testing is needed to really evaluate the
speedups that can be achieved.

Thanks,
Manolis
>
> -Andi
