On 9/7/24 1:09 AM, Richard Biener wrote:


On 06.09.2024 at 17:38, Andrew Carlotti <andrew.carlo...@arm.com> wrote:

Hi,

I'm working on optimising assignments to the AArch64 Floating-point Mode
Register (FPMR), as part of our FP8 enablement work.  Claudio has already
implemented FPMR as a hard register, with the intention that FP8 intrinsic
functions will compile to a combination of an fpmr register set, followed by an
FP8 operation that takes fpmr as an input operand.

It would clearly be inefficient to retain an explicit FPMR assignment prior to
each FP8 instruction (especially in the common case where every assignment uses
the same FPMR value).  I think the best way to optimise this would be to
implement a new pass that can optimise assignments to individual hard registers.

There are a number of existing passes that do similar optimisations, but which
I believe are unsuitable for this scenario for various reasons.  For example:

- cse1 can already optimise FPMR assignments within an extended basic block,
  but can't handle broader optimisations.
- pre (in gcse.c) doesn't work with assigning constant values, which would miss
  many potential usages.  It also has limits on how far code can be moved,
  based around ideas of register pressure that don't apply to the context of a
  single hard register that shouldn't be used by the register allocator for
  anything else.  Additionally, it doesn't run at -Os.
- hoist (also using gcse.c) only handles constant values, and only runs when
  optimising for size.  It also has the rest of the issues that pre does.
- mode_sw only handles a small finite set of modes.  The mode requirements are
  determined solely by the instructions that require the specific mode, so mode
  switches don't depend on the output of previous instructions.


My intention would be for the new pass to reuse ideas, and hopefully some of
the existing code, from the mode-switching and gcse passes.  In particular,
gcse.c (or its dependencies) has code that could identify when values assigned
to the FPMR are known to be the same (although we may not need the full CSE
capabilities of gcse.c), and mode-switching.cc knows how to globally optimise
mode assignments (and unlike gcse.c, doesn't use cautious heuristics to avoid
excessively increasing register pressure).
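
To make the idea of globally removing assignments whose value is already known
more concrete, here is a minimal sketch in Python.  It is a toy model, not GCC
code: the block names, the ("set", value) / ("use",) encoding and the function
names are all hypothetical, and it only removes fully redundant assignments
(it does not attempt the partial-redundancy placement that mode switching
performs).

```python
# Toy model (hypothetical, not GCC code): each basic block is a list of
# ("set", value) or ("use",) operations on one dedicated hard register.

TOP = object()      # no information yet (block not processed / unreachable)
UNKNOWN = object()  # register value differs between incoming paths


def meet(a, b):
    """Combine the register values known on two incoming paths."""
    if a is TOP:
        return b
    if b is TOP:
        return a
    return a if a == b else UNKNOWN


def known_value_on_entry(blocks, preds, entry):
    """Forward dataflow: which value is the register known to hold on
    entry to each block?  Iterates until nothing changes."""
    out = {name: TOP for name in blocks}
    inn = {name: TOP for name in blocks}
    changed = True
    while changed:
        changed = False
        for name, ops in blocks.items():
            if name == entry:
                value = UNKNOWN  # nothing is known on function entry
            else:
                value = TOP
                for pred in preds[name]:
                    value = meet(value, out[pred])
            inn[name] = value
            for op in ops:
                if op[0] == "set":
                    value = op[1]
            if value is not out[name] and value != out[name]:
                out[name] = value
                changed = True
    return inn


def remove_redundant_sets(blocks, preds, entry):
    """Drop every set that assigns the value the register already holds."""
    inn = known_value_on_entry(blocks, preds, entry)
    result = {}
    for name, ops in blocks.items():
        value = inn[name]
        kept = []
        for op in ops:
            if op[0] == "set":
                if op[1] == value:
                    continue  # register already holds this value
                value = op[1]
            kept.append(op)
        result[name] = kept
    return result


# A loop whose body re-assigns the value already set before the loop.
blocks = {
    "entry": [("set", 1), ("use",)],
    "header": [("use",)],
    "body": [("set", 1), ("use",)],
    "exit": [("set", 2), ("use",)],
}
preds = {"entry": [], "header": ["entry", "body"],
         "body": ["header"], "exit": ["header"]}
optimised = remove_redundant_sets(blocks, preds, "entry")
```

In this example the assignment in "body" is removed, because the register is
known to hold 1 on every path into it, while the assignment in "exit" is kept
because it changes the value.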

Initially the new pass would only apply to the AArch64 FPMR register, but in
future it could also be used for other hard registers with similar properties.

Does anyone have any comments on this approach, before I start writing any
code?

Can you explain in more detail why the mode-switching pass
infrastructure isn’t a good fit?  ISTR it already is customizable via
target hooks.

Agreed.  Mode switching seems to be the right pass to look at.

It is probably worth pointing out that mode switching is LCM-based (lazy code motion) and as such never speculates. Given the potential cost of a mode switch, failure to speculate may be a notable limitation (though the same would apply to the ideas Andrew floated above).
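
As a rough illustration of why that can matter, the toy Python sketch below counts how often a mode set executes for a loop in which only one conditional block needs the mode. Everything in it is hypothetical (block names, function names); it assumes that a non-speculating placement leaves the set in the conditional block, whereas a speculating placement hoists it to the loop preheader.

```python
# Toy model (hypothetical): count dynamic mode sets for two placements.

def loop_trace(iterations, takes_branch):
    """Return the sequence of blocks executed by a simple loop in which
    block "needs_mode" runs only on iterations where takes_branch(i)."""
    trace = ["preheader"]
    for i in range(iterations):
        trace.append("header")
        if takes_branch(i):
            trace.append("needs_mode")
        trace.append("latch")
    trace.append("exit")
    return trace


def count_mode_sets(trace, placement):
    """Count executions of the mode set, given the set of blocks that
    contain one."""
    return sum(1 for block in trace if block in placement)


NON_SPECULATIVE = {"needs_mode"}  # set stays where the mode is needed
SPECULATIVE = {"preheader"}       # set hoisted out of the loop

# Branch taken on every other iteration.
hot = loop_trace(100, lambda i: i % 2 == 0)
hot_non_spec = count_mode_sets(hot, NON_SPECULATIVE)
hot_spec = count_mode_sets(hot, SPECULATIVE)

# Branch never taken: here the hoisted set is pure overhead.
cold = loop_trace(100, lambda i: False)
cold_non_spec = count_mode_sets(cold, NON_SPECULATIVE)
cold_spec = count_mode_sets(cold, SPECULATIVE)
```

In the first trace the set runs 50 times without speculation and once with it; in the second it runs zero times without speculation and once with it, which is the cost a speculating pass accepts on paths that never needed the mode.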

This has recently come up in the RISC-V space due to needing VXRM assignments so that we can utilize the vaaddu add-with-averaging instructions. Placement of VXRM mode switches looks optimal from an LCM standpoint, but speculation can measurably improve performance. It was something like 2% on the BPI for x264. The k1/m1 chip in the BPI is almost certainly flushing its pipelines on the VXRM assignment.

I've got a hack here that I'll submit upstream at some point. Just not at the top of my list yet -- especially now that our uarch has been fixed to not flush its pipelines at VXRM assignments ;-)

jeff
