> On 22 Jan 2025, at 13:53, Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> Hi Richard,
>>
>>> On 22 Jan 2025, at 13:21, Richard Sandiford <richard.sandif...@arm.com>
>>> wrote:
>>>
>>> GCC 15 is the first release to support FP8 intrinsics.
>>> The underlying instructions depend on the value of a new register,
>>> FPMR. Unlike FPCR, FPMR is a normal call-clobbered/caller-save
>>> register rather than a global register. So:
>>>
>>> - The FP8 intrinsics take a final uint64_t argument that
>>>   specifies what value FPMR should have.
>>>
>>> - If an FP8 operation is split across multiple functions,
>>>   it is likely that those functions would have a similar argument.
>>>
>>> If the object code has the structure:
>>>
>>>   for (...)
>>>     fp8_kernel (..., fpmr_value);
>>>
>>> then fp8_kernel would set FPMR to fpmr_value each time it is
>>> called, even though FPMR will already have that value for at
>>> least the second and subsequent calls (and possibly the first).
>>>
>>> The working assumption for the ABI has been that writes to
>>> registers like FPMR can in general be more expensive than
>>> reads and so it would be better to use a conditional write like:
>>>
>>>       mrs     tmp, fpmr
>>>       cmp     tmp, <value>
>>>       beq     1f
>>>       nsr     fpmr, <value>
>>
>> Typo “msr” here and in the comment in the code.
>
> Oops, thanks, will fix.
>
>> [...]
>>> @@ -1883,6 +1884,44 @@ (define_split
>>>   }
>>> )
>>>
>>> +;; The preferred way of writing to the FPMR is to test whether it already
>>> +;; has the desired value and branch around the write if so. This reduces
>>> +;; the number of redundant FPMR writes caused by ABI boundaries, such as in:
>>> +;;
>>> +;;    for (...)
>>> +;;      fp8_kernel (..., fpmr_value);
>>> +;;
>>> +;; Without this optimization, fp8_kernel would set FPMR to fpmr_value each
>>> +;; time that it is called.
>>> +;;
>>> +;; We do this as a split so that hardreg_pre can optimize the moves first.
>>> +(define_split
>>> +  [(set (reg:DI FPM_REGNUM)
>>> +       (match_operand:DI 0 "aarch64_reg_or_zero"))]
>>> +  "TARGET_FP8 && !TARGET_CHEAP_FPMR_WRITE && can_create_pseudo_p ()"
>>> +  [(const_int 0)]
>>> +  {
>>> +    auto label = gen_label_rtx ();
>>> +    rtx current = copy_to_reg (gen_rtx_REG (DImode, FPM_REGNUM));
>>> +    rtx cond = gen_rtx_EQ (VOIDmode, current, operands[0]);
>>> +    emit_jump_insn (gen_cbranchdi4 (cond, current, operands[0], label));
>>
>> Do you think it’s worth marking this jump as likely?
>> In some other expand code in the backend where we emit jumps we sometimes
>> use aarch64_emit_unlikely_jump.
>
> Ah, yeah, I should have said that I'd wondered about that. But in the
> end it didn't seem appropriate. Given that hardreg_pre should remove
> local instances of redundancy, we don't really have much information
> about whether the branch is likely or unlikely. I think instead the
> hope/expectation is that the branch has a predictable pattern.

Ok, thanks for clarifying.
Kyrill

>
> Thanks,
> Richard
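
For anyone following the thread, the effect of the conditional write on the
"for (...) fp8_kernel (..., fpmr_value);" case can be sketched in plain C.
This is only a software model under stated assumptions: model_fpmr,
model_write_fpmr, set_fpmr_always, set_fpmr_if_changed and count_writes are
hypothetical names invented for illustration, not part of the patch or of
GCC, and nothing here touches the real FPMR register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of FPMR: a plain variable plus a counter of
   how many times it has been written.  Real code would use MRS/MSR.  */
static uint64_t model_fpmr;
static unsigned model_fpmr_writes;

static void
model_write_fpmr (uint64_t value)
{
  model_fpmr = value;
  model_fpmr_writes++;
}

/* Unconditional form: write the register on every call.  */
static void
set_fpmr_always (uint64_t value)
{
  model_write_fpmr (value);
}

/* Conditional form: read the current value, compare it with the requested
   one, and skip the write when they already match.  */
static void
set_fpmr_if_changed (uint64_t value)
{
  if (model_fpmr != value)
    model_write_fpmr (value);
}

/* Count the modelled FPMR writes when a kernel is called CALLS times with
   the same VALUE.  SET_FPMR stands in for the FPMR setup that each call to
   fp8_kernel would perform on entry.  */
static unsigned
count_writes (void (*set_fpmr) (uint64_t), unsigned calls, uint64_t value)
{
  /* Start from a different value so that the first call has to write.  */
  model_fpmr = ~value;
  model_fpmr_writes = 0;
  for (unsigned i = 0; i < calls; i++)
    set_fpmr (value);
  /* Whichever strategy is used, the model must end with VALUE whenever at
     least one call was made.  */
  assert (calls == 0 || model_fpmr == value);
  return model_fpmr_writes;
}
```

In this model, count_writes (set_fpmr_always, 8, 1) reports 8 writes and
count_writes (set_fpmr_if_changed, 8, 1) reports 1. The model only counts
writes; it says nothing about the relative cost of a real FPMR read, write
or branch, which is the part the ABI assumption is about.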