Re: [PATCH 3/3] aarch64: Avoid redundant writes to FPMR

Kyrylo Tkachov Wed, 22 Jan 2025 05:30:10 -0800


> On 22 Jan 2025, at 13:53, Richard Sandiford <richard.sandif...@arm.com> wrote:
> 
> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> Hi Richard,
>> 
>>> On 22 Jan 2025, at 13:21, Richard Sandiford <richard.sandif...@arm.com> 
>>> wrote:
>>> 
>>> GCC 15 is the first release to support FP8 intrinsics.
>>> The underlying instructions depend on the value of a new register,
>>> FPMR.  Unlike FPCR, FPMR is a normal call-clobbered/caller-save
>>> register rather than a global register.  So:
>>> 
>>> - The FP8 intrinsics take a final uint64_t argument that
>>> specifies what value FPMR should have.
>>> 
>>> - If an FP8 operation is split across multiple functions,
>>> it is likely that those functions would have a similar argument.
>>> 
>>> If the object code has the structure:
>>> 
>>>   for (...)
>>>     fp8_kernel (..., fpmr_value);
>>> 
>>> then fp8_kernel would set FPMR to fpmr_value each time it is
>>> called, even though FPMR will already have that value for at
>>> least the second and subsequent calls (and possibly the first).
>>> 
>>> The working assumption for the ABI has been that writes to
>>> registers like FPMR can in general be more expensive than
>>> reads and so it would be better to use a conditional write like:
>>> 
>>>      mrs     tmp, fpmr
>>>      cmp     tmp, <value>
>>>      beq     1f
>>>      nsr     fpmr, <value>
>> 
>> Typo “msr” here and in the comment in the code.
> 
> Oops, thanks, will fix.
> 
>> [...]
>>> @@ -1883,6 +1884,44 @@ (define_split
>>>  }
>>> )
>>> 
>>> +;; The preferred way of writing to the FPMR is to test whether it already
>>> +;; has the desired value and branch around the write if so.  This reduces
>>> +;; the number of redundant FPMR writes caused by ABI boundaries, such as 
>>> in:
>>> +;;
>>> +;;    for (...)
>>> +;;      fp8_kernel (..., fpmr_value);
>>> +;;
>>> +;; Without this optimization, fp8_kernel would set FPMR to fpmr_value each
>>> +;; time that it is called.
>>> +;;
>>> +;; We do this as a split so that hardreg_pre can optimize the moves first.
>>> +(define_split
>>> +  [(set (reg:DI FPM_REGNUM)
>>> +        (match_operand:DI 0 "aarch64_reg_or_zero"))]
>>> +  "TARGET_FP8 && !TARGET_CHEAP_FPMR_WRITE && can_create_pseudo_p ()"
>>> +  [(const_int 0)]
>>> +  {
>>> +    auto label = gen_label_rtx ();
>>> +    rtx current = copy_to_reg (gen_rtx_REG (DImode, FPM_REGNUM));
>>> +    rtx cond = gen_rtx_EQ (VOIDmode, current, operands[0]);
>>> +    emit_jump_insn (gen_cbranchdi4 (cond, current, operands[0], label));
>> 
>> Do you think it’s worth marking this jump as likely?
>> In some other expand code in the backend where we emit jumps we sometimes 
>> use aarch64_emit_unlikely_jump.
> 
> Ah, yeah, I should have said that I'd wondered about that.  But in the
> end it didn't seem appropriate.  Given that hardreg_pre should remove
> local instances of redundancy, we don't really have much information
> about whether the branch is likely or unlikely.  I think instead the
> hope/expectation is that the branch has a predictable pattern.


Ok, thanks for clarifying.
Kyrill

> 
> Thanks,
> Richard

Re: [PATCH 3/3] aarch64: Avoid redundant writes to FPMR

Reply via email to