> 
> Perhaps someone is interested in the following thread from LKML:
> 
> "[PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops"
> 
> https://lore.kernel.org/lkml/20250605164733.737543-1-mjgu...@gmail.com/
> 
> There are several PRs regarding memcpy/memset linked from the above message.
> 
> Please also note a message from Linus from the above thread:
> 
> https://lore.kernel.org/lkml/CAHk-=wg1qqlwkpyvxxznxwbot48--lkjucjjf8phdhrxv0u...@mail.gmail.com/

This is my understanding of the situation.
Please correct me where I am wrong.

According to Linus, calls in the kernel are more expensive than elsewhere
due to the mitigations.  I wonder if -minline-all-stringops would make
sense here.
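
For illustration, a minimal sketch of what that option does (the file name
and the exact expansion are just an example; what GCC actually emits depends
on -mtune and the selected stringop strategy):

  /* str.c -- with -minline-all-stringops GCC expands even the cases it
     would normally hand to the library, so no call (and, in the kernel,
     no return thunk) is needed. */
  #include <string.h>

  void copy(void *dst, const void *src, unsigned long n)
  {
          memcpy(dst, src, n);   /* variable size: normally a libcall */
  }

  /* gcc -O2 -minline-all-stringops -S str.c
     expands the memcpy inline (e.g. as rep movs) instead of calling it. */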

Linus writes about an alternate entry point for memcpy with a non-standard
calling convention, which we also discussed a few times in the past.
I think having a calling convention for memset/memcpy that only clobbers
SI/DI/CX and nothing else (especially no SSE regs) makes sense.

This should make the out-of-line memcpy noticeably cheaper, especially when
it is called from loops that need SSE registers.  The implementation can
avoid clobbering extra registers for small blocks, while for large blocks
there is enough time to amortize the spills.
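
To make the idea concrete, here is a rough caller-side sketch; the symbol
__memcpy_si_di_cx and the register choice are hypothetical (this is not an
existing kernel or glibc entry point), and putting a call inside inline asm
assumes there is no red zone, as in kernel builds with -mno-red-zone:

  /* Hypothetical alternate entry point: arguments in RDI/RSI/RCX (the rep
     movs registers) with the guarantee that nothing else, in particular no
     SSE register, is clobbered.  Only the caller-side glue is shown; the
     callee would avoid extra registers for small blocks and spill what it
     needs for large ones. */
  static inline void *memcpy_si_di_cx(void *dst, const void *src,
                                      unsigned long n)
  {
          void *ret = dst;
          asm volatile ("call __memcpy_si_di_cx"   /* hypothetical symbol */
                        : "+D" (dst), "+S" (src), "+c" (n)
                        : : "memory", "cc");
          return ret;
  }

The point is that the caller's register allocator only has to give up
SI/DI/CX around the call, so live SSE values in a surrounding loop would not
need to be spilled.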

The other patch adds
+KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
for non-native CPUs (so this is something we should fix in generic tuning).
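
For reference, the strategy string is a comma-separated list of
alg:max_size:align triples, so the lines above mean: use an unrolled loop for
sizes up to 256 bytes and a library call for everything larger.  A quick way
to see the effect (the file name is just an example):

  /* memset-test.c */
  #include <string.h>

  void clear_small(char *p) { memset(p, 0, 128);  }  /* <= 256: unrolled loop */
  void clear_large(char *p) { memset(p, 0, 4096); }  /* >  256: call memset   */

  /* gcc -O2 -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign \
         -S memset-test.c */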

This is about our current default of rep stosq, which does not work well
on Intel hardware.  We emit a loop for blocks up to 32 bytes and rep stosq
for blocks up to 8k.

We now have X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB for Intel cores, but
no changes for generic tuning yet (it is on my TODO list to do some more
testing on Zen).

So I think we can do the following:
  1) decide whether to go with X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB
     or replace rep_prefix_8_byte with unrolled_loop
  2) fix the issue with repeated constants, i.e. instead of

       movq $0, ....
       movq $0, ....
       ....
       movq $0, ....

      which we currently generate for memsets fitting in CLEAR_RATIO, emit

       mov $0, tmpreg
       movq tmpreg, ....
       movq tmpreg, ....
       ....
       movq tmpreg, ....

      which will make memset sequences smaller (see the sketch after this
      list).  I agree with Richi that HJ's patch adding a new clear-block
      expander is probably not the right place to solve the problem.

      Ideally we should catch repeated constants more generally, since
      this appears elsewhere too.
      I am not quite sure where it fits best.  We already have a
      machine-specific pass that loads 0 into an SSE register, which is
      kind of similar to this.
  3) figure out what reasonable MOVE_RATIO/CLEAR_RATIO defaults are
  4) possibly go with the entry point idea?
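
The sketch referred to in 2) above: a small structure clear that fits in
CLEAR_RATIO, together with the two expansions (register choice and offsets
are illustrative only):

  /* clear.c -- zeroing a small aggregate takes the inlined-clear path */
  struct s { long a, b, c, d; };

  void clear(struct s *p)
  {
          __builtin_memset(p, 0, sizeof *p);
  }

  /* today (every store encodes the 4-byte immediate $0):
          movq    $0, (%rdi)
          movq    $0, 8(%rdi)
          movq    $0, 16(%rdi)
          movq    $0, 24(%rdi)
     proposed (materialize the constant once and reuse the register):
          xorl    %eax, %eax
          movq    %rax, (%rdi)
          movq    %rax, 8(%rdi)
          movq    %rax, 16(%rdi)
          movq    %rax, 24(%rdi)  */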
Honza
