> > Perhaps someone is interested in the following thread from LKML:
> >
> > "[PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops"
> >
> > https://lore.kernel.org/lkml/20250605164733.737543-1-mjgu...@gmail.com/
> >
> > There are several PRs regarding memcpy/memset linked from the above message.
> >
> > Please also note a message from Linus from the above thread:
> >
> > https://lore.kernel.org/lkml/CAHk-=wg1qqlwkpyvxxznxwbot48--lkjucjjf8phdhrxv0u...@mail.gmail.com/
This is my understanding of the situation, please correct me where I am
wrong.  According to Linus, the calls in the kernel are more expensive than
elsewhere due to mitigations.  I wonder if -minline-all-stringops would make
sense here.

Linus writes about an alternate entry point for memcpy with a non-standard
calling convention, which we also discussed a few times in the past.  I
think having a calling convention for memset/memcpy that only clobbers
SI/DI/CX and nothing else (especially no SSE regs) makes sense.  This
should make offlined memcpy noticeably cheaper, especially when called from
loops that need SSE; the implementation can be done without clobbering
extra registers for small blocks, while it will have enough time to spill
for large ones.

The other patch does

  +KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
  +KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign

for non-native CPUs (so something we should fix for generic tuning).  This
is about our current default of rep stosq, which does not work well on
Intel hardware.  We do a loop for blocks up to 32 bytes and rep stosq up to
8k.  We now have X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB for Intel cores, but
no changes for generic yet (it is on my TODO to do some more testing on
Zen).

So I think we can do the following:

1) Decide whether to go with X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB or to
   replace rep_prefix_8_byte by unrolled_loop.

2) Fix the issue with repeated constants.  I.e. instead of

     movq $0, ....
     movq $0, ....
     ....
     movq $0, ....

   which we currently generate for memset fitting in CLEAR_RATIO, emit

     mov $0, tmpreg
     movq tmpreg, ....
     movq tmpreg, ....
     ....
     movq tmpreg, ....

   which will make memset sequences smaller.  (A reduced test case is
   appended below.)

   I agree with Richi that HJ's patch adding a new clear block expander is
   probably not the right place to solve the problem.  Ideally we should
   catch repeated constants more generally, since this appears elsewhere
   too.  I am not quite sure where it fits best.  We already have a machine
   specific pass that loads 0 into an SSE register, which is kind of
   similar to this.

3) Figure out what reasonable MOVE_RATIO/CLEAR_RATIO defaults are.

4) Possibly go with the entry point idea?  (A sketch of such an entry point
   is appended below as well.)

Honza
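
To make 2) concrete, here is a reduced test case.  It is only a sketch: the
32-byte size and the flags are illustrative, since the exact cutoff depends
on MOVE_RATIO/CLEAR_RATIO and the active tuning, and -mno-sse -mno-mmx just
stand in for a kernel-style build where vector stores are unavailable.

  /* Compile with e.g. -O2 -mno-sse -mno-mmx.  With current defaults the
     clear is typically expanded inline as a series of movq $0, N(%rdi)
     stores, each of which carries the immediate; loading zero into a
     scratch register once and storing that register instead would shorten
     every store.  */
  void
  clear_block (char *p)
  {
    __builtin_memset (p, 0, 32);
  }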
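
For 4), a minimal sketch of what such an entry point could look like.  The
name __memcpy_min_clobber and the choice of passing the count in %rcx
(rather than %rdx as in the SysV ABI) are made up here purely for
illustration.  The contract would be: destination in %rdi, source in %rsi,
byte count in %rcx, and nothing else clobbered (in particular no SSE
registers), so a caller in a vectorized loop does not have to spill around
the call.

  /* Illustrative only; rep movsb consumes exactly %rdi/%rsi/%rcx and
     touches nothing else.  A tuned implementation would branch on the
     size, but it can keep the small-block path equally clobber-free and
     spill only on the large-block path, where the copy itself dominates.  */
  asm (".globl __memcpy_min_clobber\n\t"
       ".type __memcpy_min_clobber, @function\n"
       "__memcpy_min_clobber:\n\t"
       "rep movsb\n\t"
       "ret\n\t"
       ".size __memcpy_min_clobber, .-__memcpy_min_clobber");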