Hi Brian,

Thanks for replying.

Is there a way to simplify to rep movsb (or rep movsw, since the array is uint16_t) without using assembly code? The code is currently 100% C and I would prefer to avoid a mix of C and assembly code. Also, it is unclear to me that even rep movsw would be faster than my bloated C code, which generates assembly code that moves 8 words in 5 instructions:

.L25:
        movdqu  -16(%rax), %xmm1
        subq    $16, %rax
        movups  %xmm1, 2(%rax)
        cmpq    %rdx, %rax
        jnb     .L25

I am currently evaluating the stash-and-move method for the uint16_t data at the start that can't be moved in 16-byte chunks. It uses two extra 64-bit registers, but that may be better than having the compiler move addresses into registers for a memmove call that moves the last 2 to 14 bytes (which is what the compiler does now). I didn't think of that option until I looked at the memmove assembly code.

I thought alignment might be an issue, but noticed that the memmove assembly code does not perform any alignment. It first checks the number of bytes to move. If that is less than 8, it jumps to the movsb section. If the length is 8 or more, it stashes the highest-address bytes that need to be moved. Then it moves the data with rep movsq, starting at the beginning of the array (or near the end, for backward moves), until the remaining length is less than 8 bytes (or 0 bytes for backward moves). Finally, it uses the stashed data to finish the move.

The addresses for the rep movsq could have any alignment that is consistent with the alignment of the data being moved. Since this code is moving uint16_t's, the rep movsq addresses are only guaranteed 2-byte alignment. At least if I am reading the assembly code correctly....

Best Regards,
Kennon

> On 02/27/2026 11:49 AM PST Brian Inglis via Cygwin wrote:
>
> Hi Kennon,
>
> Some perf reports and analysis imply that backward moves (with overlap?)
> are no faster than straight rep movsb on some CPUs, so it may be better to just
> simplify to that, unless you want to stash the final element(s) to be moved
> out of the way in register(s), and use multiple registers in unrolled wide
> moves for the aligned portion?
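
For reference, here is a rough C-only sketch of the stash-and-move idea discussed in this thread. It is not the actual code under discussion: the function name shift_up_one, the CHUNK constant, and the head buffer are all hypothetical, and it assumes the operation is a one-element upward shift of a uint16_t array (as the 2(%rax) store offset in the loop suggests) with the caller guaranteeing room for n + 1 elements. The first 16 bytes are stashed before the backward chunk loop runs, and one final 16-byte store from the stash covers whatever 0 to 7 leading elements the loop did not reach.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { CHUNK = 8 };  /* 8 x uint16_t = 16 bytes, one SSE register's worth */

/* Hypothetical helper: move a[0..n-1] up to a[1..n].
 * Assumption: the caller guarantees that a[n] is writable. */
static inline void shift_up_one(uint16_t *a, size_t n)
{
    assert(a != NULL);

    if (n < CHUNK) {
        /* Too short to stash a full chunk: plain backward element loop. */
        for (size_t i = n; i > 0; i--)
            a[i] = a[i - 1];
        return;
    }

    /* Stash the lowest CHUNK elements before anything overwrites them. */
    uint16_t head[CHUNK];
    memcpy(head, a, sizeof head);

    /* Move whole chunks from the high end down. Going through tmp keeps the
     * overlapping copy well defined; compilers commonly lower these
     * fixed-size memcpy calls to unaligned 16-byte loads and stores, but
     * that is not guaranteed. */
    size_t i = n;
    while (i >= CHUNK) {
        uint16_t tmp[CHUNK];
        i -= CHUNK;
        memcpy(tmp, a + i, sizeof tmp);
        memcpy(a + i + 1, tmp, sizeof tmp);
    }

    /* The i (< CHUNK) leading elements are still unmoved. Storing the whole
     * stash at a + 1 moves them, and rewrites a[i+1..CHUNK] with values
     * they already hold. */
    memcpy(a + 1, head, sizeof head);
}
```

The final store deliberately overlaps elements the loop has already moved, which avoids both a variable-length tail copy and a memmove call for the last 2 to 14 bytes. Whether this beats rep movsq or rep movsb on a given CPU would still need to be measured.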

