From: Alexey Dobriyan
> Sent: 14 September 2019 11:34
...
> +ENTRY(memset0_rep_stosq)
> +	xor	eax, eax
> +.globl memsetx_rep_stosq
> +memsetx_rep_stosq:
> +	lea	rsi, [rdi + rcx]
> +	shr	rcx, 3
> +	rep stosq
> +	cmp	rdi, rsi
> +	je	1f
> +2:
> +	mov	[rdi], al
> +	add	rdi, 1
> +	cmp	rdi, rsi
> +	jne	2b
> +1:
> +	ret
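For reference, the quoted routine behaves roughly like this C sketch
(function name made up for illustration, not the kernel code): clear
len/8 quadwords with 'rep stosq', then finish the 0-7 remaining bytes
one at a time.

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	static void memset0_sketch(unsigned char *p, size_t len)
	{
		const uint64_t zero = 0;
		unsigned char *end = p + len;	/* lea rsi, [rdi + rcx] */

		/* rep stosq: len/8 quadword stores */
		for (size_t n = len >> 3; n; n--) {
			memcpy(p, &zero, 8);
			p += 8;
		}
		while (p != end)		/* trailing byte loop */
			*p++ = 0;
	}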
You can do the 'trailing bytes' first with a potentially misaligned store.
Something like this (modulo asm syntax and argument ordering; a C sketch
of the idea is appended below):

	lea	rsi, [rdi + rcx]	# rsi = end of buffer
	shr	rcx, 3
	jcxz	1f			# Short buffer
	mov	[rsi - 8], rax		# Misaligned store covers the tail
	rep	stosq
	ret
1:
	mov	[rdi], al
	add	rdi, 1
	cmp	rdi, rsi
	jne	1b
	ret

The final loop can be one instruction shorter by arranging to do:

1:
	mov	[rdi + rxx], al		# rxx = end, rdi = -(bytes left)
	add	rdi, 1
	jnz	1b			# 'add' sets ZF when rdi hits zero
	ret

Last I looked 'jcxz' was 'ok' on all recent AMD and Intel CPUs.
OTOH 'loop' is horrid on Intel ones.

The same applies to the other versions.

I suspect it isn't worth optimising to realign misaligned buffers;
they are unlikely to happen often enough.

I also think that gcc's __builtin version does some of the short-buffer
optimisations already.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
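A minimal C sketch of both ideas above, assuming len is the byte count
(names are made up for illustration; this is not the kernel's memset).
For len >= 8 a single, possibly misaligned, 8-byte store covers the
tail, so 'rep stosq' needs no byte loop after it; short buffers take
the byte loop, written with a negative index counting up to zero so
the increment itself provides the loop-closing test (tested first here
so that len == 0 stays a no-op):

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	static void memset0_tail_first(unsigned char *p, size_t len)
	{
		const uint64_t zero = 0;

		if (len < 8) {				/* the jcxz path */
			unsigned char *end = p + len;	/* 'rxx' above */
			ptrdiff_t i = -(ptrdiff_t)len;	/* 'rdi' above */

			while (i != 0)
				end[i++] = 0;
			return;
		}
		memcpy(p + len - 8, &zero, 8);		/* misaligned tail store */
		for (size_t n = len >> 3; n; n--) {	/* rep stosq */
			memcpy(p, &zero, 8);
			p += 8;
		}
	}

When len >= 8 the quadword loop and the tail store can overlap by up to
7 bytes, which is harmless since both write the same fill value.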