From: Alexey Dobriyan
> Sent: 14 September 2019 11:34
...
> +ENTRY(memset0_rep_stosq)
> +     xor     eax, eax
> +.globl memsetx_rep_stosq
> +memsetx_rep_stosq:
> +     lea     rsi, [rdi + rcx]
> +     shr     rcx, 3
> +     rep stosq
> +     cmp     rdi, rsi
> +     je      1f
> +2:
> +     mov     [rdi], al
> +     add     rdi, 1
> +     cmp     rdi, rsi
> +     jne     2b
> +1:
> +     ret

You can do the 'trailing bytes' first with a potentially misaligned store.
Something like (modulo asm syntax and argument ordering):
        lea     rsi, [rdi + rcx]        # rsi = one past the end
        shr     rcx, 3
        jrcxz   2f                      # Short buffer (< 8 bytes)
        mov     [rsi - 8], rax          # Trailing qword, possibly misaligned
        rep stosq                       # May overlap the store above - harmless
        ret
2:
        cmp     rdi, rsi                # Zero length - nothing to do
        je      3f
1:
        mov     [rdi], al
        add     rdi, 1
        cmp     rdi, rsi
        jne     1b
3:
        ret
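
For reference, the same idea in C (just a sketch, with names of my
own invention):

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	/* 'Trailing bytes first': for len >= 8 one possibly misaligned
	   8-byte store covers the tail, then whole qwords cover the
	   rest; the two regions may overlap, which is harmless for a
	   memset.  E.g. for len = 13 the tail store hits bytes 5..12
	   and the qword loop hits bytes 0..7. */
	static void memset0_sketch(unsigned char *p, size_t len)
	{
		static const uint64_t zero;	/* zero-initialised */
		unsigned char *end = p + len;

		if (len >= 8) {
			memcpy(end - 8, &zero, 8);	/* tail store */
			for (size_t n = len / 8; n; n--, p += 8)
				memcpy(p, &zero, 8);	/* the rep stosq */
		} else {
			while (p != end)		/* short buffer */
				*p++ = 0;
		}
	}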

The final byte loop can be one instruction shorter by counting a
negative index up to zero, so the 'add' also sets the flags for the
branch (rxx being whatever register holds the end pointer):
1:
        mov     [rxx + rdi], al         # rdi = -(bytes remaining)
        add     rdi, 1                  # sets ZF when the index reaches zero
        jnz     1b
        ret
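
In C terms the shape is (again a sketch, my names):

	#include <stddef.h>

	/* Negative-index byte loop: the 'add' that advances the index
	   also provides the termination test, saving the 'cmp'. */
	static void set_tail(unsigned char *end, size_t tail, unsigned char c)
	{
		for (ptrdiff_t i = -(ptrdiff_t)tail; i != 0; i++)
			end[i] = c;
	}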

Last I looked, 'jcxz' (spelt 'jrcxz' in 64-bit code) was OK on all
recent AMD and Intel CPUs.
OTOH 'loop' is horrid on Intel ones.

The same applies to the other versions.

I suspect it isn't worth optimising to realign misaligned buffers;
they are unlikely to occur often enough to matter.

I also think that gcc's __builtin version already does some of these
short-buffer optimisations.
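E.g. (a minimal test of my own; compile with -O2 and inspect the
generated code):

	#include <string.h>

	/* gcc treats this memset() as __builtin_memset; with a small
	   constant size it is normally expanded inline (a couple of
	   wide stores) rather than emitted as a call. */
	void clear16(void *p)
	{
		memset(p, 0, 16);
	}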

        David

