memset inline strategies for Ice Lake

Hongyu Wang via Gcc-patches Wed, 31 Mar 2021 22:57:48 -0700

> > So in neither of those scenarios testing maxsize=minsize alone makes too
> > much sense to me... What was the original motivation for differentiating
> > between precisely known size?


Here is a case where maxsize is small: https://godbolt.org/z/489Tf7ssj

typedef unsigned char e_u8;
#define MAXBC 8
void MixColumn(e_u8 a[4][MAXBC], e_u8 BC)
{
  e_u8 b[4][MAXBC];
  int i, j;

  for(i = 0; i < 4; i++)
    for(j = 0; j < BC; j++) a[i][j] = b[i][j];
}

Here BC is an unsigned char, so maxsize is known to be small (at most 255).
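
To make the size range concrete: assuming the inner loop is recognized as a
block copy (which the rep movsb in the assembly further down suggests), each
row amounts to one variable-length copy. The sketch below is only an
illustration, not compiler output; copy_rows and its src parameter are
hypothetical names, introduced so that the source is initialized.

```c
#include <string.h>

typedef unsigned char e_u8;
#define MAXBC 8

/* Hypothetical equivalent of the inner loop: one variable-length copy per
   row.  BC is an unsigned char, so its type only bounds the length to
   [0, 255]; nothing in the type says that it is <= MAXBC.  */
void copy_rows(e_u8 dst[4][MAXBC], e_u8 src[4][MAXBC], e_u8 BC)
{
  for (int i = 0; i < 4; i++)
    memcpy(dst[i], src[i], BC);
}
```

The caller is still responsible for passing BC <= MAXBC; the point is only
that the expander sees a small upper bound but no exact size.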

If we set stringop_alg to rep_1_byte, the generated code looks like this:

 movzbl  %sil, %r8d
 movq    %rdi, %rdx
 leaq    -40(%rsp), %rax
 movq    %r8, %r9
 leaq    -8(%rsp), %r10
.L2:
 testb   %r9b, %r9b
 je      .L5
 movq    %rdx, %rdi
 movq    %rax, %rsi
 movq    %r8, %rcx
 rep movsb
.L5:
 addq    $8, %rax
 addq    $8, %rdx
 cmpq    %r10, %rax
 jne     .L2
 ret

In our tests this is much slower than current trunk, because rep movsb
triggers machine clear events. On current trunk, such a small size is
handled in the mov loop epilogue and rep movsq is never executed.

So we disabled inlining for unknown sizes to avoid potential issues like this.
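
The policy described above can be sketched roughly as follows. This is only
an illustration of the idea, not the code in the patch: struct size_info,
pick_copy_alg, the enum names and the rep_movsb_limit parameter are all made
up for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; GCC's real strategy tables differ.  */
enum copy_alg { ALG_OTHER, ALG_REP_MOVSB };

struct size_info {
  size_t min_size;   /* smallest possible byte count */
  size_t max_size;   /* largest possible byte count */
};

/* Pick rep movsb only when the size is known exactly and does not exceed
   rep_movsb_limit (an assumed tuning knob); otherwise use something else,
   such as a loop or a library call.  */
enum copy_alg pick_copy_alg(struct size_info s, size_t rep_movsb_limit)
{
  bool size_known = (s.min_size == s.max_size);
  if (size_known && s.max_size <= rep_movsb_limit)
    return ALG_REP_MOVSB;
  return ALG_OTHER;
}
```

With this sketch, the MixColumn copy (size anywhere in 0..255) does not get
rep movsb, while a copy of exactly 64 bytes does.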

H.J. Lu via Gcc-patches <gcc-patches@gcc.gnu.org> wrote on Thu, Apr 1, 2021 at 1:55 AM:
>
> On Wed, Mar 31, 2021 at 10:43 AM Jan Hubicka <hubi...@ucw.cz> wrote:
> >
> > > > Reading through the optimization manual it seems that movsb is fast for
> > > > small blocks no matter if the size is hard wired. In that case you
> > > > probably want to check whether max_size or expected_size is known to be
> > > > small rather than max_size == min_size and both being small.
> > > >
> > > > But it depends on what CPU really does.
> > > > Honza
> > >
> > > For small data size, rep movsb is faster only under certain conditions.
> > > We can continue fine tuning rep movsb.
> >
> > OK, I however wonder why you need condition maxsize=minsize.
> >  - If CPU is looking for movl $cst, %rcx then we probably want to be
> >    sure that it is not moved away from rep movsb by adding fused pattern
> >  - If rep movsb is slower than loop for very small blocks then you want
> >    to set lower bound on minsize & expected size, but you do not need
> >    to require maxsize=minsize
> >  - If rep movsb is slower than sequence of moves for small blocks then
> >    one needs to tweak move by pieces
> >  - If rep movsb is slower for larger blocks then you want to test
> >    maxsize and expected size
> > So in neither of those scenarios testing maxsize=minsize alone makes too
> > much sense to me... What was the original motivation for differentiating
> > between precisely known size?
> >
> > I am mostly curious because it is not that uncommon to have small maxsize
> > because we are able to track the object size and using short sequence
> > for those would be nice.
> >
> > Having minsize non-trivial may not be that uncommon these days either
> > given that we track value ranges (and under assumption that
> > memcpy/memset expanders were updated to take these into account).
> >
>
> Hongyu has done some analysis on this.  Hongyu, can you share what
> you got?
>
> Thanks.
>
> --
> H.J.

-- 
Regards,

Hongyu, Wang

Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
