On Tue, Apr 6, 2021 at 2:51 AM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> > > Do you know which of the three changes (preferring rep stosb, the
> > > CLEAR_RATIO change and the algorithm choice change) causes the two
> > > speedups on EEMBC?
> >
> > An extracted testcase from nnet_test is at https://godbolt.org/z/c8KdsohTP
> >
> > This loop is transformed to builtin_memcpy and builtin_memset with size 280.
> >
> > The current strategy for Skylake is {512, unrolled_loop, false} for
> > such sizes, so it generates unrolled loops with mov, while the patch
> > generates a memcpy/memset libcall and uses vector moves.
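
For illustration, here is a minimal sketch of the kind of loop involved
(hypothetical function names; the bound 70 is chosen so that
70 * sizeof (int) == 280 bytes, matching the size above):

void
copy_weights (int *dst, const int *src)
{
  /* Loop distribution can replace this with
     __builtin_memcpy (dst, src, 280).  */
  for (int i = 0; i < 70; i++)
    dst[i] = src[i];
}

void
zero_weights (int *dst)
{
  /* Likewise __builtin_memset (dst, 0, 280).  */
  for (int i = 0; i < 70; i++)
    dst[i] = 0;
}

The resulting builtins then go through the x86 stringop expansion
strategy discussed here.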
>
> This is good - I originally set the table based on this
> micro-benchmarking script, and apparently the glibc in use at that
> time had a more expensive memcpy for small blocks.
>
> One thing to consider, however, is that calling an external memcpy
> also has the additional cost of clobbering all caller-saved registers.
> This is especially painful for code that uses SSE, since everything
> then needs to go to the stack.  So I am not completely sure how
> representative the micro-benchmark is in this respect, since it does
> not use any SSE and its register pressure is generally small.
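
For instance (a hypothetical sketch; the function name is made up), a
live vector value must be spilled around the call because the SysV
x86-64 ABI makes all xmm registers caller-saved:

#include <immintrin.h>
#include <string.h>

__m128
accumulate_and_copy (float *dst, const float *src, __m128 acc)
{
  acc = _mm_add_ps (acc, _mm_loadu_ps (src));
  /* The libcall clobbers all xmm registers, so ACC has to go to the
     stack here and be reloaded afterwards.  */
  memcpy (dst, src, 280);
  return _mm_add_ps (acc, _mm_loadu_ps (dst));
}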
>
> So with the current glibc it seems a libcall is a win for blocks
> larger than 64 or 128 bytes, at least when register pressure is not
> high.  In this respect your change looks good.
> > >
> > > My patch generates "rep movsb" only in very limited cases (see the
> > > sketch after this list):
> > >
> > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> > >    load and store for up to 16 * 16 (256) bytes when the data size is
> > >    fixed and known.
> > > 2. Inline only if data size is known to be <= 256.
> > >    a. Use "rep movsb/stosb" with a simple code sequence if the data size
> > >       is a constant.
> > >    b. Use loop if data size is not a constant.
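
A sketch of what these rules mean in practice (hypothetical function
names; the exact expansion depends on -march/-mtune and the final
strategy tables):

#include <string.h>

void
copy_const (char *d, const char *s)
{
  /* Constant size <= 256: inlined.  With vector moves available,
     MOVE_RATIO == 17 covers up to 16 * 16 = 256 bytes; with -mno-sse
     a "rep movsb" sequence may be used instead (rule 2a).  */
  memcpy (d, s, 240);
}

void
copy_var (char *d, const char *s, size_t n)
{
  /* Non-constant size: a loop when the size is known to be <= 256,
     otherwise the libcall path (rule 2b).  */
  memcpy (d, s, n);
}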
>
> Aha, this is very hard to read from the algorithm descriptor.  So we
> still have the check that maxsize == minsize and use rep movsb only
> for constant-sized blocks when the corresponding TARGET macro is
> defined.
>
> I think it would be more readable if we introduced rep_1_byte_constant.
> The descriptor is supposed to read as a sequence of rules where the
> first one that matches applies.  It is not obvious that we have another
> TARGET_* macro that makes rep_1_byte be ignored in some cases.
> (The TARGET macro will also interfere with the micro-benchmarking
> script.)
>
> Still I do not understand why a compile-time constant size makes rep
> movsb/stosb better than a loop.  Is the CPU special-casing it at decode
> time, requiring an explicit mov instruction?  Or is it only because rep
> movsb is not good for blocks smaller than 128 bits?

Non-constant "rep movsb" triggers more machine clear events:

https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/mo-machine-clear-overhead.html

in hot loops of some workloads.
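
A hypothetical sketch of the kind of hot loop where this shows up (the
function is made up; the point is only that the copy size is not a
compile-time constant):

#include <string.h>

void
pack_records (char *dst, const char *src, const size_t *len, int n)
{
  for (int i = 0; i < n; i++)
    {
      /* A non-constant "rep movsb" expansion here is where the extra
         machine clears were observed in some workloads.  */
      memcpy (dst, src, len[i]);
      dst += len[i];
      src += len[i];
    }
}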

> > >
> > > As a result,  "rep stosb" is generated only when 128 < data size < 256
> > > with -mno-sse.
> > >
> > > > Do you have some data showing blocks of size 8...256 being
> > > > faster with rep1 compared to an unrolled loop, perhaps on more
> > > > real-world benchmarks?
> > >
> > > "rep movsb" isn't generated with my patch in this case since
> > > MOVE_RATIO == 17 can copy up to 16 * 16 (256) bytes with
> > > XMM registers.
>
> OK, so I guess:
>   {libcall,
>    {{256, rep_1_byte, true},
>     {256, unrolled_loop, false},
>     {-1, libcall, false}}},
>   {libcall,
>    {{256, rep_1_loop, true},
>     {256, unrolled_loop, false},
>     {-1, libcall, false}}}};
>
> may still perform better, but the difference between loop and unrolled
> loop is within a 10% margin.
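
For reference, the descriptor layout, paraphrased from
gcc/config/i386/i386.h (check the real header for the authoritative
definition; the enumerator list below is abbreviated, and the
rep_1_byte / rep_1_loop spellings above look like proposed names rather
than the existing rep_prefix_1_byte-style enumerators):

#define MAX_STRINGOP_ALGS 4

enum stringop_alg
{
  no_stringop, libcall, rep_prefix_1_byte, loop, unrolled_loop
  /* ..., vector_loop, etc.  */
};

struct stringop_algs
{
  enum stringop_alg unknown_size;   /* used when the size is unknown */
  struct stringop_strategy
  {
    int max;                        /* upper size bound; -1 = unbounded */
    enum stringop_alg alg;          /* algorithm for sizes up to max */
    int noalign;                    /* nonzero: skip alignment prologue */
  } size[MAX_STRINGOP_ALGS];
};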
>
> So I guess the patch is OK, and we should look into cleaning up the
> descriptors.  I can make a patch for that once I understand the logic
> above.

I am checking in my patch.  We can improve it for GCC 12.  We will also
revisit:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90773

for GCC 12.

Thanks.

-- 
H.J.
