memset inline strategies for Skylake family CPUs

Jan Hubicka Mon, 05 Apr 2021 14:14:17 -0700

> >  /* skylake_cost should produce code tuned for Skylake familly of CPUs.  */
> >  static stringop_algs skylake_memcpy[2] =   {
> > -  {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > -  {libcall, {{16, loop, false}, {512, unrolled_loop, false},
> > -             {-1, libcall, false}}}};
> > +  {libcall,
> > +   {{256, rep_prefix_1_byte, true},
> > +    {256, loop, false},
> > +    {-1, libcall, false}}},
> > +  {libcall,
> > +   {{256, rep_prefix_1_byte, true},
> > +    {256, loop, false},
> > +    {-1, libcall, false}}}};
> >
> >  static stringop_algs skylake_memset[2] = {
> > -  {libcall, {{6, loop_1_byte, true},
> > -             {24, loop, true},
> > -             {8192, rep_prefix_4_byte, true},
> > -             {-1, libcall, false}}},
> > -  {libcall, {{24, loop, true}, {512, unrolled_loop, false},
> > -             {-1, libcall, false}}}};
> > +  {libcall,
> > +   {{256, rep_prefix_1_byte, true},
> > +    {256, loop, false},
> > +    {-1, libcall, false}}},
> > +  {libcall,
> > +   {{256, rep_prefix_1_byte, true},
> > +    {256, loop, false},
> > +    {-1, libcall, false}}}};
> >
> 
> If there are no objections, I will check it in on Wednesday.
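
For readers not familiar with these tables, here is a minimal sketch of how I read them: entries are scanned in order, and the first entry whose size limit covers the block (or is -1) and whose algorithm can be used wins. This is a simplified model, not GCC's actual selection code. The names pick_alg, skylake_like, ALL_USABLE and NO_REP are made up for illustration, and the "usable" mask stands in for whatever conditions may rule an algorithm out. As far as I recall, the two array elements in the patch are the 32-bit and 64-bit variants; the sketch models only one of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the algorithm names used in the quoted patch.  */
enum alg { LIBCALL, REP_PREFIX_1_BYTE, LOOP, ALG_COUNT };

/* One {max, alg, noalign} triple from the table.  */
struct size_entry { long max; enum alg alg; bool noalign; };

/* Simplified layout: algorithm for unknown size, then the size table.  */
struct strategy {
  enum alg unknown_size;
  struct size_entry size[3];
};

/* Same shape as the new entries proposed in the patch.  */
static const struct strategy skylake_like = {
  LIBCALL,
  {{256, REP_PREFIX_1_BYTE, true},
   {256, LOOP, false},
   {-1, LIBCALL, false}}
};

/* Illustrative masks, indexed by enum alg.  */
static const bool ALL_USABLE[ALG_COUNT] = { true, true, true };
static const bool NO_REP[ALG_COUNT] = { true, false, true };

/* Return the first usable algorithm whose limit covers SIZE.
   A negative SIZE means the size is not known at compile time.  */
static enum alg pick_alg(const struct strategy *s, long size,
                         const bool *usable)
{
  assert(s != NULL && usable != NULL);
  if (size < 0)
    return s->unknown_size;
  for (size_t i = 0; i < 3; i++) {
    const struct size_entry *e = &s->size[i];
    if ((e->max == -1 || size <= e->max) && usable[e->alg])
      return e->alg;
  }
  return LIBCALL;
}
```

Under this model, a 64-byte block gets rep_prefix_1_byte, the second 256 entry only matters when the rep variant is unavailable, and anything above 256 bytes goes to the library call.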


On my Skylake notebook, if I run the benchmarking script, I get:

jan@skylake:~/trunk/contrib> ./bench-stringop 64 640000000 gcc -march=native
memcpy
  block size  libcall rep1    noalg   rep4    noalg   rep8    noalg   loop    noalg   unrl    noalg   sse     noalg   byte    PGO     dynamic    BEST
     8192000  0:00.23 0:00.21 0:00.21 0:00.21 0:00.21 0:00.22 0:00.24 0:00.28 0:00.22 0:00.20 0:00.21 0:00.19 0:00.19 0:00.77 0:00.18 0:00.18    0:00.19 sse
      819200  0:00.09 0:00.18 0:00.18 0:00.18 0:00.18 0:00.18 0:00.20 0:00.19 0:00.16 0:00.15 0:00.16 0:00.13 0:00.14 0:00.63 0:00.09 0:00.09    0:00.09 libcall
       81920  0:00.06 0:00.07 0:00.07 0:00.06 0:00.06 0:00.06 0:00.06 0:00.12 0:00.11 0:00.11 0:00.10 0:00.07 0:00.08 0:00.66 0:00.11 0:00.06    0:00.06 libcall
       20480  0:00.06 0:00.07 0:00.05 0:00.06 0:00.07 0:00.07 0:00.08 0:00.14 0:00.14 0:00.10 0:00.11 0:00.06 0:00.07 0:01.11 0:00.07 0:00.09    0:00.05 rep1noalign
        8192  0:00.06 0:00.05 0:00.04 0:00.05 0:00.06 0:00.07 0:00.07 0:00.12 0:00.15 0:00.11 0:00.10 0:00.06 0:00.06 0:00.64 0:00.06 0:00.05    0:00.04 rep1noalign
        4096  0:00.05 0:00.05 0:00.05 0:00.06 0:00.07 0:00.05 0:00.05 0:00.09 0:00.14 0:00.11 0:00.10 0:00.07 0:00.06 0:00.61 0:00.05 0:00.07    0:00.05 libcall
        2048  0:00.04 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.10 0:00.14 0:00.09 0:00.09 0:00.09 0:00.07 0:00.64 0:00.06 0:00.07    0:00.04 libcall
        1024  0:00.06 0:00.08 0:00.08 0:00.10 0:00.11 0:00.06 0:00.06 0:00.12 0:00.15 0:00.09 0:00.09 0:00.16 0:00.09 0:00.63 0:00.05 0:00.06    0:00.06 libcall
         512  0:00.06 0:00.07 0:00.08 0:00.12 0:00.08 0:00.10 0:00.09 0:00.13 0:00.16 0:00.10 0:00.10 0:00.28 0:00.18 0:00.66 0:00.13 0:00.08    0:00.06 libcall
         256  0:00.10 0:00.12 0:00.11 0:00.14 0:00.11 0:00.12 0:00.13 0:00.14 0:00.16 0:00.13 0:00.12 0:00.49 0:00.30 0:00.68 0:00.14 0:00.12    0:00.10 libcall
         128  0:00.15 0:00.19 0:00.18 0:00.20 0:00.19 0:00.20 0:00.18 0:00.19 0:00.21 0:00.17 0:00.15 0:00.49 0:00.43 0:00.72 0:00.17 0:00.17    0:00.15 libcall
          64  0:00.29 0:00.28 0:00.29 0:00.33 0:00.33 0:00.34 0:00.29 0:00.25 0:00.29 0:00.26 0:00.26 0:01.01 0:00.97 0:01.13 0:00.32 0:00.28    0:00.25 loop
          48  0:00.37 0:00.39 0:00.38 0:00.45 0:00.41 0:00.45 0:00.44 0:00.45 0:00.33 0:00.32 0:00.33 0:02.21 0:02.22 0:00.87 0:00.32 0:00.31    0:00.32 unrl
          32  0:00.54 0:00.52 0:00.50 0:00.60 0:00.62 0:00.61 0:00.52 0:00.42 0:00.43 0:00.40 0:00.42 0:01.18 0:01.16 0:01.14 0:00.39 0:00.40    0:00.40 unrl
          24  0:00.71 0:00.74 0:00.77 0:00.83 0:00.78 0:00.81 0:00.75 0:00.52 0:00.52 0:00.52 0:00.50 0:02.28 0:02.27 0:00.94 0:00.49 0:00.50    0:00.50 unrlnoalign
          16  0:00.97 0:01.03 0:01.20 0:01.52 0:01.37 0:01.84 0:01.10 0:00.90 0:00.86 0:00.79 0:00.77 0:01.27 0:01.32 0:01.25 0:00.91 0:00.91    0:00.77 unrlnoalign
          14  0:01.35 0:01.37 0:01.39 0:01.76 0:01.44 0:01.53 0:01.58 0:01.01 0:00.99 0:00.94 0:00.94 0:01.34 0:01.29 0:01.28 0:01.01 0:00.99    0:00.94 unrl
          12  0:01.48 0:01.55 0:01.55 0:01.70 0:01.55 0:02.01 0:01.52 0:01.11 0:01.07 0:01.02 0:01.04 0:02.21 0:02.25 0:01.19 0:01.11 0:01.10    0:01.02 unrl
          10  0:01.73 0:01.90 0:01.88 0:02.05 0:01.86 0:02.09 0:01.78 0:01.32 0:01.41 0:01.25 0:01.23 0:02.46 0:02.25 0:01.36 0:01.50 0:01.38    0:01.23 unrlnoalign
           8  0:02.22 0:02.17 0:02.18 0:02.43 0:02.09 0:02.55 0:01.92 0:01.54 0:01.46 0:01.38 0:01.38 0:01.51 0:01.62 0:01.54 0:01.55 0:01.55    0:01.38 unrl

So indeed rep byte seems to consistently outperform rep4/rep8; however,
the unrolled variant seems to be better than rep byte for small block sizes.
Do you have some data showing blocks of size 8...256 to be faster with rep1
than with the unrolled loop, perhaps from more real-world benchmarks?

The difference seems to get quite big for small blocks in the range 8...16
bytes.  I noticed that before and sort of concluded that it is probably
the branch prediction playing relatively well for those small block
sizes. On the other hand, winding up the relatively long unrolled loop is
not very cool just to catch this case.
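
To make the two variants being compared concrete, here is a rough sketch of their shape in C. It is not the code GCC emits for either strategy: set_rep1, set_unrolled and both_match_memset are hypothetical names, the "rep stosb" inline asm assumes GCC or Clang on x86 (with a plain byte loop as the fallback elsewhere), and the four-way unrolling with 8-byte stores is just one plausible layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* "rep1" shape: a single rep stosb covering the whole block.  */
static void set_rep1(void *dst, unsigned char v, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  /* DI = destination, CX = count, AL = byte value.  */
  __asm__ volatile ("rep stosb"
                    : "+D"(dst), "+c"(n)
                    : "a"(v)
                    : "memory");
#else
  unsigned char *p = dst;
  while (n--)
    *p++ = v;
#endif
}

/* "unrolled loop" shape: 4 x 8-byte stores per iteration, then tails.  */
static void set_unrolled(void *dst, unsigned char v, size_t n)
{
  unsigned char *p = dst;
  uint64_t word = 0x0101010101010101ULL * v;  /* V replicated 8 times */

  while (n >= 32) {
    memcpy(p, &word, 8);
    memcpy(p + 8, &word, 8);
    memcpy(p + 16, &word, 8);
    memcpy(p + 24, &word, 8);
    p += 32;
    n -= 32;
  }
  while (n >= 8) {            /* remaining whole words */
    memcpy(p, &word, 8);
    p += 8;
    n -= 8;
  }
  while (n--)                 /* remaining bytes */
    *p++ = v;
}

/* Return 1 if both variants produce the same result as memset for
   N <= 512 bytes, including leaving bytes past N untouched.  */
static int both_match_memset(size_t n, unsigned char v)
{
  unsigned char a[512], b[512], ref[512];
  unsigned char fill = (unsigned char)~v;

  if (n > sizeof ref)
    return 0;
  memset(a, fill, sizeof a);
  memset(b, fill, sizeof b);
  memset(ref, fill, sizeof ref);
  set_rep1(a, v, n);
  set_unrolled(b, v, n);
  memset(ref, v, n);
  return memcmp(a, ref, sizeof ref) == 0 && memcmp(b, ref, sizeof ref) == 0;
}
```

The unrolled shape has three loops with data-dependent trip counts, which is where branch prediction comes into play for the 8...16 byte range, while the rep variant has no branches but pays a startup cost.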

Do you know which of the three changes (preferring reps/stosb,
CLEAR_RATIO and algorithm choice changes) causes the two speedups
on eembc?

Honza
> 
> -- 
> H.J.

Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs
