On Sun, Nov 24, 2024 at 8:05 PM Richard Biener <rguent...@suse.de> wrote:
>
>
>
> > Am 24.11.2024 um 09:17 schrieb Hongtao Liu <crazy...@gmail.com>:
> >
> > On Fri, Nov 22, 2024 at 9:33 PM Richard Biener <rguent...@suse.de> wrote:
> >>
> >> Similar to the X86_TUNE_AVX512_TWO_EPILOGUES tuning which enables
> >> an extra 128bit SSE vector epilogue when doing 512bit AVX512
> >> vectorization in the main loop, the following allows a 64bit SSE
> >> vector epilogue to be generated when the previous vector epilogue
> >> still had a vectorization factor of 16 or larger (which usually
> >> means we are operating on char data).
> >>
> >> This effectively applies to 256bit and 512bit AVX2/AVX512 main loops;
> >> a 128bit SSE main loop would already get a 64bit SSE vector epilogue.
> >>
> >> Together with X86_TUNE_AVX512_TWO_EPILOGUES this means three
> >> vector epilogues for 512bit and two vector epilogues when enabling
> >> 256bit vectorization.  I have not added another tunable for this
> >> RFC - suggestions on how to avoid inflation there welcome.
> >>
> >> This speeds up 525.x264_r to within 5% of the -mprefer-vector-size=128
> >> speed with -mprefer-vector-size=256 or -mprefer-vector-size=512
> >> (the latter only when -mtune-ctrl=avx512_two_epilogues is in effect).
> >>
> >> I have not done any further benchmarking, this merely shows the
> >> possibility and looks for guidance on how to expose this to the
> >> uarch tunings or to the user (at all?) if not gating on any uarch
> >> specific tuning.
> >>
> >> Note 64bit SSE isn't a native vector size, so we rely on emulation
> >> being "complete" (if it is not, epilogue vectorization will simply
> >> fail, so it's "safe" in this regard).  With AVX512 ISA available an
> >> alternative is a predicated epilogue, but due to possible STLF issues
> >> user control would be required here.
> >>
> >> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress
> >> (I expect some fallout in scans due to some extra epilogues, let's see)
> > I'll do some benchmarking; I guess it should be OK.
>
> Any suggestion as to how (or if at all?) we should expose this to users for 
> tuning?
According to my benchmarking, it's generally better on both SRF and
SPR, with gains of up to 14% on SRF and 9% on SPR for some specific
benchmarks.
So I suggest turning it on by default; there is no need to put it under
a uarch tuning.
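
For intuition on why the extra epilogue helps char loops, here is a rough standalone sketch. It is not GCC code: the helper name `scalar_remainder` and the simple "each stage takes as many full vectors as fit" model are assumptions for illustration, and it ignores alignment peeling and cost-model thresholds. It counts how many iterations are left for the scalar loop given a chain of vectorization factors (main loop first, then each vector epilogue).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of GCC.
// niters: scalar trip count of the loop.
// vfs:    vectorization factors of the main loop and of each vector
//         epilogue, in execution order (e.g. 64, 32, 16, 8 for char
//         data with 512/256/128/64-bit vectors).
// Returns the number of iterations left for the final scalar loop.
std::size_t scalar_remainder(std::size_t niters,
                             const std::vector<std::size_t>& vfs)
{
    std::size_t left = niters;
    for (std::size_t vf : vfs) {
        assert(vf != 0 && "vectorization factor must be non-zero");
        // Each stage consumes as many full vectors as fit; the rest
        // falls through to the next (narrower) stage.
        left %= vf;
    }
    return left;
}
```

With this model, a char loop of 63 iterations leaves `scalar_remainder(63, {64, 32, 16})` = 15 scalar iterations without the 64bit epilogue and `scalar_remainder(63, {64, 32, 16, 8})` = 7 with it, so the worst-case scalar tail roughly halves.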
>
> Richard
>
> >>
> >>        * config/i386/i386.cc (ix86_vector_costs::finish_cost): For a
> >>        128bit SSE epilogue request a 64bit SSE epilogue if the 128bit
> >>        SSE epilogue VF was 16 or higher.
> >> ---
> >> gcc/config/i386/i386.cc | 7 +++++++
> >> 1 file changed, 7 insertions(+)
> >>
> >> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> >> index c7e70c21999..f2e8de3aafc 100644
> >> --- a/gcc/config/i386/i386.cc
> >> +++ b/gcc/config/i386/i386.cc
> >> @@ -25495,6 +25495,13 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> >>               && GET_MODE_SIZE (loop_vinfo->vector_mode) == 32)
> >>        m_suggested_epilogue_mode = V16QImode;
> >>     }
> >> +  /* When a 128bit SSE vectorized epilogue still has a VF of 16 or larger
> >> +     enable a 64bit SSE epilogue.  */
> >> +  if (loop_vinfo
> >> +      && LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> >> +      && GET_MODE_SIZE (loop_vinfo->vector_mode) == 16
> >> +      && LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant () >= 16)
> >> +    m_suggested_epilogue_mode = V8QImode;
> >>
> >>   vector_costs::finish_cost (scalar_costs);
> >> }
> >> --
> >> 2.43.0
> >
> >
> >
> > --
> > BR,
> > Hongtao



-- 
BR,
Hongtao
