On Mon, 25 Nov 2024, Richard Biener wrote:

> On Mon, 25 Nov 2024, Hongtao Liu wrote:
> 
> > On Sun, Nov 24, 2024 at 8:05 PM Richard Biener <rguent...@suse.de> wrote:
> > >
> > >
> > >
> > > > On 24.11.2024 at 09:17, Hongtao Liu <crazy...@gmail.com> wrote:
> > > >
> > > > On Fri, Nov 22, 2024 at 9:33 PM Richard Biener <rguent...@suse.de> wrote:
> > > >>
> > > >> Similar to the X86_TUNE_AVX512_TWO_EPILOGUES tuning which enables
> > > > >> an extra 128bit SSE vector epilogue when doing 512bit AVX512
> > > > >> vectorization in the main loop, the following allows a 64bit SSE
> > > >> vector epilogue to be generated when the previous vector epilogue
> > > >> still had a vectorization factor of 16 or larger (which usually
> > > >> means we are operating on char data).
> > > >>
> > > >> This effectively applies to 256bit and 512bit AVX2/AVX512 main loops,
> > > >> a 128bit SSE main loop would already get a 64bit SSE vector epilogue.
> > > >>
> > > >> Together with X86_TUNE_AVX512_TWO_EPILOGUES this means three
> > > >> vector epilogues for 512bit and two vector epilogues when enabling
> > > >> 256bit vectorization.  I have not added another tunable for this
> > > >> RFC - suggestions on how to avoid inflation there welcome.
> > > >>
> > > >> This speeds up 525.x264_r to within 5% of the -mprefer-vector-size=128
> > > >> speed with -mprefer-vector-size=256 or -mprefer-vector-size=512
> > > > >> (the latter only when -mtune-ctrl=avx512_two_epilogues is in effect).
> > > >>
> > > >> I have not done any further benchmarking, this merely shows the
> > > >> possibility and looks for guidance on how to expose this to the
> > > >> uarch tunings or to the user (at all?) if not gating on any uarch
> > > >> specific tuning.
> > > >>
> > > > >> Note 64bit SSE isn't a native vector size so we rely on emulation
> > > > >> being "complete" (if not, epilogue vectorization will simply fail, so
> > > > >> it's "safe" in this regard).  With AVX512 ISA available an alternative
> > > > >> is a predicated epilogue, but due to possible STLF issues user control
> > > > >> would be required here.
> > > >>
> > > >> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress
> > > >> (I expect some fallout in scans due to some extra epilogues, let's see)
> > > > I'll do some benchmarking; I guess it should be OK.
> > >
> > > Any suggestion as to how (or if at all?) we should expose this to users 
> > > for tuning?
> > According to my benchmarking, it's generally better on both SRF and
> > SPR, improving by up to 14% on SRF and 9% on SPR for some specific
> > benchmarks.
> > So I suggest turning it on by default; no need to put it under uarch tuning.
> 
> I'm doing some further benchmarking on Zen5 and will push then if there
> are no surprises.

r15-5650-gd9c908b7503965

Richard.
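
As an illustration of the epilogue chain described in the quoted patch
description (this is not code from the patch), the following minimal C sketch
computes how many iterations are left for the scalar loop once a main vector
loop and its vector epilogues have run. The helper name scalar_tail and the
model itself are assumptions made for this example: each stage is taken to
consume whole multiples of its vectorization factor, and the factors
64/32/16/8 assume char data with 512bit, 256bit, 128bit and 64bit vectors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of GCC: given a trip count n and the
   vectorization factors of the main loop followed by its vector epilogues
   (largest first), return the number of iterations left for the scalar
   loop.  Simplified model: every stage consumes as many whole multiples
   of its VF as fit into what remains.  */
size_t scalar_tail(size_t n, const size_t *vf, size_t nstages)
{
    size_t rem = n;
    for (size_t i = 0; i < nstages; i++) {
        assert(vf[i] > 0);   /* a VF of zero would make no sense */
        rem %= vf[i];        /* iterations this stage cannot cover */
    }
    return rem;
}
```

Under this model, a char loop with 127 iterations and stages {64, 32, 16}
leaves 15 scalar iterations, while adding the 64bit stage, {64, 32, 16, 8},
leaves 7. That shorter scalar tail is the kind of saving the extra epilogue
is aiming for.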
