On Mon, 25 Nov 2024, Richard Biener wrote:

> On Mon, 25 Nov 2024, Hongtao Liu wrote:
>
> > On Sun, Nov 24, 2024 at 8:05 PM Richard Biener <rguent...@suse.de> wrote:
> > >
> > > > On 24.11.2024 at 09:17, Hongtao Liu <crazy...@gmail.com> wrote:
> > > >
> > > > On Fri, Nov 22, 2024 at 9:33 PM Richard Biener <rguent...@suse.de>
> > > > wrote:
> > > >>
> > > >> Similar to the X86_TUNE_AVX512_TWO_EPILOGUES tuning which enables
> > > >> an extra 128bit SSE vector epilogue when doing 512bit AVX512
> > > >> vectorization in the main loop the following allows a 64bit SSE
> > > >> vector epilogue to be generated when the previous vector epilogue
> > > >> still had a vectorization factor of 16 or larger (which usually
> > > >> means we are operating on char data).
> > > >>
> > > >> This effectively applies to 256bit and 512bit AVX2/AVX512 main loops,
> > > >> a 128bit SSE main loop would already get a 64bit SSE vector epilogue.
> > > >>
> > > >> Together with X86_TUNE_AVX512_TWO_EPILOGUES this means three
> > > >> vector epilogues for 512bit and two vector epilogues when enabling
> > > >> 256bit vectorization.  I have not added another tunable for this
> > > >> RFC - suggestions on how to avoid inflation there welcome.
> > > >>
> > > >> This speeds up 525.x264_r to within 5% of the -mprefer-vector-width=128
> > > >> speed with -mprefer-vector-width=256 or -mprefer-vector-width=512
> > > >> (the latter only when -mtune-ctrl=avx512_two_epilogues is in effect).
> > > >>
> > > >> I have not done any further benchmarking, this merely shows the
> > > >> possibility and looks for guidance on how to expose this to the
> > > >> uarch tunings or to the user (at all?) if not gating on any uarch
> > > >> specific tuning.
> > > >>
> > > >> Note 64bit SSE isn't a native vector size so we rely on emulation
> > > >> being "complete" (if not epilogue vectorization will only fail, so
> > > >> it's "safe" in this regard).  With AVX512 ISA available an alternative
> > > >> is a predicated epilogue, but due to possible STLF issues user control
> > > >> would be required here.
> > > >>
> > > >> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress
> > > >> (I expect some fallout in scans due to some extra epilogues, let's see)
> > > >
> > > > I'll do some benchmarking; I guess it should be OK.
> > >
> > > Any suggestion as to how (or if at all?) we should expose this to users
> > > for tuning?
> >
> > According to my benchmarking, it's generally better on both SRF and
> > SPR, and at most improves 14% on SRF, 9% on SPR for some specific
> > benchmark.
> > So I suggest turning it on by default, no need to put it under uarch tuning.
>
> I'm doing some further benchmarking on Zen5 and will push then if there
> are no surprises.
r15-5650-gd9c908b7503965

Richard.
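
For readers following the thread, here is a rough illustration of why the extra 64bit epilogue matters for char data. It is a toy model only, not code from the patch or from GCC: the function name split_iterations and the stage list are hypothetical, and it assumes a 512bit main loop on char data is followed by 256bit, 128bit and 64bit vector epilogues (vectorization factors 64, 32, 16 and 8), with whatever remains handled by the scalar loop.

```c
/* Toy model (hypothetical, not GCC code): distribute a trip count over a
   main vector loop followed by a chain of vector epilogues.

   vf[] holds the vectorization factors in decreasing order, for example
   {64, 32, 16, 8} for char data with a 512bit main loop.  counts[i]
   receives the number of vector iterations executed by stage i.  The
   return value is the number of iterations left for the scalar loop.  */
unsigned
split_iterations (unsigned n, const unsigned *vf, unsigned nstages,
                  unsigned *counts)
{
  for (unsigned i = 0; i < nstages; i++)
    {
      counts[i] = n / vf[i];   /* full vectors this stage can process */
      n %= vf[i];              /* remainder handed to the next stage */
    }
  return n;
}
```

In this model, a trip count of 63 with stages {64, 32, 16, 8} skips the main loop, runs each of the three epilogues once, and leaves 7 scalar iterations. Without the 8-lane stage the same trip count leaves 15 scalar iterations, which is the sort of tail the 64bit epilogue is meant to shorten.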