On Mon, 25 Nov 2024, Hongtao Liu wrote:

> On Sun, Nov 24, 2024 at 8:05 PM Richard Biener <rguent...@suse.de> wrote:
> >
> >
> >
> > > On 24.11.2024 at 09:17, Hongtao Liu <crazy...@gmail.com> wrote:
> > >
> > > On Fri, Nov 22, 2024 at 9:33 PM Richard Biener <rguent...@suse.de> wrote:
> > >>
> > >> Similar to the X86_TUNE_AVX512_TWO_EPILOGUES tuning which enables
> > >> an extra 128bit SSE vector epilogue when doing 512bit AVX512
> > >> vectorization in the main loop the following allows a 64bit SSE
> > >> vector epilogue to be generated when the previous vector epilogue
> > >> still had a vectorization factor of 16 or larger (which usually
> > >> means we are operating on char data).
> > >>
> > >> This effectively applies to 256bit and 512bit AVX2/AVX512 main loops,
> > >> a 128bit SSE main loop would already get a 64bit SSE vector epilogue.
> > >>
> > >> Together with X86_TUNE_AVX512_TWO_EPILOGUES this means three
> > >> vector epilogues for 512bit and two vector epilogues when enabling
> > >> 256bit vectorization.  I have not added another tunable for this
> > >> RFC - suggestions on how to avoid inflation there welcome.
> > >>
> > >> This speeds up 525.x264_r to within 5% of the -mprefer-vector-size=128
> > >> speed with -mprefer-vector-size=256 or -mprefer-vector-size=512
> > >> (the latter only when -mtune-ctrl=avx512_two_epilogues is in effect).
> > >>
> > >> I have not done any further benchmarking, this merely shows the
> > >> possibility and looks for guidance on how to expose this to the
> > >> uarch tunings or to the user (at all?) if not gating on any uarch
> > >> specific tuning.
> > >>
> > >> Note 64bit SSE isn't a native vector size so we rely on emulation
> > >> being "complete" (if not epilogue vectorization will only fail, so
> > >> it's "safe" in this regard).  With AVX512 ISA available an alternative
> > >> is a predicated epilog, but due to possible STLF issues user control
> > >> would be required here.
> > >>
> > >> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress
> > >> (I expect some fallout in scans due to some extra epilogues, let's see)
> > > I'll do some benchmarking; I guess it should be OK.
> >
> > Any suggestion as to how (or if at all?) we should expose this to users for
> > tuning?
> According to my benchmarking, it's generally better on both SRF and
> SPR, and improves by up to 14% on SRF and 9% on SPR for some specific
> benchmarks.
> So I suggest turning it on by default, no need to put it under uarch tuning.

I'm doing some further benchmarking on Zen5 and will push then if there
are no surprises.

Richard.

> >
> > Richard
> >
> > >>
> > >> * config/i386/i386.cc (ix86_vector_costs::finish_cost): For a
> > >> 128bit SSE epilogue request a 64bit SSE epilogue if the 128bit
> > >> SSE epilogue VF was 16 or higher.
> > >> ---
> > >>  gcc/config/i386/i386.cc | 7 +++++++
> > >>  1 file changed, 7 insertions(+)
> > >>
> > >> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > >> index c7e70c21999..f2e8de3aafc 100644
> > >> --- a/gcc/config/i386/i386.cc
> > >> +++ b/gcc/config/i386/i386.cc
> > >> @@ -25495,6 +25495,13 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> > >>                 && GET_MODE_SIZE (loop_vinfo->vector_mode) == 32)
> > >>          m_suggested_epilogue_mode = V16QImode;
> > >>      }
> > >> +  /* When a 128bit SSE vectorized epilogue still has a VF of 16 or larger
> > >> +     enable a 64bit SSE epilogue.  */
> > >> +  if (loop_vinfo
> > >> +      && LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> > >> +      && GET_MODE_SIZE (loop_vinfo->vector_mode) == 16
> > >> +      && LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant () >= 16)
> > >> +    m_suggested_epilogue_mode = V8QImode;
> > >>
> > >>    vector_costs::finish_cost (scalar_costs);
> > >>  }
> > >> --
> > >> 2.43.0
> > >
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
> >
> >

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
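
For readers who want to see when the new condition fires, here is a minimal standalone sketch of the predicate added by the hunk above. It is not GCC code: `LoopModel`, `wants_64bit_epilogue` and `simple_vf` are hypothetical names made up for illustration, and the assumption that the VF equals vector bytes divided by element bytes is a simplification that only holds for straightforward loops.

```cpp
// Hypothetical model of the three loop properties the hunk inspects.
// None of these names exist in GCC; they only mirror the macros used there.
struct LoopModel {
  bool is_epilogue;       // stands in for LOOP_VINFO_EPILOGUE_P (loop_vinfo)
  unsigned vector_bytes;  // stands in for GET_MODE_SIZE (loop_vinfo->vector_mode)
  unsigned vect_factor;   // stands in for LOOP_VINFO_VECT_FACTOR (loop_vinfo)
};

// Mirrors the added condition: a 128bit (16 byte) vector epilogue whose VF
// is still 16 or larger requests a further 64bit (V8QImode) epilogue.
constexpr bool wants_64bit_epilogue(const LoopModel& loop) {
  return loop.is_epilogue && loop.vector_bytes == 16 && loop.vect_factor >= 16;
}

// Simplifying assumption (not taken from the patch): for a simple loop the
// VF is the number of elements that fit into one vector.
constexpr unsigned simple_vf(unsigned vector_bytes, unsigned elem_bytes) {
  return vector_bytes / elem_bytes;
}

// char data in a 128bit epilogue: VF 16, so a 64bit epilogue is requested.
static_assert(wants_64bit_epilogue({true, 16, simple_vf(16, 1)}),
              "char data should request a 64bit epilogue");
// int data in a 128bit epilogue: VF 4, so nothing further is requested.
static_assert(!wants_64bit_epilogue({true, 16, simple_vf(16, 4)}),
              "int data should not request a 64bit epilogue");
// A 128bit main loop (not an epilogue) is not affected by this condition.
static_assert(!wants_64bit_epilogue({false, 16, simple_vf(16, 1)}),
              "main loops are not matched by the new condition");
```

Under that assumption, the VF threshold of 16 on a 16 byte vector is only reached with single-byte elements, which matches the "usually means we are operating on char data" remark in the patch description.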