On Sun, Nov 24, 2024 at 8:05 PM Richard Biener <rguent...@suse.de> wrote:
>
>
>
> > On 24.11.2024 at 09:17, Hongtao Liu <crazy...@gmail.com> wrote:
> >
> > On Fri, Nov 22, 2024 at 9:33 PM Richard Biener <rguent...@suse.de> wrote:
> >>
> >> Similar to the X86_TUNE_AVX512_TWO_EPILOGUES tuning which enables
> >> an extra 128bit SSE vector epilogue when doing 512bit AVX512
> >> vectorization in the main loop the following allows a 64bit SSE
> >> vector epilogue to be generated when the previous vector epilogue
> >> still had a vectorization factor of 16 or larger (which usually
> >> means we are operating on char data).
> >>
> >> This effectively applies to 256bit and 512bit AVX2/AVX512 main loops,
> >> a 128bit SSE main loop would already get a 64bit SSE vector epilogue.
> >>
> >> Together with X86_TUNE_AVX512_TWO_EPILOGUES this means three
> >> vector epilogues for 512bit and two vector epilogues when enabling
> >> 256bit vectorization.  I have not added another tunable for this
> >> RFC - suggestions on how to avoid inflation there welcome.
> >>
> >> This speeds up 525.x264_r to within 5% of the -mprefer-vector-size=128
> >> speed with -mprefer-vector-size=256 or -mprefer-vector-size=512
> >> (the latter only when -mtune-ctrl=avx512_two_epilogues is in effect).
> >>
> >> I have not done any further benchmarking, this merely shows the
> >> possibility and looks for guidance on how to expose this to the
> >> uarch tunings or to the user (at all?) if not gating on any uarch
> >> specific tuning.
> >>
> >> Note 64bit SSE isn't a native vector size so we rely on emulation
> >> being "complete" (if not epilogue vectorization will only fail, so
> >> it's "safe" in this regard).  With AVX512 ISA available an alternative
> >> is a predicated epilog, but due to possible STLF issues user control
> >> would be required here.
> >>
> >> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress
> >> (I expect some fallout in scans due to some extra epilogues, let's see)
> > I'll do some benchmarking; I guess it should be OK.
>
> Any suggestion as to how (or if at all?) we should expose this to users for
> tuning?

According to my benchmarking, it is generally better on both SRF and SPR,
improving individual benchmarks by up to 14% on SRF and up to 9% on SPR.
So I suggest turning it on by default; there is no need to put it under a
uarch tuning.

>
> Richard
>
> >>
> >>        * config/i386/i386.cc (ix86_vector_costs::finish_cost): For an
> >>        128bit SSE epilogue request a 64bit SSE epilogue if the 128bit
> >>        SSE epilogue VF was 16 or higher.
> >> ---
> >> gcc/config/i386/i386.cc | 7 +++++++
> >> 1 file changed, 7 insertions(+)
> >>
> >> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> >> index c7e70c21999..f2e8de3aafc 100644
> >> --- a/gcc/config/i386/i386.cc
> >> +++ b/gcc/config/i386/i386.cc
> >> @@ -25495,6 +25495,13 @@ ix86_vector_costs::finish_cost (const
> >> vector_costs *scalar_costs)
> >>           && GET_MODE_SIZE (loop_vinfo->vector_mode) == 32)
> >>        m_suggested_epilogue_mode = V16QImode;
> >>     }
> >> +  /* When a 128bit SSE vectorized epilogue still has a VF of 16 or larger
> >> +     enable a 64bit SSE epilogue.  */
> >> +  if (loop_vinfo
> >> +      && LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> >> +      && GET_MODE_SIZE (loop_vinfo->vector_mode) == 16
> >> +      && LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant () >= 16)
> >> +    m_suggested_epilogue_mode = V8QImode;
> >>
> >>    vector_costs::finish_cost (scalar_costs);
> >>  }
> >> --
> >> 2.43.0
> >
> >
> >
> > --
> > BR,
> > Hongtao
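
For reference, the epilogue chains described in the patch can be sketched
with a small standalone model. This is not GCC code. The names LoopInfo,
suggest_next_epilogue_bytes and epilogue_chain are hypothetical, and the
model assumes that the first epilogue is half the width of the main loop,
that the VF equals vector bytes divided by element bytes, and that the
existing AVX512 two-epilogues check has the shape shown in the first
condition.

```cpp
#include <vector>

// Hypothetical stand-in for the loop_vinfo fields the hook inspects.
struct LoopInfo {
  bool is_epilogue;  // models LOOP_VINFO_EPILOGUE_P
  int vector_bytes;  // models GET_MODE_SIZE (loop_vinfo->vector_mode)
  int vf;            // models LOOP_VINFO_VECT_FACTOR
};

// Returns the vector size in bytes suggested for the next epilogue,
// or 0 when no further epilogue is suggested.
int suggest_next_epilogue_bytes(const LoopInfo& l, bool two_epilogues_tuning) {
  int suggestion = 0;
  // Assumed shape of the existing check: a 256bit epilogue requests a
  // 128bit one when the avx512_two_epilogues tuning is enabled.
  if (l.is_epilogue && l.vector_bytes == 32 && two_epilogues_tuning)
    suggestion = 16;  // V16QImode
  // The check added by the patch: a 128bit epilogue with VF >= 16
  // requests a 64bit one.
  if (l.is_epilogue && l.vector_bytes == 16 && l.vf >= 16)
    suggestion = 8;   // V8QImode
  return suggestion;
}

// Lists the epilogue vector sizes (in bytes) that follow a main loop.
// Simplifying assumptions: the first epilogue is half the main loop
// width, and VF is vector bytes divided by element bytes.
std::vector<int> epilogue_chain(int main_bytes, int elem_bytes,
                                bool two_epilogues_tuning) {
  std::vector<int> chain;
  int next = main_bytes / 2;
  while (next >= 8) {
    chain.push_back(next);
    LoopInfo epilogue{true, next, next / elem_bytes};
    next = suggest_next_epilogue_bytes(epilogue, two_epilogues_tuning);
  }
  return chain;
}
```

Under these assumptions, epilogue_chain(64, 1, true) gives 32, 16, 8 (three
epilogues for a 512bit main loop on char data) and epilogue_chain(32, 1,
false) gives 16, 8 (two epilogues for 256bit). With 4-byte elements the
128bit epilogue has a VF of only 4, so no 64bit epilogue is requested.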

--
BR,
Hongtao