Similar to the X86_TUNE_AVX512_TWO_EPILOGUES tuning, which enables an
extra 128bit SSE vector epilogue when doing 512bit AVX512 vectorization
in the main loop, the following allows a 64bit SSE vector epilogue to be
generated when the previous vector epilogue still had a vectorization
factor of 16 or larger (which usually means we are operating on char
data).
This effectively applies to 256bit and 512bit AVX2/AVX512 main loops;
a 128bit SSE main loop would already get a 64bit SSE vector epilogue.

Together with X86_TUNE_AVX512_TWO_EPILOGUES this means three vector
epilogues for 512bit and two vector epilogues when enabling 256bit
vectorization.  I have not added another tunable for this RFC;
suggestions on how to avoid inflation there are welcome.

This speeds up 525.x264_r to within 5% of the -mprefer-vector-size=128
speed with -mprefer-vector-size=256 or -mprefer-vector-size=512 (the
latter only when -mtune-ctl=avx512_two_epilogues is in effect).

I have not done any further benchmarking.  This merely shows the
possibility and looks for guidance on how to expose this to the uarch
tunings or to the user (at all?) if not gating on any uarch specific
tuning.

Note that 64bit SSE isn't a native vector size, so we rely on emulation
being "complete" (if it is not, epilogue vectorization will only fail,
so it's "safe" in this regard).  With the AVX512 ISA available an
alternative is a predicated epilogue, but due to possible STLF issues
user control would be required here.

Bootstrapped on x86_64-unknown-linux-gnu, testing in progress (I expect
some fallout in scans due to some extra epilogues, let's see).

	* config/i386/i386.cc (ix86_vector_costs::finish_cost): For a
	128bit SSE epilogue request a 64bit SSE epilogue if the 128bit
	SSE epilogue VF was 16 or higher.
---
 gcc/config/i386/i386.cc | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index c7e70c21999..f2e8de3aafc 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -25495,6 +25495,13 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
 	  && GET_MODE_SIZE (loop_vinfo->vector_mode) == 32)
 	m_suggested_epilogue_mode = V16QImode;
     }
+  /* When a 128bit SSE vectorized epilogue still has a VF of 16 or larger
+     enable a 64bit SSE epilogue.  */
+  if (loop_vinfo
+      && LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+      && GET_MODE_SIZE (loop_vinfo->vector_mode) == 16
+      && LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant () >= 16)
+    m_suggested_epilogue_mode = V8QImode;
 
   vector_costs::finish_cost (scalar_costs);
 }
-- 
2.43.0
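
For readers who want to reason about when the new check in finish_cost
fires, here is a minimal standalone sketch of the condition.  It is not
GCC code: LoopInfo, suggested_next_epilogue_bytes and simple_vf are
hypothetical names, and simple_vf assumes the VF is simply vector bytes
divided by element bytes, which ignores unrolling and mixed element
sizes.

```cpp
#include <cassert>

// Hypothetical stand-in for the three properties the new check reads.
struct LoopInfo {
  bool is_epilogue;       // models LOOP_VINFO_EPILOGUE_P
  unsigned vector_bytes;  // models GET_MODE_SIZE (vector_mode)
  unsigned vf;            // models LOOP_VINFO_VECT_FACTOR
};

// Size in bytes of the additional epilogue the check would request,
// or 0 when the check does not fire.
constexpr unsigned suggested_next_epilogue_bytes(const LoopInfo &l) {
  return (l.is_epilogue && l.vector_bytes == 16 && l.vf >= 16) ? 8 : 0;
}

// Assumption: single element size, no unrolling.
constexpr unsigned simple_vf(unsigned vector_bytes, unsigned element_bytes) {
  return vector_bytes / element_bytes;
}

void run_examples() {
  // char data in a 128bit epilogue: VF is 16, so a 64bit epilogue is requested.
  assert(suggested_next_epilogue_bytes({true, 16, simple_vf(16, 1)}) == 8);
  // int data in a 128bit epilogue: VF is 4, so nothing is requested.
  assert(suggested_next_epilogue_bytes({true, 16, simple_vf(16, 4)}) == 0);
  // A 128bit main loop is not an epilogue, so this check does not apply.
  assert(suggested_next_epilogue_bytes({false, 16, 16}) == 0);
}
```

Under these assumptions only byte-sized elements reach a VF of 16 in a
128bit epilogue, which matches the "usually char data" remark in the
description.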