> On 24.11.2024 at 09:17, Hongtao Liu <crazy...@gmail.com> wrote:
>
> On Fri, Nov 22, 2024 at 9:33 PM Richard Biener <rguent...@suse.de> wrote:
>>
>> Similar to the X86_TUNE_AVX512_TWO_EPILOGUES tuning, which enables
>> an extra 128bit SSE vector epilogue when doing 512bit AVX512
>> vectorization in the main loop, the following allows a 64bit SSE
>> vector epilogue to be generated when the previous vector epilogue
>> still had a vectorization factor of 16 or larger (which usually
>> means we are operating on char data).
>>
>> This effectively applies to 256bit and 512bit AVX2/AVX512 main loops;
>> a 128bit SSE main loop would already get a 64bit SSE vector epilogue.
>>
>> Together with X86_TUNE_AVX512_TWO_EPILOGUES this means three
>> vector epilogues for 512bit and two vector epilogues when enabling
>> 256bit vectorization. I have not added another tunable for this
>> RFC; suggestions on how to avoid tunable inflation are welcome.
>>
>> This speeds up 525.x264_r to within 5% of the -mprefer-vector-width=128
>> speed with -mprefer-vector-width=256 or -mprefer-vector-width=512
>> (the latter only when -mtune-ctrl=avx512_two_epilogues is in effect).
>>
>> I have not done any further benchmarking; this merely shows the
>> possibility and looks for guidance on how to expose this to the
>> uarch tunings or to the user (at all?) if not gating on any uarch
>> specific tuning.
>>
>> Note that 64bit SSE isn't a native vector size, so we rely on emulation
>> being "complete" (if it is not, epilogue vectorization will simply fail,
>> so it's "safe" in this regard). With the AVX512 ISA available an
>> alternative is a predicated epilogue, but due to possible STLF issues
>> user control would be required here.
>>
>> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress
>> (I expect some fallout in scans due to some extra epilogues, let's see)
> I'll do some benchmarking; I guess it should be OK.
Any suggestion as to how (or if at all?) we should expose this to users for
tuning?
Richard
>>
>> * config/i386/i386.cc (ix86_vector_costs::finish_cost): For a
>> 128bit SSE epilogue request a 64bit SSE epilogue if the 128bit
>> SSE epilogue VF was 16 or higher.
>> ---
>> gcc/config/i386/i386.cc | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
>> index c7e70c21999..f2e8de3aafc 100644
>> --- a/gcc/config/i386/i386.cc
>> +++ b/gcc/config/i386/i386.cc
>> @@ -25495,6 +25495,13 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
>> && GET_MODE_SIZE (loop_vinfo->vector_mode) == 32)
>> m_suggested_epilogue_mode = V16QImode;
>> }
>> + /* When a 128bit SSE vectorized epilogue still has a VF of 16 or larger
>> + enable a 64bit SSE epilogue. */
>> + if (loop_vinfo
>> + && LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>> + && GET_MODE_SIZE (loop_vinfo->vector_mode) == 16
>> + && LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant () >= 16)
>> + m_suggested_epilogue_mode = V8QImode;
>>
>> vector_costs::finish_cost (scalar_costs);
>> }
>> --
>> 2.43.0
>
>
>
> --
> BR,
> Hongtao
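
For discussing how to expose this, it may help to have the resulting epilogue chains written down. Below is a rough stand-alone model of them, not GCC code: TuneModel, sse64_epilogue and epilogue_chain are made-up names, the "one epilogue at half the main vector size" default is an assumption read off the description above, and the VF is approximated as vector bits divided by element bits (no unrolling, single element type).

```cpp
#include <vector>

// Hypothetical model only; none of these names exist in GCC.
struct TuneModel {
  bool avx512_two_epilogues;  // stands in for X86_TUNE_AVX512_TWO_EPILOGUES
  bool sse64_epilogue;        // stands in for the behaviour proposed in the patch
};

// Returns the vector sizes (in bits) of the vector epilogues that would
// follow a main loop of main_bits operating on elements of elem_bits.
std::vector<unsigned> epilogue_chain(unsigned main_bits, unsigned elem_bits,
                                     const TuneModel &tune) {
  std::vector<unsigned> chain;
  if (main_bits <= 64)
    return chain;  // nothing smaller to fall back to in this model

  // Assumed default: a single epilogue at half the main vector size.
  unsigned cur = main_bits / 2;
  chain.push_back(cur);

  // Existing tuning: a 512bit main loop gets an extra 128bit epilogue
  // after its 256bit one.
  if (tune.avx512_two_epilogues && main_bits == 512) {
    cur = 128;
    chain.push_back(cur);
  }

  // Proposed: a 128bit *epilogue* whose (approximated) VF is still
  // 16 or larger requests a 64bit epilogue.
  if (tune.sse64_epilogue && cur == 128 && cur / elem_bits >= 16)
    chain.push_back(64);

  return chain;
}
```

With both switches on and char elements this gives {256, 128, 64} for a 512bit main loop and {128, 64} for a 256bit one, matching the "three" and "two" epilogues above; with int elements the 64bit step never triggers because the approximated VF of the 128bit epilogue is only 4.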