> On 11.11.2024, at 18:09, Jan Hubicka <hubi...@ucw.cz> wrote:
>
>
>>
>> The following adds X86_TUNE_AVX512_TWO_EPILOGUES tuning and directs the
>> vectorizer to produce both a vector AVX2 and SSE epilogue for AVX512
>> vectorized loops when set. The tuning is enabled by default for Zen4
>> and Zen5 where I benchmarked it to be overall positive on SPEC CPU 2017 both
>> in performance and overall code size. In particular it speeds up
>> 525.x264_r which with only an AVX2 epilogue ends up in unvectorized code
>> at the moment.
>>
>> Re-bootstrap and regtest running on x86_64-unknown-linux-gnu
>> (I've added znver4 to the defaults after benchmarking there and have
>> to double-check no -mtune=znver4 testcase is affected). Note that
>> znver4|znver5 covers all AMD CPUs with AVX512.
>>
>> I did not do any benchmarking on Intel CPUs with AVX512 but I do
>> expect 525.x264_r to improve there as well.
>>
>> OK for trunk if testing succeeds?
>>
>> Thanks,
>> Richard.
>>
>> * config/i386/i386.cc (ix86_vector_costs::finish_cost): Set
>> m_suggested_epilogue_mode according to X86_TUNE_AVX512_TWO_EPILOGUES.
>> * config/i386/x86-tune.def (X86_TUNE_AVX512_TWO_EPILOGUES): Add.
>> Enable for znver4 and znver5.
>
> OK,
> I wonder - are there Intel CPUs for which we do not enable
> AVX256_OPTIMAL?
All of the Intel cores with AVX512 use AVX256_OPTIMAL, but the two-epilogue
tuning might still pay off when users explicitly select 512-bit vectorization.
But as I didn’t do any measurements I’ve refrained from guessing here.
Richard
>
> Honza
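
For readers following the thread, here is a minimal sketch of what a second vector epilogue changes for short trip counts. It is not GCC code: `struct split` and `split_bytes` are hypothetical names, and the model assumes byte-sized elements, so that a 512-bit main loop covers 64 elements per iteration, an AVX2 epilogue 32, and an SSE epilogue 16. Under those assumptions a remainder of 16 to 31 elements falls through to scalar code when only the AVX2 epilogue exists.

```c
#include <stddef.h>

/* Hypothetical model, not taken from GCC: number of byte elements
   handled by each stage of an AVX512-vectorized loop.  */
struct split {
    size_t main_512;  /* elements handled by the 64-byte main loop */
    size_t epi_256;   /* elements handled by the 32-byte AVX2 epilogue */
    size_t epi_128;   /* elements handled by the 16-byte SSE epilogue */
    size_t scalar;    /* elements left for the scalar loop */
};

/* Split a trip count of n byte elements across the stages.
   two_epilogues != 0 models the case with an extra SSE epilogue.  */
static struct split split_bytes(size_t n, int two_epilogues)
{
    struct split s = {0, 0, 0, 0};

    s.main_512 = n / 64 * 64;   /* full 64-byte vector iterations */
    n -= s.main_512;

    s.epi_256 = n / 32 * 32;    /* at most one 32-byte iteration */
    n -= s.epi_256;

    if (two_epilogues) {
        s.epi_128 = n / 16 * 16;  /* at most one 16-byte iteration */
        n -= s.epi_128;
    }

    s.scalar = n;               /* whatever no vector stage covered */
    return s;
}
```

In this model, `split_bytes(16, 0)` leaves all 16 elements to the scalar loop, while `split_bytes(16, 1)` handles them in a single SSE iteration. Whether that is the pattern behind the 525.x264_r result is an assumption; the thread only says that the benchmark ends up in unvectorized code with just an AVX2 epilogue.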