12 Jul 2021, 11:29 by alankelly-at-google....@ffmpeg.org:

> On Fri, Jun 25, 2021 at 1:24 PM Alan Kelly <alanke...@google.com> wrote:
>
>> On Fri, Jun 25, 2021 at 10:40 AM Lynne <d...@lynne.ee> wrote:
>>
>>> Jun 25, 2021, 09:54 by alankelly-at-google....@ffmpeg.org:
>>>
>>> > Broadwell and later and Zen3 and later have fast gather instructions.
>>> > ---
>>> > Gather requires between 9 and 12 cycles on Haswell, 5 to 7 on Broadwell,
>>> > and 2 to 5 on Skylake and newer. It is also slow on AMD before Zen 3.
>>> > libavutil/cpu.h | 2 ++
>>> > libavutil/x86/cpu.c | 18 ++++++++++++++++--
>>> > libavutil/x86/cpu.h | 1 +
>>> > 3 files changed, 19 insertions(+), 2 deletions(-)
>>> >
>>>
>>> No, we really don't need more FAST/SLOW flags, especially for
>>> something like this which is just fixable by _not_using_vgather_.
>>> Take a look at libavutil/x86/tx_float.asm, we only use vgather
>>> if it's guaranteed to either be faster for what we're gathering or
>>> is just as fast "slow". If neither is true, we use manual lookups,
>>> which is actually advantageous since for AVX2 we can interleave
>>> the lookups that happen in each lane.
>>>
>>> Even if we disregard this, I've extensively benchmarked vgather
>>> on Zen 3, Zen 2, Cascade Lake and Skylake, and there's hardly
>>> a great vgather improvement to be found in Zen 3 to justify
>>> using a new CPU flag for this.
>>> _______________________________________________
>>> ffmpeg-devel mailing list
>>> ffmpeg-devel@ffmpeg.org
>>> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>>>
>>> To unsubscribe, visit link above, or email
>>> ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
>>>
>>
>> Thanks for your response. I'm not against finding a cleaner way of
>> enabling/disabling the code which will be protected by this flag. However,
>> the manual lookups solution proposed will not work in this case, the avx2
>> version of hscale will only be faster if fast gathers are available,
>> otherwise, the ssse3 version should be used.
>>
>> I haven't got access to a Zen3 so I can't comment on the performance. I
>> have tested on a Zen 2 and it is slow. On Broadwell hscale avx2 is about
>> 10% faster than the ssse3 version and on Skylake about 40% faster, Haswell
>> has similar performance to Zen2.
>>
>> Is there a proxy which could be used for detecting Broadwell or Skylake
>> and later? AVX512 seems too strict as there are Skylake chips without
>> AVX512. Thanks
>>
>
> Hi,
>
> I will paste the performance figures from the thread for the other part of
> this patch here so that the justification for this flag is clearer:
>
>                                 Skylake   Haswell
> hscale_8_to_15_width4_ssse3       761.2       760
> hscale_8_to_15_width4_avx2        468.7       957
> hscale_8_to_15_width8_ssse3      1170.7      1032
> hscale_8_to_15_width8_avx2        865.7      1979
> hscale_8_to_15_width12_ssse3     2172.2      2472
> hscale_8_to_15_width12_avx2      1245.7      2901
> hscale_8_to_15_width16_ssse3     2244.2      2400
> hscale_8_to_15_width16_avx2      1647.2      3681
>
> As you can see, it is catastrophic on Haswell and older chips but the gains
> on Skylake are impressive.
> As I don't have performance figures for Zen 3, I can disable this feature
> on all cpus apart from Broadwell and later as you say that there is no
> worthwhile improvement on Zen3. Is this OK with you?
>
It's not that catastrophic. Since Haswell CPUs generally don't have
large AVX2 gains, could you just exclude Haswell only from
EXTERNAL_AVX2_FAST, and require EXTERNAL_AVX2_FAST to enable those
functions?
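
For illustration, excluding Haswell from a "fast AVX2" flag could look roughly like the sketch below. It is not the actual libavutil/x86/cpu.c change. The helper names and the `SKETCH_FLAG_AVX2_FAST` bit are hypothetical, and it assumes the caller has already read EAX from CPUID leaf 1 and checked that the vendor is Intel. The Haswell model numbers (family 6, models 0x3C, 0x3F, 0x45, 0x46) are Intel's published identifiers.

```c
#include <stdint.h>

/* Hypothetical flag bit for this sketch; NOT an FFmpeg AV_CPU_FLAG_* value. */
#define SKETCH_FLAG_AVX2_FAST 0x1u

/* Display family from CPUID leaf 1 EAX: the extended family (bits 27:20)
 * is added only when the base family (bits 11:8) is 0xF. */
static unsigned cpuid_family(uint32_t eax)
{
    unsigned family = (eax >> 8) & 0xf;
    if (family == 0xf)
        family += (eax >> 20) & 0xff;
    return family;
}

/* Display model: the extended model (bits 19:16) supplies the high nibble
 * when the base family is 6 or 0xF. */
static unsigned cpuid_model(uint32_t eax)
{
    unsigned base_family = (eax >> 8) & 0xf;
    unsigned model       = (eax >> 4) & 0xf;
    if (base_family == 0x6 || base_family == 0xf)
        model |= ((eax >> 16) & 0xf) << 4;
    return model;
}

/* Returns 1 for Intel family 6 models that are Haswell parts. */
static int is_intel_haswell(uint32_t eax)
{
    if (cpuid_family(eax) != 6)
        return 0;
    switch (cpuid_model(eax)) {
    case 0x3c: /* Haswell client */
    case 0x3f: /* Haswell server/E */
    case 0x45: /* Haswell ULT */
    case 0x46: /* Haswell with GT3e */
        return 1;
    default:
        return 0;
    }
}

/* Clear the hypothetical fast-AVX2 bit on Haswell; leave other CPUs alone. */
static unsigned sketch_filter_flags(unsigned flags, int vendor_is_intel,
                                    uint32_t eax)
{
    if (vendor_is_intel && is_intel_haswell(eax))
        flags &= ~SKETCH_FLAG_AVX2_FAST;
    return flags;
}
```

With this shape, the hscale AVX2 functions would be gated on the fast-AVX2 check alone, and Haswell would fall back to the ssse3 versions without a separate gather-specific flag.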