On Mon, Jan 11, 2021 at 1:26 AM Carl Eugen Hoyos <ceffm...@gmail.com> wrote:
> On Sun, Jan 10, 2021 at 19:55, Lynne <d...@lynne.ee> wrote:
> >
> > Jan 10, 2021, 17:43 by reimar.doeffin...@gmx.de:
> >
> > > From: Reimar Döffinger <reimar.doeffin...@gmx.de>
> > >
> > > This requests loops to be vectorized using SIMD
> > > instructions.
> > > The performance increase is far from hand-optimized
> > > assembly but still significant over the plain C version.
> > > Typical values are a 2-4x speedup where a hand-written
> > > version would achieve 4x-10x.
> > > So it is far from a replacement, however some architectures
> > > will get hand-written assembler quite late or not at all,
> > > and this is a good improvement for a trivial amount of work.
> > > The cause, besides the compiler being a compiler, is
> > > usually that it does not manage to use saturating instructions
> > > and thus has to use 32-bit operations where actually
> > > saturating 16-bit operations would be sufficient.
> > > Other causes are for example the av_clip functions that
> > > are not ideal for vectorization (and even as scalar code
> > > not optimal for any modern CPU that has either CSEL or
> > > MAX/MIN instructions).
> > > And of course this only works for relatively simple
> > > loops, the IDCT functions for example seemed not possible
> > > to optimize that way.
> > > Also note that while clang may accept the code and sometimes
> > > produces warnings, it does not seem to do anything actually
> > > useful at all.
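To make the saturating-instruction point above concrete, here is a minimal sketch in C of the kind of loop being discussed. It is not taken from the patch: the names clip_int16 and add_clip_int16 are hypothetical and only stand in for an av_clip-style helper and a simple add-with-clipping loop. Because of C integer promotion, the sum is computed in int (32-bit on the platforms in question) and clipped afterwards, and that is what a vectorizer has to reproduce unless it recognizes that one saturating 16-bit add per element gives the same result.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an av_clip-style helper:
 * clamp a 32-bit intermediate to the signed 16-bit range. */
static inline int clip_int16(int v)
{
    if (v < INT16_MIN)
        return INT16_MIN;
    if (v > INT16_MAX)
        return INT16_MAX;
    return v;
}

/* Hypothetical example loop: add a residual to each sample with clipping.
 * dst[i] + res[i] is promoted to int, so the plain C semantics are
 * "32-bit add, then clamp". A hand-written SIMD version could use a
 * single saturating 16-bit add per vector instead. */
static void add_clip_int16(int16_t *dst, const int16_t *res, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (int16_t)clip_int16(dst[i] + res[i]);
}
```

For example, with dst = {32000, -32000} and res = {1000, -1000}, the stored results are 32767 and -32768, which is exactly what a saturating 16-bit add would produce.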
> > > Here are example measurements using gcc 10 under Linux (in a VM unfortunately)
> > > on AArch64 on Apple M1:
> > > Command:
> > > time ./ffplay_g LG\ 4K\ HDR\ Demo\ -\ New\ York.ts -t 10 -autoexit -threads 1 -noframedrop
> > >
> > > Original code:
> > > real 0m19.572s
> > > user 0m23.386s
> > > sys 0m0.213s
> > >
> > > Changing all put_hevc:
> > > real 0m15.648s
> > > user 0m19.503s (83.4% of original)
> > > sys 0m0.186s
> > >
> > > In addition changing add_residual:
> > > real 0m15.424s
> > > user 0m19.278s (82.4% of original)
> > > sys 0m0.133s
> > >
> > > In addition changing planar copy dither:
> > > real 0m15.040s
> > > user 0m18.874s (80.7% of original)
> > > sys 0m0.168s
> > >
> >
> > I think I have to disagree.
> > The performance gains are marginal
>
> This sounds wrong.

I disagree with Carl.

> Carl Eugen
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".