Sep 24, 2022, 21:40 by mar...@martin.st:

> On Sat, 24 Sep 2022, Hendrik Leppkes wrote:
>
>> On Sat, Sep 24, 2022 at 9:26 PM Hendrik Leppkes <h.lepp...@gmail.com> wrote:
>>
>>>
>>> On Sat, Sep 24, 2022 at 8:43 PM Martin Storsjö <mar...@martin.st> wrote:
>>> >
>>> > On Sat, 24 Sep 2022, Lynne wrote:
>>> >
>>> > > This commit changes both the encoder and decoder to use the new lavu/tx
>>> > > code, which has faster C transforms and more assembly optimizations.
>>> >
>>> > What's the case of e.g. 32 bit arm - that does have a bunch of fft and
>>> > mdct assembly, but is that something that ends up used by opus today, or
>>> > does the mdct15 stuff use separate codepaths that aren't optimized there
>>> > today yet?
>>> >
>>>
>>> mdct15 only has some x86 assembly, nothing for ARM.
>>> Only the normal (power of 2) fft/mdct has some ARM 32-bit assembly.
>>>
>>
>> Actually, I missed that the mdct15 internally uses one of the normal
>> fft functions for a part of the calculation, but how much impact that
>> has on performance vs. the new code where the C alone is quite a bit
>> faster would have to be confirmed by Lynne.
>>
>
> Ok, fair enough.
>

I did some benchmarking. lavc's C nptwo MDCT alone is 10% slower than
lavu's C nptwo MDCT.

I don't have 32bit ARM hardware to test on, but I do have an aarch64 A53
core. On it, with all optimizations enabled, the decoder is 15% faster
with this patch than without it. With lavu/tx's aarch64 assembly disabled
to simulate arm32's situation, the decoder was still 10% faster overall.
It's probably going to be similar on arm32.

On x86, the decoder with this patch but with all lavu/tx asm disabled is
only 10% slower than the decoder without this patch. With assembly enabled
and this patch, the decoder is 15% faster overall on an Alder Lake system.

As for overall decoding time for Opus, the MDCT is very far behind the
largest overhead, coefficient decoding (on x86 with optimizations, 50% of
the time is spent there, whilst only 5% is spent on the MDCT in total).
It's a very optimized decoder.

For the transform alone, taking a C non-power-of-two lavu MDCT at the
lengths used by Opus, using C instead of AVX for the ptwo part makes
960pt transforms on the order of 20% slower, and the SSE vs C difference
for 240pt is also around 20%. Most of this is due to function call
overhead: (framesize/2)/ptwo = 120, 60, 30 and 15 calls to ptwo FFTs per
transform. The assembly function largely eliminates this overhead by
linking assembly functions together with a minimal 'ABI'.

> What about ac3dsp then - that one seems like it's fairly optimized for arm?
>

I haven't touched them; they're still being used.

Unfortunately, for AC3, the full MDCT optimizations in lavc do make a
difference: the overall decoder becomes 15% slower with this patch on
for aarch64 with lavu/tx's asm disabled, and 7% slower with lavu/tx's asm
enabled. I do plan to write aarch64 MDCT NEON SIMD code in a month or so,
unless someone is faster, which should make the decoder at least 10%
faster with lavu/tx.

For Opus, the ptwo lengths used are (framesize/2)/15 = 32, 16, 8 and 4pt
FFTs.

If you'd like to help out, I've documented the C factorizations used in
docs/transforms.md. You could also try porting the existing assembly. It
should be trivial if it doesn't use the upper half of the tables.

lavc's and lavu's FFT tables differ in size: lavu's are half the size of
lavc's tables, because lavc's tables contain the multiplication factors
mirrored after the halfway point. That's used by the RDFT, and by the x86
assembly. It's not worth replicating this; the memory overhead is just too
much, especially on bandwidth-starved cores. If the arm32 assembly uses the
upper part, it shouldn't be too hard to make it read from both the start
and end point of the exptab array in the recombination function of ptwo
transforms.

The MDCT asm can be ported in a straightforward way and would improve both
decoders significantly. If the ABI is simpler than x86's, you could even
make the asm transform call into C functions, which would lessen the work.
A lot of the MDCT overhead is in the gather and multiplication part, whilst
the FFT is limited mostly by adds and memory bandwidth, so with MDCT
assembly alone the decoder would get a lot faster.
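
To make the table point above concrete, here is a minimal Python sketch (not FFmpeg code). It assumes a lavc-style cosine table of n/2 entries that is mirrored around index n/4, and a lavu-style table that stores only the first n/4 + 1 entries. Both layouts, and all function names (`full_mirrored_table`, `half_table`, `read_mirrored`), are assumptions made up for illustration. It shows that a read from the upper, mirrored part can be redirected to an index counted back from the end of the shorter table.

```python
import math


def full_mirrored_table(n):
    """Assumed lavc-style layout: n/2 entries, cos(2*pi*i/n) for i <= n/4,
    with the entries past n/4 being a mirror image of the ones before it."""
    tab = [0.0] * (n // 2)
    for i in range(n // 4 + 1):
        tab[i] = math.cos(2.0 * math.pi * i / n)
    for i in range(1, n // 4):
        tab[n // 2 - i] = tab[i]  # mirrored copy, no new information
    return tab


def half_table(n):
    """Assumed lavu-style layout: only the non-mirrored n/4 + 1 entries."""
    return [math.cos(2.0 * math.pi * i / n) for i in range(n // 4 + 1)]


def read_mirrored(half, n, k):
    """Emulate reading entry k of the full table using only the half table.

    Indices up to n/4 are read directly; indices above n/4 are folded back,
    which walks the half table backwards from its last entry.
    """
    if k <= n // 4:
        return half[k]
    return half[n // 2 - k]


if __name__ == "__main__":
    n = 32  # one of the ptwo lengths mentioned for Opus
    full = full_mirrored_table(n)
    half = half_table(n)
    same = all(full[k] == read_mirrored(half, n, k) for k in range(n // 2))
    print("entries in full table:", len(full))
    print("entries in half table:", len(half))
    print("folded reads match the full table:", same)
```

The folding costs one extra index computation per read in the upper half, which is the trade-off against keeping the larger table in memory.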