Re: [FFmpeg-devel] [PATCH] lpc: rewrite lpc_compute_autocorr in external asm

James Almer Sat, 25 May 2024 17:09:11 -0700

On 5/25/2024 9:02 PM, Lynne via ffmpeg-devel wrote:

On 26/05/2024 00:45, James Almer wrote:
On 5/25/2024 7:31 PM, James Almer wrote:
On 5/25/2024 5:57 PM, Lynne via ffmpeg-devel wrote:
The inline asm function had issues running under checkasm.
So I came to finish what I started, and wrote the last part
of LPC computation in assembly.
autocorr_10_c: 135525.8
autocorr_10_sse2: 50729.8
autocorr_10_fma3: 19007.8
autocorr_30_c: 390100.8
autocorr_30_sse2: 142478.8
autocorr_30_fma3: 50559.8
autocorr_32_c: 407058.3
autocorr_32_sse2: 151633.3
autocorr_32_fma3: 50517.3
---
  libavcodec/x86/lpc.asm    | 91 +++++++++++++++++++++++++++++++++++++++
  libavcodec/x86/lpc_init.c | 87 ++++---------------------------------
  2 files changed, 100 insertions(+), 78 deletions(-)

diff --git a/libavcodec/x86/lpc.asm b/libavcodec/x86/lpc.asm
index a585c17ef5..790841b7f4 100644
--- a/libavcodec/x86/lpc.asm
+++ b/libavcodec/x86/lpc.asm
@@ -32,6 +32,8 @@ dec_tab_sse2: times 2 dq -2.0
  dec_tab_scalar: times 2 dq -1.0
  seq_tab_sse2: dq 1.0, 0.0
+autoc_init_tab: times 4 dq 1.0
+
  SECTION .text
  %macro APPLY_WELCH_FN 0
@@ -261,3 +263,92 @@ APPLY_WELCH_FN
  INIT_YMM avx2
  APPLY_WELCH_FN
  %endif
+
+%macro COMPUTE_AUTOCORR_FN 0
+cglobal lpc_compute_autocorr, 4, 7, 8, data, len, lag, autoc, lag_p, data_l, len_p
Already mentioned, but it should be 3 not 8.
+
+    shl lagd, 3
+    shl lenq, 3
+    xor lag_pq, lag_pq
+
+.lag_l:
+    movaps m8, [autoc_init_tab]
m2
+
+    mov len_pq, lag_pq
+
+    lea data_lq, [lag_pq + mmsize - 8]
+    neg data_lq                     ; -j - mmsize
+    add data_lq, dataq              ; data[-j - mmsize]
+.len_l:
+    ; We waste the upper value here on SSE2,
+    ; but we use it on AVX.
+    movupd xm0, [dataq + len_pq]    ; data[i]
movsd
+    movupd m1, [data_lq + len_pq]   ; data[i - j]
+
+%if cpuflag(avx)
%if mmsize == 32 here and everywhere else.
+    vbroadcastsd m0, xm0
This is AVX2. AVX only has the memory input form. So use that and save the movsd from above for the FMA3 version.
+    vperm2f128 m1, m1, m1, 0x01
Aren't you loading 16 extra bytes for no reason if you're just going to use the upper 16 bytes from the load above?
Nevermind, this is swapping lanes.
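For anyone following the lane-swap remark: with an immediate of 0x01 and the same register as both sources, vperm2f128 exchanges the two 128-bit halves of a 256-bit register. Below is a scalar C model of that one case. The function name is hypothetical, and it models only imm8 = 0x01 with identical sources, not the full instruction.

```c
#include <stddef.h>

/* Scalar model of `vperm2f128 dst, src, src, 0x01` on four doubles.
 * imm8[1:0] = 1 puts the high lane of src into the low lane of dst;
 * imm8[5:4] = 0 puts the low lane of src into the high lane of dst.
 * swap_lanes_model is a hypothetical name used for illustration only. */
static void swap_lanes_model(double dst[4], const double src[4])
{
    double tmp[4];

    tmp[0] = src[2];  /* dst low lane  <- src high lane */
    tmp[1] = src[3];
    tmp[2] = src[0];  /* dst high lane <- src low lane  */
    tmp[3] = src[1];

    /* Copy through a temporary so dst and src may alias. */
    for (size_t i = 0; i < 4; i++)
        dst[i] = tmp[i];
}
```

Feeding it {1.0, 2.0, 3.0, 4.0} yields {3.0, 4.0, 1.0, 2.0}.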
That aside, these versions are barely better and sometimes worse in all my tests on win64 with GCC with certain seeds.
For example, seed 4022958484 gives me:

autocorr_10_c: 21345.6
autocorr_10_sse2: 16434.6
autocorr_10_fma3: 24154.6
autocorr_30_c: 59239.1
autocorr_30_sse2: 46114.6
autocorr_30_fma3: 64147.1
autocorr_32_c: 63022.1
autocorr_32_sse2: 50040.1
autocorr_32_fma3: 66594.1

But seed 2236774811 gives me:

autocorr_10_c: 37135.3
autocorr_10_sse2: 26492.3
autocorr_10_fma3: 32943.3
autocorr_30_c: 102266.8
autocorr_30_sse2: 72933.3
autocorr_30_fma3: 85808.3
autocorr_32_c: 106537.8
autocorr_32_sse2: 77623.3
autocorr_32_fma3: 85844.3
But if i force len to always be 4999 instead of its value varying depending on seed, i consistently get things like:
autocorr_10_c: 40447.3
autocorr_10_sse2: 39526.8
autocorr_10_fma3: 42955.3
autocorr_30_c: 111362.3
autocorr_30_sse2: 111408.3
autocorr_30_fma3: 116781.8
autocorr_32_c: 122388.3
autocorr_32_sse2: 119125.3
autocorr_32_fma3: 114239.3
It would help if someone else could confirm this, but overall i don't see any worthwhile gain here. The old inline version, for those seeds where it worked, was somewhat faster.
The metrics given are on Zen 3, with Clang with compiler optimizations disabled.
We do not rely on compiler optimizations, and have plenty of assembly which turns out to be slower than modern compilers autovectorizing (even though we disable tree vectorization on GCC, that does not apply to simple loops like this one). On the other hand, we also support ancient compilers, and compilers which have no understanding of vectorization at all.
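For context, the kind of simple loop under discussion looks roughly like the scalar sketch below. This is not FFmpeg's actual C implementation: the function name is hypothetical, the accumulator starting at 1.0 is an assumption taken from autoc_init_tab in the patch, and writing lag + 1 outputs is likewise an assumption.

```c
#include <stddef.h>

/* Scalar autocorrelation sketch (hypothetical, for illustration):
 *   autoc[j] = 1.0 + sum over i in [j, len) of data[i] * data[i - j]
 * for j = 0 .. lag inclusive, so autoc must hold lag + 1 doubles.
 * The 1.0 starting value mirrors autoc_init_tab from the patch. */
static void autocorr_sketch(const double *data, ptrdiff_t len,
                            int lag, double *autoc)
{
    for (int j = 0; j <= lag; j++) {
        double sum = 1.0;

        /* Start at i = j so data[i - j] never reads before data[0]. */
        for (ptrdiff_t i = j; i < len; i++)
            sum += data[i] * data[i - j];

        autoc[j] = sum;
    }
}
```

The inner loop is a plain multiply-accumulate over contiguous doubles, which is the shape of loop a compiler may vectorize on its own.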

Tree vectorization is disabled everywhere, including my target (GCC 14, mingw-w64, Alder Lake i7).

To illustrate how different results can look on different arches and compilers, and even platforms (you mentioned you tested only on win64):


Zen 3, gcc-9, O2:
autocorr_10_c: 48796.8
autocorr_10_sse2: 39571.8
autocorr_10_fma3: 30272.8
autocorr_30_c: 138499.3
autocorr_30_sse2: 114091.3
autocorr_30_fma3: 82114.3
autocorr_32_c: 146466.8
autocorr_32_sse2: 118400.8
autocorr_32_fma3: 80473.8

Zen 3, gcc-14, O2:
autocorr_10_c: 44981.3
autocorr_10_sse2: 36481.3
autocorr_10_fma3: 18418.8
autocorr_30_c: 129462.8
autocorr_30_sse2: 104175.3
autocorr_30_fma3: 48670.3
autocorr_32_c: 135625.3
autocorr_32_sse2: 109079.8
autocorr_32_fma3: 48670.3

Zen 3, clang-18, O2:
autocorr_10_c: 51872.6
autocorr_10_sse2: 48311.1
autocorr_10_fma3: 30070.1
autocorr_30_c: 145899.6
autocorr_30_sse2: 135793.1
autocorr_30_fma3: 79922.6
autocorr_32_c: 160443.1
autocorr_32_sse2: 147591.1
autocorr_32_fma3: 80075.6

Skylake, gcc-14, O2:
autocorr_10_c: 149251.0
autocorr_10_sse2: 133769.5
autocorr_10_fma3: 72886.0
autocorr_30_c: 396145.0
autocorr_30_sse2: 376618.5
autocorr_30_fma3: 194116.5
autocorr_32_c: 413219.0
autocorr_32_sse2: 400867.5
autocorr_32_fma3: 194117.5

Skylake, clang-18, O2:
autocorr_10_c: 153825.3
autocorr_10_sse2: 133774.3
autocorr_10_fma3: 72883.8
autocorr_30_c: 398339.8
autocorr_30_sse2: 376603.8
autocorr_30_fma3: 194098.8
autocorr_32_c: 432183.3
autocorr_32_sse2: 422583.8
autocorr_32_fma3: 194094.3

I see no such boost at all. You're getting twice the performance on fma3 than sse2 whereas i get fma3 worse than sse2 in many cases. There is something fishy going on, hence me asking others to check to see if they can reproduce it.
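To put numbers on that gap, a trivial helper (hypothetical name, arithmetic only) can turn two checkasm timings into a ratio:

```c
/* Ratio of a baseline timing to a candidate timing.
 * Above 1.0: the candidate is faster. Below 1.0: it is slower.
 * speedup_ratio is a hypothetical helper, not part of checkasm. */
static double speedup_ratio(double baseline, double candidate)
{
    return baseline / candidate;
}
```

With the autocorr_32 figures quoted in this thread, speedup_ratio(109079.8, 48670.3) for Zen 3 with gcc-14 is about 2.24, while speedup_ratio(50040.1, 66594.1) for the win64 run with seed 4022958484 is about 0.75.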

<Insert your favorite decade old compiler here>
But again, this is irrelevant, as we do not rely on compilers for optimizations. We help them as much as we can, and when they work, it's nice, but that is in no way reliable, especially to turn down a patch like this.
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
