On 5/25/2024 8:24 PM, Lynne via ffmpeg-devel wrote:
On 26/05/2024 00:31, James Almer wrote:
On 5/25/2024 5:57 PM, Lynne via ffmpeg-devel wrote:
The inline asm function had issues running under checkasm.
So I came to finish what I started, and wrote the last part
of LPC computation in assembly.

autocorr_10_c: 135525.8
autocorr_10_sse2: 50729.8
autocorr_10_fma3: 19007.8
autocorr_30_c: 390100.8
autocorr_30_sse2: 142478.8
autocorr_30_fma3: 50559.8
autocorr_32_c: 407058.3
autocorr_32_sse2: 151633.3
autocorr_32_fma3: 50517.3
---
  libavcodec/x86/lpc.asm    | 91 +++++++++++++++++++++++++++++++++++++++
  libavcodec/x86/lpc_init.c | 87 ++++---------------------------------
  2 files changed, 100 insertions(+), 78 deletions(-)

diff --git a/libavcodec/x86/lpc.asm b/libavcodec/x86/lpc.asm
index a585c17ef5..790841b7f4 100644
--- a/libavcodec/x86/lpc.asm
+++ b/libavcodec/x86/lpc.asm
@@ -32,6 +32,8 @@ dec_tab_sse2: times 2 dq -2.0
  dec_tab_scalar: times 2 dq -1.0
  seq_tab_sse2: dq 1.0, 0.0
+autoc_init_tab: times 4 dq 1.0
+
  SECTION .text
  %macro APPLY_WELCH_FN 0
@@ -261,3 +263,92 @@ APPLY_WELCH_FN
  INIT_YMM avx2
  APPLY_WELCH_FN
  %endif
+
+%macro COMPUTE_AUTOCORR_FN 0
+cglobal lpc_compute_autocorr, 4, 7, 8, data, len, lag, autoc, lag_p, data_l, len_p

Already mentioned, but the vector register count should be 3, not 8.

Already done, as said on IRC not 10 minutes after I submitted it.


+
+    shl lagd, 3
+    shl lenq, 3
+    xor lag_pq, lag_pq
+
+.lag_l:
+    movaps m8, [autoc_init_tab]

This should be m2, not m8.

+
+    mov len_pq, lag_pq
+
+    lea data_lq, [lag_pq + mmsize - 8]
+    neg data_lq                     ; -j - mmsize
+    add data_lq, dataq              ; data[-j - mmsize]
+.len_l:
+    ; We waste the upper value here on SSE2,
+    ; but we use it on AVX.
+    movupd xm0, [dataq + len_pq]    ; data[i]

Use movsd here.

Fixed.


+    movupd m1, [data_lq + len_pq]   ; data[i - j]
+
+%if cpuflag(avx)

%if mmsize == 32 here and everywhere else.

Done.


+    vbroadcastsd m0, xm0

The register-source form of vbroadcastsd is AVX2. AVX only has the memory-operand form. So use that, and save the movsd from above for the FMA3 version.

+    vperm2f128 m1, m1, m1, 0x01

Aren't you loading 16 extra bytes for no reason if you're just going to use the upper 16 bytes from the load above?

Lane swapped, like you mentioned.

+%endif
+
+    shufpd m0, m0, m0, 1100b

The last argument has two bits, not four. What you're doing here is a splat/broadcast, so you don't need it for FMA3.

+    shufpd m1, m1, m1, 0101b

The upper two bits of imm8 are ignored.

Intentional. Not ignored on FMA3.
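
For anyone following along, here is a rough scalar model of the immediate handling being discussed. It is only a sketch: the function names shufpd_128 and vshufpd_256 are made up for illustration and are not part of the patch. It assumes the documented behaviour of the instruction: the 128-bit form looks at imm8 bits 0 and 1 only, while the 256-bit form uses bits 0-1 for the low lane and bits 2-3 for the high lane.

```c
/* Hypothetical scalar model of SHUFPD/VSHUFPD, for illustration only. */

/* 128-bit form: bit 0 picks dst[0] from a, bit 1 picks dst[1] from b.
 * Bits 2 and up have no effect. */
static void shufpd_128(double dst[2], const double a[2],
                       const double b[2], unsigned imm8)
{
    double lo = a[imm8 & 1];
    double hi = b[(imm8 >> 1) & 1];
    dst[0] = lo;
    dst[1] = hi;
}

/* 256-bit form: each 128-bit lane is shuffled independently.
 * Bits 0-1 control the low lane, bits 2-3 control the high lane. */
static void vshufpd_256(double dst[4], const double a[4],
                        const double b[4], unsigned imm8)
{
    double r0 = a[imm8 & 1];
    double r1 = b[(imm8 >> 1) & 1];
    double r2 = a[2 + ((imm8 >> 2) & 1)];
    double r3 = b[2 + ((imm8 >> 3) & 1)];
    dst[0] = r0;
    dst[1] = r1;
    dst[2] = r2;
    dst[3] = r3;
}
```

With both sources set to the same register, 0101b swaps the two doubles within each lane of the 256-bit form, and gives the same result as 01b in the 128-bit form.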

+
+%if cpuflag(fma3)
+    fmaddpd m8, m0, m1, m8          ; sum += data[i]*data[i-j]
+%else
+    mulpd m0, m1
+    addpd m8, m0                    ; sum += data[i]*data[i-j]
+%endif
+
+    add len_pq, 8
+    cmp len_pq, lenq
+    jl .len_l
+
+    movups [autocq + lag_pq], m8    ; autoc[j] = sum
+    add lag_pq, mmsize
+    cmp lag_pq, lagq
+    jl .lag_l
+
+    ; The tail computation is guaranteed never to happen
+    ; as long as we're doing multiples of 4, rather than 2.
+    ; It is trivial to convert this to avx if ever needed.
+%if !cpuflag(avx)

This doesn't seem to be tested as is. Maybe the checkasm should try other lag values?

That's for the checkasm patch. You can trigger this check with
fate-alac-16-lpc-orders as-is.

Checkasm should test the entire function, so if an odd lag value will trigger this chunk, it should be tested.
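
As a rough illustration of what covering odd lags could look like, here is a minimal scalar sketch. It is not the existing C implementation nor the checkasm test; autocorr_ref and autocorr_matches are hypothetical names. It assumes each sum starts at 1.0 (as autoc_init_tab does in the patch) and accumulates data[i]*data[i-j] for i from j to len-1, per the comments in the asm.

```c
#include <stddef.h>

/* Hypothetical scalar reference: autoc[j] = 1.0 + sum(data[i] * data[i - j])
 * for i in [j, len). Works for any lag, odd or even. */
static void autocorr_ref(const double *data, ptrdiff_t len, int lag,
                         double *autoc)
{
    for (int j = 0; j < lag; j++) {
        double sum = 1.0;
        for (ptrdiff_t i = j; i < len; i++)
            sum += data[i] * data[i - j];
        autoc[j] = sum;
    }
}

/* Hypothetical helper: returns 1 if the first n values agree within eps. */
static int autocorr_matches(const double *a, const double *b, int n,
                            double eps)
{
    for (int j = 0; j < n; j++) {
        double diff = a[j] - b[j];
        if (diff < 0.0)
            diff = -diff;
        if (diff > eps)
            return 0;
    }
    return 1;
}
```

A test could then run the SIMD version with several lag values, including odd ones, and compare each result against autocorr_ref with autocorr_matches.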