On Sunday, 26 May 2024, 1:31:18 EEST, James Almer wrote:
> On 5/25/2024 5:57 PM, Lynne via ffmpeg-devel wrote:
> > The inline asm function had issues running under checkasm.
> > So I came to finish what I started, and wrote the last part
> > of LPC computation in assembly.
> >
> > autocorr_10_c: 135525.8
> > autocorr_10_sse2: 50729.8
> > autocorr_10_fma3: 19007.8
> > autocorr_30_c: 390100.8
> > autocorr_30_sse2: 142478.8
> > autocorr_30_fma3: 50559.8
> > autocorr_32_c: 407058.3
> > autocorr_32_sse2: 151633.3
> > autocorr_32_fma3: 50517.3
> > ---
> >
> >  libavcodec/x86/lpc.asm    | 91 +++++++++++++++++++++++++++++++++++++++
> >  libavcodec/x86/lpc_init.c | 87 ++++--------------------------------
> >  2 files changed, 100 insertions(+), 78 deletions(-)
> >
> > diff --git a/libavcodec/x86/lpc.asm b/libavcodec/x86/lpc.asm
> > index a585c17ef5..790841b7f4 100644
> > --- a/libavcodec/x86/lpc.asm
> > +++ b/libavcodec/x86/lpc.asm
> > @@ -32,6 +32,8 @@ dec_tab_sse2: times 2 dq -2.0
> >
> >  dec_tab_scalar: times 2 dq -1.0
> >  seq_tab_sse2: dq 1.0, 0.0
> >
> > +autoc_init_tab: times 4 dq 1.0
> > +
> >
> >  SECTION .text
> >
> >  %macro APPLY_WELCH_FN 0
> >
> > @@ -261,3 +263,92 @@ APPLY_WELCH_FN
> >
> >  INIT_YMM avx2
> >  APPLY_WELCH_FN
> >  %endif
> >
> > +
> > +%macro COMPUTE_AUTOCORR_FN 0
> > +cglobal lpc_compute_autocorr, 4, 7, 8, data, len, lag, autoc, lag_p, data_l, len_p
> Already mentioned, but it should be 3 not 8.
>
> > +
> > +    shl lagd, 3
> > +    shl lenq, 3
> > +    xor lag_pq, lag_pq
> > +
> > +.lag_l:
> > +    movaps m8, [autoc_init_tab]
>
> m2
>
> > +
> > +    mov len_pq, lag_pq
> > +
> > +    lea data_lq, [lag_pq + mmsize - 8]
> > +    neg data_lq ; -j - mmsize
> > +    add data_lq, dataq ; data[-j - mmsize]
> > +.len_l:
> > +    ; We waste the upper value here on SSE2,
> > +    ; but we use it on AVX.
> > +    movupd xm0, [dataq + len_pq] ; data[i]
>
> movsd
>
> > +    movupd m1, [data_lq + len_pq] ; data[i - j]
> > +
> > +%if cpuflag(avx)
>
> %if mmsize == 32 here and everywhere else.
>
> > +    vbroadcastsd m0, xm0
>
> This is AVX2. AVX only has memory input argument. So use that and save
> the movsd from above for the FMA3 version.
>
> > +    vperm2f128 m1, m1, m1, 0x01
>
> Aren't you loading 16 extra bytes for no reason if you're just going to
> use the upper 16 bytes from the load above?
>
> > +%endif
> > +
> > +    shufpd m0, m0, m0, 1100b
>
> The last argument has two bits, not four. What you're doing here is a
> splat/broadcast, so you don't need it for FMA3.
>
> > +    shufpd m1, m1, m1, 0101b
>
> The upper two bits of imm8 are ignored.
>
> > +
> > +%if cpuflag(fma3)
> > +    fmaddpd m8, m0, m1, m8 ; sum += data[i]*data[i-j]
> > +%else
> > +    mulpd m0, m1
> > +    addpd m8, m0 ; sum += data[i]*data[i-j]
> > +%endif
> > +
> > +    add len_pq, 8
> > +    cmp len_pq, lenq
> > +    jl .len_l
> > +
> > +    movups [autocq + lag_pq], m8 ; autoc[j] = sum
> > +    add lag_pq, mmsize
> > +    cmp lag_pq, lagq
> > +    jl .lag_l
> > +
> > +    ; The tail computation is guaranteed never to happen
> > +    ; as long as we're doing multiples of 4, rather than 2.
> > +    ; It is trivial to convert this to avx if ever needed.
> > +%if !cpuflag(avx)
>
> This doesn't seem to be tested as is. Maybe the checkasm should try
> other lag values?

Uh, my patch tests 10, 30 and 32, so I am not clear what you think is
missing here.

-- 
レミ・デニ-クールモン
http://www.remlab.net/
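
For reference while reading the quoted loops, here is a minimal scalar sketch in C of what the inline comments describe: each sum starts at 1.0 (as autoc_init_tab suggests), accumulates data[i]*data[i-j], and is stored to autoc[j]. The function name autocorr_ref and its signature are hypothetical and chosen for illustration; this is not FFmpeg's C implementation. It also assumes i starts at j for every lag, whereas the vectorised code processes several lags per iteration and may read samples before data[0].

```c
#include <stddef.h>

/* Hypothetical scalar reference, derived only from the comments in the
 * quoted assembly; not FFmpeg's actual C code.
 * Computes autoc[j] = 1.0 + sum_{i=j}^{len-1} data[i] * data[i - j]
 * for j = 0 .. lag - 1. */
static void autocorr_ref(const double *data, size_t len, size_t lag,
                         double *autoc)
{
    for (size_t j = 0; j < lag; j++) {
        double sum = 1.0;                 /* assumed initial value (autoc_init_tab) */
        for (size_t i = j; i < len; i++)
            sum += data[i] * data[i - j]; /* sum += data[i]*data[i-j] */
        autoc[j] = sum;                   /* autoc[j] = sum */
    }
}
```

A scalar version like this can be called with any lag count, which makes it a convenient baseline when comparing outputs for lag values that are not multiples of the vector width.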