On Sat, Jan 13, 2018 at 10:57 PM, Marton Balint <c...@passwd.hu> wrote:
> + .loop:
> +     movu          m0, [src1q + xq]
> +     movu          m1, [src2q + xq]
> +     punpckl%1%2   m5, m0, m2 ; 0e0f0g0h
> +     punpckh%1%2   m0, m2     ; 0a0b0c0d
> +     punpckl%1%2   m6, m1, m2 ; 0E0F0G0H
> +     punpckh%1%2   m1, m2     ; 0A0B0C0D
> +     pmull%2       m0, m3
> +     pmull%2       m5, m3
> +     pmull%2       m1, m4
> +     pmull%2       m6, m4
> +     padd%2        m0, m7
> +     padd%2        m5, m7
> +     padd%2        m0, m1
> +     padd%2        m5, m6
pmaddubsw should work here for the 8-bit case. pmaddwd might work for
the 16-bit case depending on how many bits are actually used (see the
first sketch at the end of this mail).

> +     pinsrw        xm3, r8m, 0  ; factor1
> +     pinsrw        xm4, r9m, 0  ; factor2
> +     pinsrw        xm7, r10m, 0 ; half
> +     SPLATW        m3, xm3
> +     SPLATW        m4, xm4
> +     SPLATW        m7, xm7

Use vpbroadcast* from memory on avx2, otherwise movd instead of
pxor+pinsrw (second sketch below).

> +     pxor          m3, m3
> +     pxor          m4, m4
> +     pxor          m7, m7
> +     pinsrw        xm3, r8m, 0  ; factor1
> +     pinsrw        xm4, r9m, 0  ; factor2
> +     pinsrw        xm7, r10m, 0 ; half
> +     XSPLATD       3
> +     XSPLATD       4
> +     XSPLATD       7

Ditto for the dword case.

> +     neg           word r11m     ; shift = -shift
> +     add           word r11m, 16 ; shift += 16
> +     pxor          m2, m2
> +     pinsrw        xm2, r11m, 0  ; 16 - shift
> +     pslld         m3, xm2
> +     pslld         m4, xm2
> +     pslld         m7, xm2

You probably want to use a temporary register instead of doing slow
load-modify-store instructions. Doing this in SIMD might be an option
as well, e.g. load data directly into vector regs from the stack,
shift, then splat (third sketch below).
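
To illustrate the pmaddubsw idea for the 8-bit case, a rough, untested
sketch (register names just follow the patch). It assumes the setup code
splats the byte pair {factor1, factor2} across m3 instead of splatting
the two factors into separate word registers, that both factors fit in a
signed byte, and that factor1 + factor2 <= 128 so the word result cannot
saturate:

    movu          m0, [src1q + xq]
    movu          m1, [src2q + xq]
    punpcklbw     m5, m0, m1    ; interleave src1/src2 bytes: e E f F ...
    punpckhbw     m0, m1        ;                             a A b B ...
    pmaddubsw     m5, m3        ; e*factor1 + E*factor2 in each word lane
    pmaddubsw     m0, m3        ; a*factor1 + A*factor2
    paddw         m5, m7        ; + half
    paddw         m0, m7

This folds the multiply and the add of the two sources into one
instruction per half. pmaddwd would do the same for the 16-bit case, but
it treats both operands as signed, hence the caveat about how many bits
of the samples are actually used.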
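
For the constant setup, this is the kind of thing I had in mind
(untested sketch; r8m as in the patch):

%if cpuflag(avx2)
    vpbroadcastw  m3, r8m       ; factor1 in every word lane, straight from memory
%else
    movd          xm3, r8m      ; movd zero-extends, so the pxor is not needed
    SPLATW        m3, xm3       ; only word 0 is used, the upper bits do not matter
%endif

Ditto for the dword path: vpbroadcastd on avx2, otherwise movd followed
by the existing XSPLATD, assuming the argument is a plain int so the
upper half of the loaded dword is zero.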
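
And for the shift, a sketch of the temporary-register version (untested;
r6d is only a stand-in for whatever scratch GPR is free):

    movzx         r6d, word r11m  ; shift
    neg           r6d
    add           r6d, 16         ; 16 - shift, computed in a register
    movd          xm2, r6d        ; zero-extends, no pxor needed
    pslld         m3, xm2
    pslld         m4, xm2
    pslld         m7, xm2

The SIMD alternative mentioned above (load the values from the stack
into vector regs, shift, then splat) would avoid touching the GPRs at
all.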