On Fri, Oct 21, 2022 at 5:41 AM Kieran Kunhya <kier...@obe.tv> wrote:
>
> Hi,
>
> Please see attached an attempt to optimise the 8-bit input to v210enc to
> reduce the number of shuffles.
> This comes at the cost of having to extract the middle element and perform
> a DWORD shift on it and then reinserting it.
> I have added a few comments but any other ideas are welcome.

Random untested idea:

A: db 32,  0, 48, -1,  1, 33,  2, -1, 49,  3, 34, -1,  4, 50,  5, -1
   db 35,  6, 51, -1,  7, 36,  8, -1, 52,  9, 37, -1, 10, 53, 11, -1
   db 38, 12, 54, -1, 13, 39, 14, -1, 55, 15, 40, -1, 16, 56, 17, -1
   db 41, 18, 57, -1, 19, 42, 20, -1, 58, 21, 43, -1, 22, 59, 23, -1
B: db 4, 0, 64, 0
C: dd 0x000ff000

[...]
    mova           m2, [A]
    vpbroadcastd   m3, [B]
    vpbroadcastd   m6, [C]
[...]
.loop:
    movu           ym1, [yq]            ; 32 bytes of luma in lanes 0-1
    vinserti32x4   m1, [uq], 2          ; 16 bytes of Cb in lane 2
    vinserti32x4   m1, [vq], 3          ; 16 bytes of Cr in lane 3
    CLIPUB         m1, m4, m5
    vpermb         m1, m2, m1           ; three samples per dword, top byte unused
    pmaddubsw      m0, m1, m3           ; (byte0 << 2) | (byte2 << 22)
    pslld          m1, 4                ; byte1 now sits at bits 12..19
    vpternlogd     m0, m1, m6, 0xd8     ; m6 ? m1 : m0, merges in the middle sample
    movu           [dstq], m0

I guess it could also be scaled to ymm if you're a big Skylake fan :P
(in which case you'd probably want to reorder the shuffle indices so that
chroma comes first, i.e. movq [u] + movhps [v] + vinserti32x4 [y])
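
For anyone who wants to sanity-check the idea without AVX-512 hardware, here is a rough scalar model in Python of one loop iteration (permute, multiply-add, dword shift, bitwise select), compared against the plain v210 packing of 8-bit samples. It is only a sketch: the function names are made up for illustration, CLIPUB is left out, and the multipliers, shift and mask are set to 4/64, 4 and 0x000ff000 so that the result matches the `<< 2`, `<< 12`, `<< 22` layout of the scalar packing. Treat those values as assumptions of this sketch.

```python
# Shuffle indices copied from table A in the post. Bytes 0-31 are luma,
# 32-47 are Cb, 48-63 are Cr. -1 marks the unused top byte of each dword.
A = [32, 0, 48, -1, 1, 33, 2, -1, 49, 3, 34, -1, 4, 50, 5, -1,
     35, 6, 51, -1, 7, 36, 8, -1, 52, 9, 37, -1, 10, 53, 11, -1,
     38, 12, 54, -1, 13, 39, 14, -1, 55, 15, 40, -1, 16, 56, 17, -1,
     41, 18, 57, -1, 19, 42, 20, -1, 58, 21, 43, -1, 22, 59, 23, -1]

# Assumed constants, chosen to reproduce the scalar reference below.
MULT = (4, 0, 64, 0)   # per-byte pmaddubsw multipliers within a dword
SHIFT = 4              # pslld amount applied to the whole dword
MASK = 0x000FF000      # bits taken from the shifted copy (middle sample)


def reference(y, u, v):
    """Plain v210 packing of 24 pixels: three samples per dword at bits 2, 12, 22."""
    out = []
    for i in range(0, 24, 6):
        j = i // 2
        groups = [(u[j], y[i], v[j]),
                  (y[i + 1], u[j + 1], y[i + 2]),
                  (v[j + 1], y[i + 3], u[j + 2]),
                  (y[i + 4], v[j + 2], y[i + 5])]
        out += [(a << 2) | (b << 12) | (c << 22) for a, b, c in groups]
    return out


def simd_model(y, u, v):
    """Scalar emulation of one iteration: 32 luma + 16 Cb + 16 Cr bytes in."""
    src = list(y[:32]) + list(u[:16]) + list(v[:16])
    assert len(src) == 64
    # vpermb only looks at the low 6 bits of each index, so -1 selects byte 63;
    # that byte is multiplied by 0 and masked out below.
    perm = [src[i & 63] for i in A]
    out = []
    for d in range(0, 64, 4):
        b = perm[d:d + 4]
        lo = b[0] * MULT[0] + b[1] * MULT[1]   # low 16-bit lane of pmaddubsw
        hi = b[2] * MULT[2] + b[3] * MULT[3]   # high 16-bit lane (no saturation here)
        madd = lo | (hi << 16)
        dword = b[0] | b[1] << 8 | b[2] << 16 | b[3] << 24
        shifted = (dword << SHIFT) & 0xFFFFFFFF
        # Bitwise select: mask bits from the shifted copy, the rest from madd.
        out.append((shifted & MASK) | (madd & ~MASK & 0xFFFFFFFF))
    return out


if __name__ == "__main__":
    y = [(7 * i + 3) & 255 for i in range(32)]
    u = [(11 * i + 100) & 255 for i in range(16)]
    v = [(13 * i + 200) & 255 for i in range(16)]
    print(simd_model(y, u, v) == reference(y, u, v))
```

One iteration consumes 24 luma and 12 + 12 chroma samples and produces 16 dwords, so the last 8 luma bytes and last 4 bytes of each chroma load are read but not used.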