Hello,

>+pb_shuffle_low: times 4 db 1, 3, 5, 7, 9, 11, 13, 15, -1, -1, -1, -1, -1, -1, -1, -1

Why do we need "times 4" here? AVX2 provides the VPBROADCASTQ instruction, which can load a constant like this into a SIMD register.
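
To make the question concrete, here is a minimal scalar C sketch of what a qword broadcast followed by a per-lane byte shuffle would compute. It is only a model for discussion, not code from the patch: the helper names broadcast_q and shuffle_bytes are hypothetical, and the sketch assumes the usual AVX2 behaviour that VPSHUFB shuffles each 128-bit lane independently and writes zero for mask bytes with the top bit set.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of VPBROADCASTQ: copy one 8-byte pattern
 * into all four qwords of a 32-byte (256-bit) register image. */
void broadcast_q(uint8_t dst[32], const uint8_t pattern[8])
{
    for (int q = 0; q < 4; q++)
        memcpy(dst + 8 * q, pattern, 8);
}

/* Hypothetical scalar model of the 256-bit VPSHUFB: each 16-byte lane is
 * shuffled independently, the low 4 bits of a mask byte select the source
 * byte within the lane, and a mask byte with bit 7 set yields zero. */
void shuffle_bytes(uint8_t dst[32], const uint8_t src[32],
                   const uint8_t mask[32])
{
    assert(dst != src); /* this model does not support in-place shuffles */
    for (int lane = 0; lane < 2; lane++) {
        for (int i = 0; i < 16; i++) {
            uint8_t m = mask[16 * lane + i];
            dst[16 * lane + i] = (m & 0x80) ? 0 : src[16 * lane + (m & 0x0F)];
        }
    }
}
```

With the 8-byte pattern 1, 3, 5, 7, 9, 11, 13, 15 broadcast this way, the low 8 bytes of each lane receive the odd-indexed source bytes of that lane, and the upper 8 bytes repeat them. A qword broadcast therefore does not reproduce the -1 (zeroing) tail of the table in the patch; whether that matters depends on how the upper half is used afterwards, which is not visible in the quoted hunk.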

Moreover, the same algorithm could also be applied to the U/V planes for a further improvement.

Regards,
Min Chen

At 2021-09-30 09:56:11, "Wu Jianhua" <jianhua...@intel.com> wrote:
>With the accelerating by means of AVX2, the uyvytoyuv422 can be faster
>
>Performance data(Less is better):
>    uyvytoyuv422_sse2    0.50388
>    uyvytoyuv422_avx     0.46132
>    uyvytoyuv422_avx2    0.27309
>
>Signed-off-by: Wu Jianhua <jianhua...@intel.com>
>---
> libswscale/x86/rgb2rgb.c     |  6 ++++
> libswscale/x86/rgb_2_rgb.asm | 60 ++++++++++++++++++++++++++++--------
> 2 files changed, 53 insertions(+), 13 deletions(-)
>
>diff --git a/libswscale/x86/rgb2rgb.c b/libswscale/x86/rgb2rgb.c
>index c9ff33ab77..a965a1755c 100644
>--- a/libswscale/x86/rgb2rgb.c
>+++ b/libswscale/x86/rgb2rgb.c
>@@ -164,6 +164,9 @@ void ff_uyvytoyuv422_sse2(uint8_t *ydst, uint8_t *udst, uint8_t *vdst,
> void ff_uyvytoyuv422_avx(uint8_t *ydst, uint8_t *udst, uint8_t *vdst,
>                          const uint8_t *src, int width, int height,
>                          int lumStride, int chromStride, int srcStride);
>+void ff_uyvytoyuv422_avx2(uint8_t *ydst, uint8_t *udst, uint8_t *vdst,
>+                          const uint8_t *src, int width, int height,
>+                          int lumStride, int chromStride, int srcStride);
> #endif
>
>
>_______________________________________________
>ffmpeg-devel mailing list
>ffmpeg-devel@ffmpeg.org
>https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
>To unsubscribe, visit link above, or email
>ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".