Min Chen wrote:
> The current algorithm could be improved. Could you combine these
> optimizations with your patches? The extra VPERM makes the code a
> little slower.
>
> On Haswell
>
> Current algorithm:
>     RSHIFT_COPY m6, m2, 1   ; UYVY UYVY -> YVYU YVY...
>     pand        m6, m1      ; YxYx YxYx...
>     RSHIFT_COPY m7, m3, 1   ; UYVY UYVY -> YVYU YVY...
>     pand        m7, m1      ; YxYx YxYx...
>     packuswb    m6, m7      ; YYYY YYYY...
>
> Latency:
>     1 + 1 + 1 + 1 + 1 = 5
>
> Proposed:
>     pshufb      m6, m2, mX  ; UYVY UYVY -> xxxx YYYY
>     pshufb      m7, m3, mX
>     punpcklqdq  m6, m7      ; YYYY YYYY
>
> Latency:
>     1 + 1 + 1 = 3
>
> I guess the current algorithm was written for SSE2 compatibility,
> because PSHUFB was only added in SSSE3.
> Now that we are optimizing for AVX, AVX2 and AVX512, I suggest we use
> the proposed algorithm to get more performance.
>
> Regards,
> Min Chen

Hi Min Chen,

Thanks for the careful review. You're right: using the instructions
available with AVX2/AVX512 should be better. I'll try your proposal and
measure whether it performs better. If it does, I'll resubmit the
patches.

Best regards,
Jianhua
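
For reference, the two instruction sequences quoted in this thread can be checked against each other with a byte-level model. The sketch below is a scalar Python model, not the x86inc assembly from the patch. It assumes that RSHIFT_COPY behaves like a plain one-byte right shift of the whole register, that m1 holds the 0x00FF word mask, and that mX selects the odd bytes into the low half and zeroes the rest. The thread does not give the contents of mX, so that mask is a guess, and all function names are illustrative.

```python
# Byte-level model of the two Y-extraction sequences for UYVY input.
# A register is a 16-byte `bytes` object; index 0 is the lowest byte.
# Helper names and mask contents are assumptions, not taken from the patch.

WORD_MASK = bytes([0xFF, 0x00] * 8)                      # assumed content of m1
SHUF_MASK = bytes(range(1, 16, 2)) + bytes([0x80] * 8)   # assumed content of mX


def psrldq(x, n):
    """Shift the whole register right by n bytes, zero-filling the top."""
    return x[n:] + bytes(n)


def pand(a, b):
    """Bitwise AND of two registers."""
    return bytes(p & q for p, q in zip(a, b))


def packuswb(a, b):
    """Pack the signed 16-bit words of a, then b, into unsigned-saturated bytes."""
    def words(x):
        return [int.from_bytes(x[i:i + 2], "little", signed=True)
                for i in range(0, 16, 2)]
    return bytes(max(0, min(255, w)) for w in words(a) + words(b))


def pshufb(x, mask):
    """Pick x[mask & 15] for each byte; a set top bit in the mask gives zero."""
    return bytes(0 if m & 0x80 else x[m & 0x0F] for m in mask)


def punpcklqdq(a, b):
    """Low 64 bits of a followed by the low 64 bits of b."""
    return a[:8] + b[:8]


def extract_y_shift_pack(m2, m3):
    """Current sequence: shift, mask, shift, mask, pack."""
    m6 = pand(psrldq(m2, 1), WORD_MASK)
    m7 = pand(psrldq(m3, 1), WORD_MASK)
    return packuswb(m6, m7)


def extract_y_shuffle(m2, m3):
    """Proposed sequence: shuffle, shuffle, interleave the low quadwords."""
    m6 = pshufb(m2, SHUF_MASK)
    m7 = pshufb(m3, SHUF_MASK)
    return punpcklqdq(m6, m7)


if __name__ == "__main__":
    src = bytes((i * 37 + 11) % 256 for i in range(32))   # 32 bytes of UYVY
    m2, m3 = src[:16], src[16:]
    assert extract_y_shift_pack(m2, m3) == extract_y_shuffle(m2, m3) == src[1::2]
```

Under these assumptions both sequences return the sixteen Y bytes (the odd bytes of the UYVY input) in order. The model says nothing about speed, which still has to be measured on the target CPUs.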