On 8/17/2024 10:48 PM, Nuo Mi wrote:
+ pxor m6, m6
+ phaddw m%2, m6
+ phaddw m%2, m6
Horizontal adds are slow. Can't you do this with normal adds, shifts and a blend?
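To illustrate what I mean, here is a rough scalar C model of one 128-bit lane, not asm and not tested against the patch. It assumes the two phaddw against zero are only there to produce the two sums of four adjacent words, and it shows that two shift+add steps (psrld by 16, then psrlq by 32, each followed by paddw) give the same sums. The type and function names are made up for this sketch. The sums land in words 0 and 4 instead of words 0 and 1, which is where the blend or shuffle would come in.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 128-bit lane holding eight 16-bit words. */
typedef struct { uint16_t w[8]; } lane_t;

/* phaddw dst, zero: pairwise sums in the low four words, zeros above. */
static lane_t phaddw_zero(lane_t a)
{
    lane_t r = {{0}};
    for (int i = 0; i < 4; i++)
        r.w[i] = (uint16_t)(a.w[2 * i] + a.w[2 * i + 1]);
    return r;
}

/* psrld by 16: in each dword, the high word moves down to the low word. */
static lane_t psrld16(lane_t a)
{
    lane_t r = {{0}};
    for (int i = 0; i < 4; i++)
        r.w[2 * i] = a.w[2 * i + 1];
    return r;
}

/* psrlq by 32: in each qword, the high dword moves down to the low dword. */
static lane_t psrlq32(lane_t a)
{
    lane_t r = {{0}};
    for (int i = 0; i < 2; i++) {
        r.w[4 * i]     = a.w[4 * i + 2];
        r.w[4 * i + 1] = a.w[4 * i + 3];
    }
    return r;
}

/* paddw: wrapping 16-bit add per word. */
static lane_t paddw(lane_t a, lane_t b)
{
    lane_t r;
    for (int i = 0; i < 8; i++)
        r.w[i] = (uint16_t)(a.w[i] + b.w[i]);
    return r;
}

/* Two shift+add steps: sums of four end up in words 0 and 4.
 * The other words hold partial sums and must be ignored or masked. */
static lane_t sum4_shift_add(lane_t a)
{
    a = paddw(a, psrld16(a));
    a = paddw(a, psrlq32(a));
    return a;
}

/* Compare against the double phaddw, which puts the sums in words 0 and 1. */
static void check_sum4(lane_t a)
{
    lane_t h = phaddw_zero(phaddw_zero(a));
    lane_t s = sum4_shift_add(a);
    assert(h.w[0] == s.w[0]);
    assert(h.w[1] == s.w[4]);
}
```

Whether this is actually faster depends on how cheaply the two results can be moved to where the following shuffles expect them, so it would need benchmarking.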
+ vpermq m%2, m%2, q0020
+ pshufd m%2, m%2, q1120
+ pmovsxwd m%2, xmm%2 ; 4 sgxgy
+
+ pmulld m%2, m11 ; 4 vx * sgxgy
Similarly, pmulld is super slow (ten cycles on some architectures), and that's on top of a pmovsx. Since you have m6 zeroed already, wouldn't pmaddwd work here? The pd_15 and pd_m15 constants would need to be changed to words, as would the values to be clipped.
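A quick scalar C model of a single dword lane, to show why this should be equivalent. It assumes vx has already been clipped to [-15, 15] so that it fits in a word, and that each sgxgy word gets paired with a zero word before the pmaddwd (for example by interleaving with m6, though the exact instruction is my assumption). The function names are made up for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* One dword lane of pmaddwd: a0*b0 + a1*b1 on signed 16-bit inputs.
 * Computed in 64 bits and truncated so the model has no signed overflow. */
static int32_t pmaddwd_lane(int16_t a0, int16_t a1, int16_t b0, int16_t b1)
{
    int64_t sum = (int64_t)a0 * b0 + (int64_t)a1 * b1;
    return (int32_t)(uint32_t)sum;
}

/* One dword lane of pmulld: low 32 bits of the 32x32 product. */
static int32_t pmulld_lane(int32_t a, int32_t b)
{
    return (int32_t)(uint32_t)((int64_t)a * b);
}

/* Path in the patch: sign-extend the word (pmovsxwd), then pmulld. */
static int32_t mul_via_pmulld(int16_t sgxgy, int32_t vx)
{
    return pmulld_lane((int32_t)sgxgy, vx);
}

/* Suggested path: the word is paired with a zero word, so the second
 * product of pmaddwd contributes nothing, whatever sits next to vx. */
static int32_t mul_via_pmaddwd(int16_t sgxgy, int16_t vx)
{
    return pmaddwd_lane(sgxgy, 0, vx, vx);
}

/* Exhaustive check over all 16-bit sgxgy and the assumed vx range. */
static void check_equivalence(void)
{
    for (int s = INT16_MIN; s <= INT16_MAX; s++)
        for (int vx = -15; vx <= 15; vx++)
            assert(mul_via_pmulld((int16_t)s, vx) ==
                   mul_via_pmaddwd((int16_t)s, (int16_t)vx));
}
```

Since the product of two signed words always fits in a signed dword, the result matches pmulld exactly as long as both operands fit in 16 bits.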
+ psrad m%2, 1