On 8/17/2024 10:48 PM, Nuo Mi wrote:
+    pxor                    m6, m6
+    phaddw                 m%2, m6
+    phaddw                 m%2, m6

Horizontal adds are slow. Can't you do this with normal adds, shifts and a blend?
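
To make the suggestion concrete, here is a rough scalar C model of one 128-bit lane (not the actual asm, and not tested against the patch). It compares the two phaddw against zero with a psrld/paddw/psrlq/paddw sequence. The model_* helpers and the sum4_* names are made up for illustration. Note the assumption that the two sums may end up in words 0 and 4 instead of words 0 and 1, so the shuffles that follow would have to be adjusted, and the leftover words would still need a mask or blend.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of one 128-bit lane: eight 16-bit words, word 0 lowest. */

/* phaddw dst, src: pairwise sums of dst land in words 0..3, of src in 4..7. */
static void model_phaddw(uint16_t dst[8], const uint16_t src[8])
{
    uint16_t out[8];
    for (int i = 0; i < 4; i++) {
        out[i]     = (uint16_t)(dst[2 * i] + dst[2 * i + 1]);
        out[i + 4] = (uint16_t)(src[2 * i] + src[2 * i + 1]);
    }
    memcpy(dst, out, sizeof(out));
}

/* psrld by 16: the high word of each dword moves to its low word. */
static void model_psrld16(uint16_t dst[8], const uint16_t src[8])
{
    for (int i = 0; i < 4; i++) {
        dst[2 * i]     = src[2 * i + 1];
        dst[2 * i + 1] = 0;
    }
}

/* psrlq by 32: the high dword of each qword moves to its low dword. */
static void model_psrlq32(uint16_t dst[8], const uint16_t src[8])
{
    for (int q = 0; q < 2; q++) {
        dst[4 * q]     = src[4 * q + 2];
        dst[4 * q + 1] = src[4 * q + 3];
        dst[4 * q + 2] = 0;
        dst[4 * q + 3] = 0;
    }
}

/* paddw dst, src: wrapping 16-bit add per word. */
static void model_paddw(uint16_t dst[8], const uint16_t src[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = (uint16_t)(dst[i] + src[i]);
}

/* What the patch does: two phaddw against a zeroed register.
 * The sums of words 0..3 and 4..7 end up in words 0 and 1. */
void sum4_phaddw(const uint16_t in[8], uint16_t out[2])
{
    uint16_t v[8], zero[8] = { 0 };
    memcpy(v, in, sizeof(v));
    model_phaddw(v, zero);
    model_phaddw(v, zero);
    out[0] = v[0];
    out[1] = v[1];
}

/* Hypothetical replacement: shift + add twice.
 * The same two sums end up in words 0 and 4; the other words hold
 * partial sums and would have to be masked or blended away. */
void sum4_shift_add(const uint16_t in[8], uint16_t out[2])
{
    uint16_t v[8], t[8];
    memcpy(v, in, sizeof(v));
    model_psrld16(t, v);
    model_paddw(v, t);      /* word 2i  = in[2i] + in[2i+1]       */
    model_psrlq32(t, v);
    model_paddw(v, t);      /* word 4q  = in[4q] + ... + in[4q+3] */
    out[0] = v[0];
    out[1] = v[4];
}
```

Both functions wrap at 16 bits the same way, so they agree for any input; only the position of the second sum differs.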

+    vpermq                 m%2, m%2, q0020
+    pshufd                 m%2, m%2, q1120
+    pmovsxwd               m%2, xmm%2               ; 4 sgxgy
+
+    pmulld                 m%2, m11                 ; 4 vx * sgxgy

Similarly, pmulld is super slow (ten cycles on some architectures), and that's on top of a pmovsx. Since you have m6 zeroed already, wouldn't pmaddwd work here? The pd_15 and pd_m15 constants would need to be changed to words, as would the values to be clipped.
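
As a sanity check of the idea, a small scalar C sketch of a single dword (again not the asm). It assumes sgxgy is interleaved with the zeroed m6 (for example with a punpcklwd) and that vx has been clipped to [-15, 15] and stored as words, which is my reading of the pd_15/pd_m15 constants. All function names here are made up for illustration.

```c
#include <stdint.h>

/* pmaddwd on one dword: two signed 16x16 products, summed into 32 bits. */
static int32_t model_pmaddwd(int16_t a_lo, int16_t a_hi,
                             int16_t b_lo, int16_t b_hi)
{
    uint32_t p0 = (uint32_t)((int32_t)a_lo * b_lo);
    uint32_t p1 = (uint32_t)((int32_t)a_hi * b_hi);
    return (int32_t)(p0 + p1);
}

/* What the patch does: pmovsxwd widens sgxgy, pmulld keeps the low
 * 32 bits of the 32x32 product. */
int32_t mul_via_pmulld(int16_t sgxgy, int32_t vx)
{
    int64_t wide = (int64_t)(int32_t)sgxgy * vx;
    return (int32_t)(uint32_t)wide;
}

/* Hypothetical replacement: sgxgy sits in the low word with a zero word
 * next to it, so the second product is always 0 and whatever is in the
 * high word of vx does not matter. */
int32_t mul_via_pmaddwd(int16_t sgxgy, int16_t vx_lo, int16_t vx_hi)
{
    return model_pmaddwd(sgxgy, 0, vx_lo, vx_hi);
}

/* Exhaustive check over every 16-bit sgxgy and every vx in [-15, 15].
 * Returns 1 if both paths always agree, 0 otherwise. */
int pmaddwd_matches_pmulld(void)
{
    for (int s = -32768; s <= 32767; s++) {
        for (int vx = -15; vx <= 15; vx++) {
            int32_t ref = mul_via_pmulld((int16_t)s, vx);
            int32_t alt = mul_via_pmaddwd((int16_t)s, (int16_t)vx,
                                          (int16_t)vx);
            if (ref != alt)
                return 0;
        }
    }
    return 1;
}
```

The point is only that a single pmaddwd gives the same 32-bit result as long as both factors fit in signed words, which is why the clip constants and the clipped values would have to become words.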

+    psrad                  m%2, 1

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".