On Sun, 14 Jan 2018, Henrik Gramner wrote:

On Sat, Jan 13, 2018 at 10:57 PM, Marton Balint <c...@passwd.hu> wrote:
+    .loop:
+        movu            m0, [src1q + xq]
+        movu            m1, [src2q + xq]
+        punpckl%1%2     m5, m0, m2         ; 0e0f0g0h
+        punpckh%1%2     m0, m2             ; 0a0b0c0d
+        punpckl%1%2     m6, m1, m2         ; 0E0F0G0H
+        punpckh%1%2     m1, m2             ; 0A0B0C0D
+        pmull%2         m0, m3
+        pmull%2         m5, m3
+        pmull%2         m1, m4
+        pmull%2         m6, m4
+        padd%2          m0, m7
+        padd%2          m5, m7
+        padd%2          m0, m1
+        padd%2          m5, m6

pmaddubsw should work here for the 8-bit case. pmaddwd might work for
the 16-bit case depending on how many bits are actually used.


As far as I can see, I have to make the blending factors 7-bit (15-bit for the 16-bit case) for this to work, because the pmadd* instructions operate on signed integers. Losing 1 bit of precision in the blending factors is probably not a problem for the framerate filter.
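
To illustrate the signedness issue, here is a rough scalar model of a single 16-bit output lane of pmaddubsw (unsigned bytes multiplied by signed bytes, pairs summed with signed saturation). The helper names are made up for this sketch and the pixel and factor values are arbitrary; it only shows why a factor of 128 or more cannot sit in the signed operand.

```python
def to_s8(v):
    """Reinterpret the low 8 bits of v as a signed byte (-128..127)."""
    v &= 0xFF
    return v - 256 if v >= 128 else v


def sat_s16(v):
    """Saturate to the signed 16-bit range, as pmaddubsw does for each sum."""
    return max(-32768, min(32767, v))


def pmaddubsw_lane(u0, u1, s0, s1):
    """One output word: u0, u1 are unsigned bytes, s0, s1 are read as signed bytes."""
    return sat_s16(u0 * to_s8(s0) + u1 * to_s8(s1))


# 7-bit factors (96 + 32 = 128): both fit in a signed byte, the sum is correct.
print(pmaddubsw_lane(200, 100, 96, 32))

# 8-bit factors (192 + 64 = 256): 192 is read as -64, so the sum is wrong.
print(pmaddubsw_lane(200, 100, 192, 64))

# Largest pixels with 7-bit factors summing to 128: still below 32767.
print(pmaddubsw_lane(255, 255, 127, 1))
```

With factors that sum to 128, the largest possible sum is 255 * 128 = 32640, so the signed saturation never triggers in this model.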

So my loop would look like this:

    .loop:
        movu            m0, [src1q + xq]
        movu            m1, [src2q + xq]
        SBUTTERFLY     %1%2, 0, 1, 5        ; aAbBcCdD
                                            ; eEfFgGhH
        pmadd%3         m0, m3
        pmadd%3         m1, m3

        padd%2          m0, m7
        padd%2          m1, m7
        psrl%2          m0, %4              ; 0A0B0C0D
        psrl%2          m1, %4              ; 0E0F0G0H

        packus%2%1      m0, m1              ; ABCDEFGH
        movu   [dstq + xq], m0
        add             xq, mmsize
    jl .loop

Is this what you had in mind?
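
For reference, a scalar sketch of what the loop above is meant to compute per pixel in the 8-bit case. It assumes that m3 holds the two factors interleaved as bytes, that m7 holds the rounding constant in every word, and that the shift (%4) is 7; blend_row_8bit and its default arguments are hypothetical names and values for this illustration, not part of the patch.

```python
def blend_row_8bit(src1, src2, factor1, factor2, half=64, shift=7):
    """Scalar model of the SIMD loop for 8-bit pixels and 7-bit factors."""
    out = []
    for a, b in zip(src1, src2):
        # SBUTTERFLY interleaves a and b; pmaddubsw then forms a*f1 + b*f2.
        acc = a * factor1 + b * factor2
        # paddw adds the rounding constant and wraps at 16 bits.
        acc = (acc + half) & 0xFFFF
        # psrlw is a logical right shift.
        acc >>= shift
        # packuswb saturates to 0..255 (acc is never negative here).
        out.append(min(acc, 255))
    return out


print(blend_row_8bit([200, 0, 255], [100, 255, 255], 96, 32))
```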

+    pinsrw    xm3, r8m, 0                   ; factor1
+    pinsrw    xm4, r9m, 0                   ; factor2
+    pinsrw    xm7, r10m, 0                  ; half
+    SPLATW     m3, xm3
+    SPLATW     m4, xm4
+    SPLATW     m7, xm7

vpbroadcast* from memory on avx2, otherwise movd instead of pxor+pinsrw.

+    pxor       m3, m3
+    pxor       m4, m4
+    pxor       m7, m7
+    pinsrw    xm3, r8m, 0                   ; factor1
+    pinsrw    xm4, r9m, 0                   ; factor2
+    pinsrw    xm7, r10m, 0                  ; half
+    XSPLATD       3
+    XSPLATD       4
+    XSPLATD       7

Ditto.

+    neg word r11m                           ; shift = -shift
+    add word r11m, 16                       ; shift += 16
+    pxor       m2, m2
+    pinsrw    xm2, r11m, 0                  ; 16 - shift
+    pslld      m3, xm2
+    pslld      m4, xm2
+    pslld      m7, xm2

You probably want to use a temporary register instead of doing slow
load-modify-store instructions.

OK, I will rework these, although these parts are only initialization code, so I guess they are not performance critical.

Thanks,
Marton