On Wed, 28 Aug 2024 13:30:02 +0200 Niklas Haas <ffm...@haasn.xyz> wrote: > On Tue, 27 Aug 2024 21:47:59 +0300 Rémi Denis-Courmont <r...@remlab.net> > wrote: > > Le 27 août 2024 17:12:03 GMT+03:00, Niklas Haas <ffm...@haasn.xyz> a écrit : > > >> > + .irp x, \vregs > > >> > + vmax.vx \x, \x, zero > > >> > + .endr > > >> > + vsetvli zero, zero, e8, \lmul, ta, ma > > >> > + .irp x, \vregs > > >> > + vnclipu.wi \x, \x, \shifti > > >> > + .endr > > >> > +.endm > > >> > + > > >> > +.macro lowpass_init lmul, sizei, size, w0, w1, backup > > >> > > >> This is needlessly convoluted. In fact, backup is not even used, which > > >> kind > > >> of highlights the point. > > > > > >That parameter was simply left over from a previous version of the code. > > > > That would not have happened if this was not a macro. > > > > >Are you suggesting we simply duplicate the contents of this macro into all > > >of > > >thefunctions that use it? > > > > What are you implying here? Can you point to any other .S file from Arm, > > Aarch64, LoongArch or RV that does this? > > > > This macro can only realistically be used once per function - at the > > beginning. Do you typically make macros for declaring and initialising > > local > > variables in other languages? Because I don't and I don't know anybody else > > that does. > > > > And to make things worse, it has a conditional. TBH, this patch is > > unreviewable to me. It's simply too hard to read because of excess macro > > usage > > and excess macro parameter on top. > > Changed. > > > > > >> > + vsetivli zero, \sizei, e8, \lmul, ta, ma > > >> > + csrwi vxrm, 0 > > >> > + li \size, \sizei > > >> > + .ifnb \w0 > > >> > + li \w0, 20 > > >> > + li \w1, -5 > > >> > + .endif > > >> > +.endm > > >> > + > > >> > + /* output is unclipped; clobbers v26-v31 plus \tmp and \tmp2 > > >> > */ > > >> > +.macro lowpass_h vdst, src, w0, w1, tmp=t3, tmp2=t4 > > >> > + addi \tmp, \src, 3 > > >> > + lbu \tmp2, 2(\src) > > >> > + vle8.v v31, (\tmp) > > >> > + lbu \tmp, 1(\src) > > >> > + vslide1up.vx v30, v31, \tmp2 > > >> > + lbu \tmp2, 0(\src) > > >> > + vslide1up.vx v29, v30, \tmp > > >> > + lbu \tmp, -1(\src) > > >> > + vslide1up.vx v28, v29, \tmp2 > > >> > + lbu \tmp2, -2(\src) > > >> > + vslide1up.vx v27, v28, \tmp > > >> > + vslide1up.vx v26, v27, \tmp2 > > >> > > >> That's a lot of sequentially dependent vector instructions to save zero- > > >> extending v31 before the MACs. Are you sure it's faster that way? > > > > > >I'm not sure what you mean. How would your alternative implementation look? > > >It's certainly possible to make these instructions less sequential by > > >emitting multiple `lbu` instructions instead of sliding up. > > > > Slides are actually quite slow, but they're unavoidable here. The point is > > that you wouldn't need v26 up-front if you zero-extended v31 first. And > > then > > you would be able to interleave non-dependent instructions. > > > > That doesn't affect the number of slides and scalar loads. > > Right, the reason I did this is that there's afaict no instruction for a > "widening accumulate", that is, no equivalent to vwmaccu that doesn't take an > extra scalar multiplicand. So the alternative here requires an extra scalar > instruction and multiplication. I'll bench it and get back to you.
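For context, the accumulation being discussed computes the usual H.264 6-tap half-pel filter, (A + F) + 20*(C + D) - 5*(B + E), with the six taps sitting in v26..v31 after the slides and \w0 = 20, \w1 = -5 from lowpass_init (the rounding shift by 5 and the clip to [0,255] happen later in vnclipsu). Annotated for reference, code as in the patch, comments mine:

        vwaddu.vv   \vdst, v26, v31     # vdst  = src[-2] + src[3]    (A + F)
        vwmaccu.vx  \vdst, \w0, v28     # vdst += 20 * src[0]         (+ 20*C)
        vwmaccu.vx  \vdst, \w0, v29     # vdst += 20 * src[1]         (+ 20*D)
        vwmaccsu.vx \vdst, \w1, v27     # vdst += -5 * src[-1]        (- 5*B)
        vwmaccsu.vx \vdst, \w1, v30     # vdst += -5 * src[2]         (- 5*E)

The alternative measured below zero-extends v31 up front instead, so that v26 is not needed at the start of the chain, and folds v26 in afterwards with a separate vwmaccu (the extra scalar instruction and multiply mentioned above). Results: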
Single vwaddu: put_h264_qpel_16_mc20_8_rvv_i32: 420.7 ( 4.29x) Zero extend + separate vwmaccu: put_h264_qpel_16_mc20_8_rvv_i32: 433.9 ( 4.17x) So the hit from having a dependent vector instruction is not worth the loss due to an extra accumulate instruction to deal with v26. > > > > > >> > > >> > + vwaddu.vv \vdst, v26, v31 > > >> > + vwmaccu.vx \vdst, \w0, v28 > > >> > + vwmaccu.vx \vdst, \w0, v29 > > >> > + vwmaccsu.vx \vdst, \w1, v27 > > >> > + vwmaccsu.vx \vdst, \w1, v30 > > >> > +.endm > > >> > + > > >> > + /* output is unclipped */ > > >> > +.macro lowpass_v w0, w1, vdst, vsrc0, vsrc1, vsrc2, vsrc3, > > vsrc4, > > >> > vsrc5, signed=0 > > >> > + .if \signed > > >> > + vwadd.vv \vdst, \vsrc0, \vsrc5 > > >> > + vwmacc.vx \vdst, \w0, \vsrc2 > > >> > + vwmacc.vx \vdst, \w0, \vsrc3 > > >> > + vwmacc.vx \vdst, \w1, \vsrc1 > > >> > + vwmacc.vx \vdst, \w1, \vsrc4 > > >> > + .else > > >> > + vwaddu.vv \vdst, \vsrc0, \vsrc5 > > >> > + vwmaccu.vx \vdst, \w0, \vsrc2 > > >> > + vwmaccu.vx \vdst, \w0, \vsrc3 > > >> > + vwmaccsu.vx \vdst, \w1, \vsrc1 > > >> > + vwmaccsu.vx \vdst, \w1, \vsrc4 > > >> > + .endif > > >> > +.endm > > >> > + > > >> > +.macro qpel_mc00 op, dst, src, stride, size > > >> > +func ff_\op\()_h264_qpel_pixels, zve32x > > >> > +1: > > >> > + add t0, \stride, \src > > >> > + add t1, \stride, t0 > > >> > + add t2, \stride, t1 > > >> > + vle8.v v0, (\src) > > >> > + vle8.v v1, (t0) > > >> > + vle8.v v2, (t1) > > >> > + vle8.v v3, (t2) > > >> > + addi \size, \size, -4 > > >> > + add \src, \stride, t2 > > >> > + add t0, \stride, \dst > > >> > + add t1, \stride, t0 > > >> > + add t2, \stride, t1 > > >> > + .ifc \op, avg > > >> > + vle8.v v4, (\dst) > > >> > + vle8.v v5, (t0) > > >> > + vle8.v v6, (t1) > > >> > + vle8.v v7, (t2) > > >> > + vaaddu.vv v0, v0, v4 > > >> > + vaaddu.vv v1, v1, v5 > > >> > + vaaddu.vv v2, v2, v6 > > >> > + vaaddu.vv v3, v3, v7 > > >> > + .endif > > >> > + vse8.v v0, (\dst) > > >> > + vse8.v v1, (t0) > > >> > + vse8.v v2, (t1) > > >> > + vse8.v v3, (t2) > > >> > + add \dst, \stride, t2 > > >> > + bnez \size, 1b > > >> > + ret > > >> > +endfunc > > >> > +.endm > > >> > + > > >> > + qpel_mc00 put, a0, a1, a2, a4 > > >> > + qpel_mc00 avg, a0, a1, a2, a4 > > >> > > >> Please don't add constant macro parameters. > > > > > >Why? > > > > It makes the code prohibitively difficult to read, review and revector. > > Changed. > > > > > > It makes the code much easier to modify, > > > > The opposite actually. And that's not just me. From a quick look, Arm, > > Aarch64 > > and LoongArch assembler is also not doing that. > > > > Thing is, those parameter are *not* variables, they are *registers*. You > > need > > to know which register of which type is used where, and, in the case of > > vectors, what the number alignment is. That is vastly more relevant than > > what > > value a register represents whilst reviewing amnd *also* if revectoring. > > Besides you can always comment what value is where. You can't reasonably > > comment what register a macro parameter is. > > > > And then constant arguments hide the commonality of the code, leading to > > unnecessary duplication. We've had it happen already (VP8 IIRC). > > > > > and arguably also to understand. > > > > To be fair, I also thought that way when I started doing outline assembler > > a > > decade and a half ago. But like everyone else in FFmpeg, x264, dav1d, > > Linux, > > etc, I grew out of that nonsense. > > > > >This design was certainly invaluable during the development process. 
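(For concreteness: with op=put and the constant arguments dst=a0, src=a1, stride=a2, size=a4 baked in, qpel_mc00 above expands to roughly the following; hand-expanded here for illustration, comments added. The avg flavour only adds the four vle8.v/vaaddu.vv pairs guarded by ".ifc \op, avg".)

        func ff_put_h264_qpel_pixels, zve32x
        1:
                add     t0, a2, a1      # t0..t2 = src + 1/2/3 rows (a2 = stride)
                add     t1, a2, t0
                add     t2, a2, t1
                vle8.v  v0, (a1)        # load 4 source rows
                vle8.v  v1, (t0)
                vle8.v  v2, (t1)
                vle8.v  v3, (t2)
                addi    a4, a4, -4      # a4 = rows left
                add     a1, a2, t2      # advance src by 4 rows
                add     t0, a2, a0      # t0..t2 = dst + 1/2/3 rows
                add     t1, a2, t0
                add     t2, a2, t1
                vse8.v  v0, (a0)        # store 4 rows
                vse8.v  v1, (t0)
                vse8.v  v2, (t1)
                vse8.v  v3, (t2)
                add     a0, a2, t2      # advance dst by 4 rows
                bnez    a4, 1b
                ret
        endfunc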
If you > > >prefer, we could "bake" the result, but at the cost of future > > refactorability. > > > > > >Given the state of RISC-V hardware, I'd rather leave the code in a state > > >that > > >lends itself more towards future modifications. > > > > I disagree and it seems that all of the existing code RISC-ish ISA > > assembler > > in FFmpeg disagrees too... > > > > >> > + > > >> > +.macro qpel_lowpass op, ext, lmul, lmul2, dst, src, dst_stride, > > >> > src_stride, size, w0, w1, src2, src2_stride > > >> > +func > > >> > ff_\op\()_h264_qpel_h_lowpass_\lmul\ext, zve32x > > >> > +1: > > >> > + add t0, \src_stride, \src > > >> > + add t1, \src_stride, t0 > > >> > + add t2, \src_stride, t1 > > >> > + lowpass_h v0, \src, \w0, \w1 > > >> > + lowpass_h v2, t0, \w0, \w1 > > >> > + lowpass_h v4, t1, \w0, \w1 > > >> > + lowpass_h v6, t2, \w0, \w1 > > >> > + add \src, \src_stride, t2 > > >> > + addi \size, \size, -4 > > >> > + vnclipsu.wi 5, \lmul, \lmul2, v0, v2, v4, v6 > > >> > + .ifnb \src2 > > >> > + add t0, \src2_stride, \src2 > > >> > + add t1, \src2_stride, t0 > > >> > + add t2, \src2_stride, t1 > > >> > + vle8.v v8, (\src2) > > >> > + vle8.v v10, (t0) > > >> > + vle8.v v12, (t1) > > >> > + vle8.v v14, (t2) > > >> > + add \src2, \dst_stride, t2 > > >> > + vaaddu.vv v0, v0, v8 > > >> > + vaaddu.vv v2, v2, v10 > > >> > + vaaddu.vv v4, v4, v12 > > >> > + vaaddu.vv v6, v6, v14 > > >> > + .endif > > >> > + add t0, \dst_stride, \dst > > >> > + add t1, \dst_stride, t0 > > >> > + add t2, \dst_stride, t1 > > >> > + .ifc \op, avg > > >> > + vle8.v v1, (\dst) > > >> > + vle8.v v3, (t0) > > >> > + vle8.v v5, (t1) > > >> > + vle8.v v7, (t2) > > >> > + vaaddu.vv v0, v0, v1 > > >> > + vaaddu.vv v2, v2, v3 > > >> > + vaaddu.vv v4, v4, v5 > > >> > + vaaddu.vv v6, v6, v7 > > >> > + .endif > > >> > + vse8.v v0, (\dst) > > >> > + vse8.v v2, (t0) > > >> > + vse8.v v4, (t1) > > >> > + vse8.v v6, (t2) > > >> > + add \dst, \dst_stride, t2 > > >> > + bnez \size, 1b > > >> > + ret > > >> > +endfunc > > >> > + > > >> > +func ff_\op\()_h264_qpel_v_lowpass_\lmul\ext, zve32x > > >> > + sub t0, \src, \src_stride > > >> > + sub t1, t0, \src_stride > > >> > + vle8.v v2, (\src) > > >> > + vle8.v v1, (t0) > > >> > + vle8.v v0, (t1) > > >> > + add t0, \src, \src_stride > > >> > + add t1, t0, \src_stride > > >> > + add \src, t1, \src_stride > > >> > + vle8.v v3, (t0) > > >> > + vle8.v v4, (t1) > > >> > +1: > > >> > + add t0, \src_stride, \src > > >> > + add t1, \src_stride, t0 > > >> > + add t2, \src_stride, t1 > > >> > + vle8.v v5, (\src) > > >> > + vle8.v v6, (t0) > > >> > + vle8.v v7, (t1) > > >> > + vle8.v v8, (t2) > > >> > + add \src, \src_stride, t2 > > >> > + lowpass_v \w0, \w1, v24, v0, v1, v2, v3, v4, v5 > > >> > + lowpass_v \w0, \w1, v26, v1, v2, v3, v4, v5, v6 > > >> > + lowpass_v \w0, \w1, v28, v2, v3, v4, v5, v6, v7 > > >> > + lowpass_v \w0, \w1, v30, v3, v4, v5, v6, v7, v8 > > >> > + addi \size, \size, -4 > > >> > + vnclipsu.wi 5, \lmul, \lmul2, v24, v26, v28, v30 > > >> > + .ifnb \src2 > > >> > + add t0, \src2_stride, \src2 > > >> > + add t1, \src2_stride, t0 > > >> > + add t2, \src2_stride, t1 > > >> > + vle8.v v9, (\src2) > > >> > + vle8.v v10, (t0) > > >> > + vle8.v v11, (t1) > > >> > + vle8.v v12, (t2) > > >> > + add \src2, \src2_stride, t2 > > >> > + vaaddu.vv v24, v24, v9 > > >> > + vaaddu.vv v26, v26, v10 > > >> > + vaaddu.vv v28, v28, v11 > > >> > + vaaddu.vv v30, v30, v12 > > >> > + .endif > > >> > + add t0, \dst_stride, \dst > > >> > + add t1, \dst_stride, t0 > > >> > + add t2, \dst_stride, t1 > > >> > + 
.ifc \op, avg > > >> > + vle8.v v9, (\dst) > > >> > + vle8.v v10, (t0) > > >> > + vle8.v v11, (t1) > > >> > + vle8.v v12, (t2) > > >> > + vaaddu.vv v24, v24, v9 > > >> > + vaaddu.vv v26, v26, v10 > > >> > + vaaddu.vv v28, v28, v11 > > >> > + vaaddu.vv v30, v30, v12 > > >> > + .endif > > >> > + vse8.v v24, (\dst) > > >> > + vse8.v v26, (t0) > > >> > + vse8.v v28, (t1) > > >> > + vse8.v v30, (t2) > > >> > + add \dst, \dst_stride, t2 > > >> > + vmv.v.v v0, v4 > > >> > + vmv.v.v v1, v5 > > >> > + vmv.v.v v2, v6 > > >> > + vmv.v.v v3, v7 > > >> > + vmv.v.v v4, v8 > > >> > > >> At this point, any vector move without rationale is an automatic -1 from > > >> me. > > > > > >There is a rationale; > > > > I can't see any rationale in the comments or description. > > > > >the vectors are reused for the next pass of the (unrolled) vertical > > >convolution. The only way to eliminate them would be to > > >make a special path for 8x8 that urolls all 8 lines to avoid this vector > > >move, > > > > Typically you only need to unroll 2x to eliminate vector copies. And it's > > not > > ONE vector copy, it's FOUR vector copies. Without actual numbers, I don't > > trust that the performance loss is negligible. > > How do you implement a vertical convolution without either redundant loads or > vector moves? > > > > > >but IMO the gain in performance does not justify the increase in complexity > > >and binary size. > > > > >> > > >> > + bnez \size, 1b > > >> > + ret > > >> > +endfunc > > >> > + > > >> > +func ff_\op\()_h264_qpel_hv_lowpass_\lmul\ext, zve32x > > >> > + sub t0, \src, \src_stride > > >> > + sub t1, t0, \src_stride > > >> > + lowpass_h v4, \src, \w0, \w1 > > >> > + lowpass_h v2, t0, \w0, \w1 > > >> > + lowpass_h v0, t1, \w0, \w1 > > >> > + add t0, \src, \src_stride > > >> > + add t1, t0, \src_stride > > >> > + add \src, t1, \src_stride > > >> > + lowpass_h v6, t0, \w0, \w1 > > >> > + lowpass_h v8, t1, \w0, \w1 > > >> > +1: > > >> > + add t0, \src_stride, \src > > >> > + add t1, \src_stride, t0 > > >> > + add t2, \src_stride, t1 > > >> > + lowpass_h v10, \src, \w0, \w1 > > >> > + lowpass_h v12, t0, \w0, \w1 > > >> > + lowpass_h v14, t1, \w0, \w1 > > >> > + lowpass_h v16, t2, \w0, \w1 > > >> > + vsetvli zero, zero, e16, \lmul2, ta, ma > > >> > + addi \size, \size, -4 > > >> > + lowpass_v \w0, \w1, v20, v0, v2, v4, v6, v8, v10, > > >> > signed=1 > > >> > + lowpass_v \w0, \w1, v24, v2, v4, v6, v8, v10, > > >> > v12, signed=1 > > >> > + lowpass_v \w0, \w1, v28, v4, v6, v8, v10, > > >> > v12, v14, signed=1 > > >> > + vnclip.wi v0, v20, 10 > > >> > + lowpass_v \w0, \w1, v20, v6, v8, v10, v12, v14, v16, > > >> > signed= > > >> > + vnclip.wi v2, v24, 10 > > >> > + vnclip.wi v4, v28, 10 > > >> > + vnclip.wi v6, v20, 10 > > >> > + vmax.vx v18, v0, zero > > >> > + vmax.vx v20, v2, zero > > >> > + vmax.vx v22, v4, zero > > >> > + vmax.vx v24, v6, zero > > >> > + vmv.v.v v0, v8 > > >> > + vmv.v.v v2, v10 > > >> > + vmv.v.v v4, v12 > > >> > + vmv.v.v v6, v14 > > >> > + vmv.v.v v8, v16 > > >> > + add \src, \src_stride, t2 > > >> > + vsetvli zero, zero, e8, \lmul, ta, ma > > >> > + vnclipu.wi v18, v18, 0 > > >> > + vnclipu.wi v20, v20, 0 > > >> > + vnclipu.wi v22, v22, 0 > > >> > + vnclipu.wi v24, v24, 0 > > >> > + .ifnb \src2 > > >> > + add t0, \src2_stride, \src2 > > >> > + add t1, \src2_stride, t0 > > >> > + add t2, \src2_stride, t1 > > >> > + vle8.v v26, (\src2) > > >> > + vle8.v v27, (t0) > > >> > + vle8.v v28, (t1) > > >> > + vle8.v v29, (t2) > > >> > + add \src2, \src2_stride, t2 > > >> > + vaaddu.vv v18, v18, 
v26 > > >> > + vaaddu.vv v20, v20, v27 > > >> > + vaaddu.vv v22, v22, v28 > > >> > + vaaddu.vv v24, v24, v29 > > >> > + .endif > > >> > + add t0, \dst_stride, \dst > > >> > + add t1, \dst_stride, t0 > > >> > + add t2, \dst_stride, t1 > > >> > + .ifc \op, avg > > >> > + vle8.v v26, (\dst) > > >> > + vle8.v v27, (t0) > > >> > + vle8.v v28, (t1) > > >> > + vle8.v v29, (t2) > > >> > + vaaddu.vv v18, v18, v26 > > >> > + vaaddu.vv v20, v20, v27 > > >> > + vaaddu.vv v22, v22, v28 > > >> > + vaaddu.vv v24, v24, v29 > > >> > + .endif > > >> > + vse8.v v18, (\dst) > > >> > + vse8.v v20, (t0) > > >> > + vse8.v v22, (t1) > > >> > + vse8.v v24, (t2) > > >> > + add \dst, \dst_stride, t2 > > >> > + bnez \size, 1b > > >> > + ret > > >> > +endfunc > > >> > +.endm > > >> > + > > >> > +/* Note: We could possibly specialize for the width 8 / width 4 cases > > >> > by > > >> > + loading 32 bit integers, but this makes the convolutions more > > >> > complicated + to implement, so it's not necessarily any faster. */ > > >> > + > > >> > +.macro h264_qpel lmul, lmul2 > > >> > + qpel_lowpass put, , \lmul, \lmul2, a0, a1, a2, a3, a4, > > t5, > > >> > t6 > > >> > + qpel_lowpass put, _l2, \lmul, \lmul2, a0, a1, a2, a3, a4, > > >> > t5, t6, a5, a6 > > >> > + qpel_lowpass avg, , \lmul, \lmul2, a0, a1, > > >> > a2, a3, a4, t5, t6 > > >> > + qpel_lowpass avg, _l2, \lmul, \lmul2, a0, > > >> > a1, a2, a3, a4, t5, t6, a5, a6 > > >> > +.endm > > >> > + > > >> > + h264_qpel m1, m2 > > >> > + h264_qpel mf2, m1 > > >> > + h264_qpel mf4, mf2 > > >> > + h264_qpel mf8, mf4 > > >> > + > > >> > +.macro ff_h264_qpel_fns op, lmul, sizei, ext=rvv, dst, src, > > >> > dst_stride, > > >> > src_stride, size, w0, w1, src2, src2_stride, tmp > > >> > +func > > >> > ff_\op\()_h264_qpel\sizei\()_mc00_\ext, zve32x > > >> > + lowpass_init \lmul, \sizei, \size, > > >> > + j ff_\op\()_h264_qpel_pixels > > >> > +endfunc > > >> > + > > >> > +func ff_\op\()_h264_qpel\sizei\()_mc10_\ext, zve32x > > >> > + lowpass_init \lmul, \sizei, \size, \w0, \w1 > > >> > + mv \src_stride, \dst_stride > > >> > + mv \src2, \src > > >> > + mv \src2_stride, \src_stride > > >> > + j ff_\op\()_h264_qpel_h_lowpass_\lmul\()_l2 > > >> > +endfunc > > >> > + > > >> > +func ff_\op\()_h264_qpel\sizei\()_mc20_\ext, zve32x > > >> > + lowpass_init \lmul, \sizei, \size, \w0, \w1 > > >> > + mv \src_stride, \dst_stride > > >> > + j ff_\op\()_h264_qpel_h_lowpass_\lmul\() > > >> > +endfunc > > >> > + > > >> > +func ff_\op\()_h264_qpel\sizei\()_mc30_\ext, zve32x > > >> > + lowpass_init \lmul, \sizei, \size, \w0, \w1 > > >> > + mv \src_stride, \dst_stride > > >> > + addi \src2, \src, 1 > > >> > + mv \src2_stride, \src_stride > > >> > + j ff_\op\()_h264_qpel_h_lowpass_\lmul\()_l2 > > >> > +endfunc > > >> > + > > >> > +func ff_\op\()_h264_qpel\sizei\()_mc01_\ext, zve32x > > >> > + lowpass_init \lmul, \sizei, \size, \w0, \w1 > > >> > + mv \src_stride, \dst_stride > > >> > + mv \src2, \src > > >> > + mv \src2_stride, \src_stride > > >> > + j ff_\op\()_h264_qpel_v_lowpass_\lmul\()_l2 > > >> > +endfunc > > >> > + > > >> > +func ff_\op\()_h264_qpel\sizei\()_mc02_\ext, zve32x > > >> > + lowpass_init \lmul, \sizei, \size, \w0, \w1 > > >> > + mv \src_stride, \dst_stride > > >> > + j ff_\op\()_h264_qpel_v_lowpass_\lmul > > >> > +endfunc > > >> > + > > >> > +func ff_\op\()_h264_qpel\sizei\()_mc03_\ext, zve32x > > >> > + lowpass_init \lmul, \sizei, \size, \w0, \w1 > > >> > + mv \src_stride, \dst_stride > > >> > + add \src2, \src, \src_stride > > >> > + mv \src2_stride, \src_stride > > >> > + j 
ff_\op\()_h264_qpel_v_lowpass_\lmul\()_l2 > > >> > +endfunc > > >> > + > > >> > +func ff_\op\()_h264_qpel\sizei\()_mc11_\ext, zve32x > > >> > + lowpass_init \lmul, \sizei, \size, \w0, \w1 > > >> > + push \dst, \src > > >> > > >> It's all but impossible to tell if spilling is actually necessary when > > >> you > > >> alias registers like this. > > >> > > >> > + mv \tmp, ra > > >> > > >> Use t0 for subprocedure return. See specs. > > > > > >The subprocedure is sometimes the main procedure. > > > > Sure does not seem that way, but again, the code is so damn hard to follow. > > > > >And in any case, we use t0 > > >inside the subprocedure. > > > > Then fix it. > > > > > > > >> > > >> > + mv \src_stride, \dst_stride > > >> > + addi \dst, sp, -(\sizei * \sizei) > > >> > + li \dst_stride, \sizei > > >> > + call ff_put_h264_qpel_h_lowpass_\lmul > > >> > > >> You can use jal here > > > > > >Shouldn't the assembler be responsible for inserting the correct procedure > > >call instruction? > > > > Doesn't work here (GNU as 2.43.1). > > > > >> > + addi \src2, sp, -(\sizei * \sizei) > > >> > + mv \src2_stride, \dst_stride > > >> > + pop \dst, \src > > >> > + mv \dst_stride, \src_stride > > >> > + li \size, \sizei > > >> > + mv ra, \tmp > > >> > + j ff_\op\()_h264_qpel_v_lowpass_\lmul\()_l2 > > >> > +endfunc > > >> > + > > >> > +func ff_\op\()_h264_qpel\sizei\()_mc31_\ext, zve32x > > >> > + lowpass_init \lmul, \sizei, \size, \w0, \w1 > > >> > + push \dst, \src > > >> > + mv \tmp, ra > > >> > + mv \src_stride, \dst_stride > > >> > + addi \dst, sp, -(\sizei * \sizei) > > >> > + li \dst_stride, \sizei > > >> > + call ff_put_h264_qpel_h_lowpass_\lmul > > >> > + addi \src2, sp, -(\sizei * \sizei) > > >> > + mv \src2_stride, \dst_stride > > >> > + pop \dst, \src > > >> > + addi \src, \src, 1 > > >> > + mv \dst_stride, \src_stride > > >> > + li \size, \sizei > > >> > + mv ra, \tmp > > >> > + j ff_\op\()_h264_qpel_v_lowpass_\lmul\()_l2 > > >> > +endfunc > > >> > + > > >> > +func ff_\op\()_h264_qpel\sizei\()_mc13_\ext, zve32x > > >> > + lowpass_init \lmul, \sizei, \size, \w0, \w1 > > >> > + push \dst, \src > > >> > + mv \tmp, ra > > >> > + mv \src_stride, \dst_stride > > >> > + add \src, \src, \src_stride > > >> > + addi \dst, sp, -(\sizei * \sizei) > > >> > + li \dst_stride, \sizei > > >> > + call ff_put_h264_qpel_h_lowpass_\lmul > > >> > + addi \src2, sp, -(\sizei * \sizei) > > >> > + mv \src2_stride, \dst_stride > > >> > + pop \dst, \src > > >> > + mv \dst_stride, \src_stride > > >> > + li \size, \sizei > > >> > + mv ra, \tmp > > >> > + j ff_\op\()_h264_qpel_v_lowpass_\lmul\()_l2 > > >> > +endfunc > > >> > + > > >> > +func ff_\op\()_h264_qpel\sizei\()_mc33_\ext, zve32x > > >> > + lowpass_init \lmul, \sizei, \size, \w0, \w1 > > >> > + push \dst, \src > > >> > + mv \tmp, ra > > >> > + mv \src_stride, \dst_stride > > >> > + add \src, \src, \src_stride > > >> > + addi \dst, sp, -(\sizei * \sizei) > > >> > + li \dst_stride, \sizei > > >> > + call ff_put_h264_qpel_h_lowpass_\lmul > > >> > + addi \src2, sp, -(\sizei * \sizei) > > >> > + mv \src2_stride, \dst_stride > > >> > + pop \dst, \src > > >> > + addi \src, \src, 1 > > >> > + mv \dst_stride, \src_stride > > >> > + li \size, \sizei > > >> > + mv ra, \tmp > > >> > + j ff_\op\()_h264_qpel_v_lowpass_\lmul\()_l2 > > >> > +endfunc > > >> > + > > >> > +func ff_\op\()_h264_qpel\sizei\()_mc22_\ext, zve32x > > >> > + lowpass_init \lmul, \sizei, \size, \w0, \w1 > > >> > + mv \src_stride, \dst_stride > > >> > + j ff_\op\()_h264_qpel_hv_lowpass_\lmul > > >> > +endfunc > > >> 
> +
> > >> > +func ff_\op\()_h264_qpel\sizei\()_mc21_\ext, zve32x
> > >> > +        lowpass_init \lmul, \sizei, \size, \w0, \w1
> > >> > +        push            \dst, \src
> > >> > +        mv              \tmp, ra
> > >> > +        mv              \src_stride, \dst_stride
> > >> > +        addi            \dst, sp, -(\sizei * \sizei)
> > >> > +        li              \dst_stride, \sizei
> > >> > +        call            ff_put_h264_qpel_h_lowpass_\lmul
> > >> > +        addi            \src2, sp, -(\sizei * \sizei)
> > >> > +        mv              \src2_stride, \dst_stride
> > >> > +        pop             \dst, \src
> > >> > +        mv              \dst_stride, \src_stride
> > >> > +        li              \size, \sizei
> > >> > +        mv              ra, \tmp
> > >> > +        j               ff_\op\()_h264_qpel_hv_lowpass_\lmul\()_l2
> > >> > +endfunc
> > >> > +
> > >> > +func ff_\op\()_h264_qpel\sizei\()_mc23_\ext, zve32x
> > >> > +        lowpass_init \lmul, \sizei, \size, \w0, \w1
> > >> > +        push            \dst, \src
> > >> > +        mv              \tmp, ra
> > >> > +        mv              \src_stride, \dst_stride
> > >> > +        add             \src, \src, \src_stride
> > >> > +        addi            \dst, sp, -(\sizei * \sizei)
> > >> > +        li              \dst_stride, \sizei
> > >> > +        call            ff_put_h264_qpel_h_lowpass_\lmul
> > >> > +        addi            \src2, sp, -(\sizei * \sizei)
> > >> > +        mv              \src2_stride, \dst_stride
> > >> > +        pop             \dst, \src
> > >> > +        mv              \dst_stride, \src_stride
> > >> > +        li              \size, \sizei
> > >> > +        mv              ra, \tmp
> > >> > +        j               ff_\op\()_h264_qpel_hv_lowpass_\lmul\()_l2
> > >> > +endfunc
> > >> > +
> > >> > +func ff_\op\()_h264_qpel\sizei\()_mc12_\ext, zve32x
> > >> > +        lowpass_init \lmul, \sizei, \size, \w0, \w1
> > >> > +        push            \dst, \src
> > >> > +        mv              \tmp, ra
> > >> > +        mv              \src_stride, \dst_stride
> > >> > +        addi            \dst, sp, -(\sizei * \sizei)
> > >> > +        li              \dst_stride, \sizei
> > >> > +        call            ff_put_h264_qpel_v_lowpass_\lmul
> > >> > +        addi            \src2, sp, -(\sizei * \sizei)
> > >> > +        mv              \src2_stride, \dst_stride
> > >> > +        pop             \dst, \src
> > >> > +        mv              \dst_stride, \src_stride
> > >> > +        li              \size, \sizei
> > >> > +        mv              ra, \tmp
> > >> > +        j               ff_\op\()_h264_qpel_hv_lowpass_\lmul\()_l2
> > >> > +endfunc
> > >> > +
> > >> > +func ff_\op\()_h264_qpel\sizei\()_mc
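PS: for anyone trying to follow the half-pel dispatchers, they all share the shape of mc21/mc23/mc12 above: run one lowpass into a scratch buffer just below sp, then tail-call the other lowpass's _l2 variant, which averages its own result with that buffer. Spelled out for mc21 (register names are still the macro parameters; comments mine):

        func ff_\op\()_h264_qpel\sizei\()_mc21_\ext, zve32x
                lowpass_init \lmul, \sizei, \size, \w0, \w1
                push    \dst, \src                    # spill: both are clobbered before the 2nd pass
                mv      \tmp, ra                      # the call below overwrites ra
                mv      \src_stride, \dst_stride
                addi    \dst, sp, -(\sizei * \sizei)  # sizei x sizei scratch buffer below sp
                li      \dst_stride, \sizei
                call    ff_put_h264_qpel_h_lowpass_\lmul   # 1st pass: horizontal half-pel into scratch
                addi    \src2, sp, -(\sizei * \sizei) # 2nd pass averages against the scratch (src2)
                mv      \src2_stride, \dst_stride
                pop     \dst, \src                    # restore the real dst/src
                mv      \dst_stride, \src_stride
                li      \size, \sizei                 # the helper counted size down to 0
                mv      ra, \tmp                      # so the tail call returns to our caller
                j       ff_\op\()_h264_qpel_hv_lowpass_\lmul\()_l2   # 2nd pass: hv half-pel + average
        endfunc

mc23 only adds a one-row offset to \src before the first pass, and mc12 runs the vertical lowpass first instead of the horizontal one; the rest is identical.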