On Sat, May 20, 2023 at 1:12 AM Rémi Denis-Courmont <r...@remlab.net> wrote:
> > + li t4, 0
> > + li t2, 0
> > + addi a5, t3, 1
> > + slli t3, a2, 2
> > +.LBB0_3: # if (xy != 0)
> > + add a4, a1, t4
> > + vsetvli zero, a5, e8, m1, ta, ma
> > + addiw t2, t2, 4
> > + vle8.v v10, (a4)
> > + add a4, a4, a2
> > + vslidedown.vi v11, v10, 1
>
> Isn't vslide1down.vx zero potentially faster than vslidedown.vi 1?
>

It depends on the hardware design, but in general vslide1down.vx is
typically not slower than vslidedown.vi. Using vslide1down.vx would be
better here; I will fix it.

> > + vsetivli zero, 8, e8, m1, ta, ma
>
> Do we really need to reconfigure the active vector length so many times? I
> suspect that is not going to go down too well with some implementations.
>

We need to reconfigure it because the VL changes. The vle8.v load and the
vslidedown run with the VL taken from a5, while the other instructions
run with a VL of 8.

> > + vwmaccu.vx v12, t1, v15
> > + vwmaccu.vx v16, a7, v15
> > + vsetvli zero, a5, e8, m1, ta, ma
> > + vle8.v v14, (a4)
> > + vsetivli zero, 8, e8, m1, ta, ma
> > + add a4, a0, t4
> > + add t4, t4, t3
>
> I could be totally wrong since I have no hardware to verify with, but I
> would assume that it is preferable to interleave independent scalar and
> vector instructions whenever possible. For out-of-order processors, it
> shouldn't matter, but I suppose that it would on in-order multi-issue
> processors.
>

Interleaving those instructions can improve overall performance.
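
For reference, here is a small scalar model of the two slide instructions discussed above. It is only a sketch and not part of the patch. It follows my reading of the RVV 1.0 semantics for the unmasked case with vstart = 0. VLMAX = 16 (VLEN = 128, e8, m1) and the pixel values are assumptions chosen for illustration. It shows that with VL = 9 both instructions produce the same first 8 elements and differ only in the last one.

```python
VLMAX = 16  # assumption: VLEN = 128 bits, SEW = 8, LMUL = 1


def vslidedown_vi(vs2, offset, vl):
    """Model of vslidedown.vi: vd[i] = vs2[i + offset], or 0 past VLMAX."""
    return [vs2[i + offset] if i + offset < VLMAX else 0 for i in range(vl)]


def vslide1down_vx(vs2, x, vl):
    """Model of vslide1down.vx: shift down by one, scalar x lands in vd[vl-1]."""
    if vl == 0:
        return []
    return [vs2[i + 1] for i in range(vl - 1)] + [x & 0xFF]


# Nine source bytes as loaded with VL = 9; 0xAA stands in for arbitrary
# tail bytes (the tail is agnostic, so their value is not defined).
row = list(range(10, 19)) + [0xAA] * (VLMAX - 9)

down = vslidedown_vi(row, 1, 9)     # vslidedown.vi  v11, v10, 1
one_down = vslide1down_vx(row, 0, 9)  # vslide1down.vx v11, v10, zero

# The 8 elements consumed afterwards with VL = 8 are identical.
print(down[:8] == one_down[:8] == row[1:9])
```

Only element 8 differs: vslidedown.vi picks up whatever sits in source element 9, while vslide1down.vx writes the scalar (zero here). Neither value is used once the VL is set back to 8.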