LGTM, thanks!

On Wed, Sep 28, 2022 at 11:13 AM Martin Storsjö <mar...@martin.st> wrote:
> This avoids one redundant load per row; pix3 from the previous
> iteration can be used as pix2 in the next one.
>
> Before:           Cortex A53    A72    A73
> pix_abs_0_2_neon:      138.0   59.7   48.0
> After:
> pix_abs_0_2_neon:      109.7   50.2   39.5
>
> Signed-off-by: Martin Storsjö <mar...@martin.st>
> ---
>  libavcodec/aarch64/me_cmp_neon.S | 24 ++++++++++--------------
>  1 file changed, 10 insertions(+), 14 deletions(-)
>
> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index 11af4849f9..832a7cb22d 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -326,9 +326,9 @@ function ff_pix_abs16_y2_neon, export=1
>          // w4           int h
>
>          // initialize buffers
> +        ld1             {v1.16b}, [x2], x3              // Load pix2
>          movi            v29.8h, #0                      // clear the accumulator
>          movi            v28.8h, #0                      // clear the accumulator
> -        add             x5, x2, x3                      // pix2 + stride
>          cmp             w4, #4
>          b.lt            2f
>
> @@ -339,29 +339,25 @@ function ff_pix_abs16_y2_neon, export=1
>          // avg2(a, b) = (((a) + (b) + 1) >> 1)
>          // abs(x) = (x < 0 ? (-x) : (x))
>
> -        ld1             {v1.16b}, [x2], x3              // Load pix2 for first iteration
> -        ld1             {v2.16b}, [x5], x3              // Load pix3 for first iteration
> +        ld1             {v2.16b}, [x2], x3              // Load pix3 for first iteration
>          ld1             {v0.16b}, [x1], x3              // Load pix1 for first iteration
>          urhadd          v30.16b, v1.16b, v2.16b         // Rounding halving add, first iteration
> -        ld1             {v4.16b}, [x2], x3              // Load pix2 for second iteration
> -        ld1             {v5.16b}, [x5], x3              // Load pix3 for second iteartion
> +        ld1             {v5.16b}, [x2], x3              // Load pix3 for second iteartion
>          uabal           v29.8h, v0.8b, v30.8b           // Absolute difference of lower half, first iteration
>          uabal2          v28.8h, v0.16b, v30.16b         // Absolute difference of upper half, first iteration
>          ld1             {v3.16b}, [x1], x3              // Load pix1 for second iteration
> -        urhadd          v27.16b, v4.16b, v5.16b         // Rounding halving add, second iteration
> -        ld1             {v7.16b}, [x2], x3              // Load pix2 for third iteration
> -        ld1             {v20.16b}, [x5], x3             // Load pix3 for third iteration
> +        urhadd          v27.16b, v2.16b, v5.16b         // Rounding halving add, second iteration
> +        ld1             {v20.16b}, [x2], x3             // Load pix3 for third iteration
>          uabal           v29.8h, v3.8b, v27.8b           // Absolute difference of lower half for second iteration
>          uabal2          v28.8h, v3.16b, v27.16b         // Absolute difference of upper half for second iteration
>          ld1             {v6.16b}, [x1], x3              // Load pix1 for third iteration
> -        urhadd          v26.16b, v7.16b, v20.16b        // Rounding halving add, third iteration
> -        ld1             {v22.16b}, [x2], x3             // Load pix2 for fourth iteration
> -        ld1             {v23.16b}, [x5], x3             // Load pix3 for fourth iteration
> +        urhadd          v26.16b, v5.16b, v20.16b        // Rounding halving add, third iteration
> +        ld1             {v1.16b}, [x2], x3              // Load pix3 for fourth iteration
>          uabal           v29.8h, v6.8b, v26.8b           // Absolute difference of lower half for third iteration
>          uabal2          v28.8h, v6.16b, v26.16b         // Absolute difference of upper half for third iteration
>          ld1             {v21.16b}, [x1], x3             // Load pix1 for fourth iteration
>          sub             w4, w4, #4                      // h-= 4
> -        urhadd          v25.16b, v22.16b, v23.16b       // Rounding halving add
> +        urhadd          v25.16b, v20.16b, v1.16b        // Rounding halving add
>          cmp             w4, #4
>          uabal           v29.8h, v21.8b, v25.8b          // Absolute difference of lower half for fourth iteration
>          uabal2          v28.8h, v21.16b, v25.16b        // Absolute difference of upper half for fourth iteration
> @@ -372,11 +368,11 @@ function ff_pix_abs16_y2_neon, export=1
>          // iterate by one
>  2:
>
> -        ld1             {v1.16b}, [x2], x3              // Load pix2
> -        ld1             {v2.16b}, [x5], x3              // Load pix3
> +        ld1             {v2.16b}, [x2], x3              // Load pix3
>          subs            w4, w4, #1
>          ld1             {v0.16b}, [x1], x3              // Load pix1
>          urhadd          v30.16b, v1.16b, v2.16b         // Rounding halving add
> +        mov             v1.16b, v2.16b                  // Shift pix3->pix2
>          uabal           v29.8h, v30.8b, v0.8b
>          uabal2          v28.8h, v30.16b, v0.16b
>
> --
> 2.25.1
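For anyone reading the thread without the assembly in front of them, here is a rough scalar C sketch of the idea in the commit message: the row loaded as pix3 in one iteration is kept and used as pix2 in the next. The function names (pix_abs16_y2_naive, pix_abs16_y2_reuse) and the fixed 16-byte row buffers are illustrative assumptions, not the FFmpeg C reference or the NEON routine itself; only avg2() is taken from the comment in the patch.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Rounding average, as described by the avg2() comment in the patch. */
static inline int avg2(int a, int b) { return (a + b + 1) >> 1; }

/* Hypothetical "before" model: every row copies both pix2 and pix3. */
static int pix_abs16_y2_naive(const uint8_t *pix1, const uint8_t *pix2,
                              ptrdiff_t stride, int h)
{
    uint8_t row2[16], row3[16];
    int sum = 0;
    for (int y = 0; y < h; y++) {
        memcpy(row2, pix2 + y * stride, 16);        /* load pix2 */
        memcpy(row3, pix2 + (y + 1) * stride, 16);  /* load pix3 */
        for (int x = 0; x < 16; x++)
            sum += abs(pix1[y * stride + x] - avg2(row2[x], row3[x]));
    }
    return sum;
}

/* Hypothetical "after" model: pix2 is loaded once up front, then each
 * row loads only pix3 and carries it over as the next row's pix2. */
static int pix_abs16_y2_reuse(const uint8_t *pix1, const uint8_t *pix2,
                              ptrdiff_t stride, int h)
{
    uint8_t row2[16], row3[16];
    int sum = 0;
    memcpy(row2, pix2, 16);                         /* initial pix2 load */
    for (int y = 0; y < h; y++) {
        memcpy(row3, pix2 + (y + 1) * stride, 16);  /* the only load per row */
        for (int x = 0; x < 16; x++)
            sum += abs(pix1[y * stride + x] - avg2(row2[x], row3[x]));
        memcpy(row2, row3, 16);                     /* shift pix3 -> pix2 */
    }
    return sum;
}
```

Both functions read h + 1 rows of pix2 and should return the same sum for the same inputs. In the unrolled NEON loop the copy at the end of each row disappears because the registers are simply renamed between iterations; only the one-row tail loop needs the explicit mov.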