Hi Clément, please find attached the updated patch addressing all your comments. Let me know if there is anything else that I missed and that I need to address.
Thanks, Sebastian On Sun, Dec 1, 2019 at 3:01 PM Martin Storsjö <mar...@martin.st> wrote: > > On Sun, 1 Dec 2019, Clément Bœsch wrote: > > > On Wed, Nov 27, 2019 at 12:30:35PM -0600, Sebastian Pop wrote: > > [...] > >> From 9ecaa99fab4b8bedf3884344774162636eaa5389 Mon Sep 17 00:00:00 2001 > >> From: Sebastian Pop <s...@amazon.com> > >> Date: Sun, 17 Nov 2019 14:13:13 -0600 > >> Subject: [PATCH] [aarch64] use FMA and increase vector factor to 4 > >> > >> This patch implements ff_hscale_8_to_15_neon with NEON fused multiply > >> accumulate > >> and bumps the vectorization factor from 2 to 4. > >> The speedup is of 34% on Graviton A1 instances based on A-72 cpus: > >> > >> $ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf > >> bench=start,scale=1024x1024,bench=stop -f null - > >> before: t:0.040303 avg:0.040287 max:0.040371 min:0.039214 > >> after: t:0.030079 avg:0.030102 max:0.030462 min:0.030051 > >> > >> Tested with `make check` on aarch64-linux. > > FWIW, I'm not certain how much this routine actually is tested by that - > in particular, there's no checkasm test for it as far as I can see. > > >> + add x17, x3, x18 // srcp + > >> filterPos[0] > >> + add x18, x3, x0 // srcp + > >> filterPos[1] > >> + add x0, x3, x2 // srcp + > >> filterPos[2] > >> + add x2, x3, x6 // srcp + > >> filterPos[3] > > > >> +2: ldr d4, [x17, x15] // > >> srcp[filterPos[0] + {0..7}] > >> + ldr q5, [x16] // load 8x16-bit > >> filter values, part 1 > >> + ldr d6, [x18, x15] // > >> srcp[filterPos[1] + {0..7}] > >> + ldr q7, [x16, x12] // load 8x16-bit > >> at filter+filterSize > > > > Why not use ld1 {v4.8B} etc like it was before? The use of Dn/Qn in is > > very confusing here. > > The ldr instruction, instead of ld1, allows you to to do a load (or store, > similarly, for str instead of st1) with a constant/register offset, like > [x17, x15] here, without incrementing the source register inbetween for > each load (which can help with latency between individual load > instructions, or can avoid extra instructions for incrementing the > register inbetween). > > That works for loading the first 1/2/4/8/16 bytes of a vector, but can't > be used e.g. for loading a lane other than the first (e.g. ld1 {v4.s}[1]). > But it does require using the b/h/s/d/q names for the registers instead of > v. > > I didn't check the changes here if they're essential for the optimization > though. > > // Martin > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > https://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > To unsubscribe, visit link above, or email > ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
0001-aarch64-use-FMA-and-increase-vector-factor-to-4.patch
Description: Binary data
_______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".