On Thu, Aug 22, 2024 at 1:29 PM Ramiro Polla <ramiro.po...@gmail.com> wrote:
> On Wed, Aug 21, 2024 at 9:44 PM Martin Storsjö <mar...@martin.st> wrote:
> > On Wed, 21 Aug 2024, Ramiro Polla wrote:
> > >> BTW, this instruction is kinda exotic and the docs aren't super clear, so
> > >> it'd be good to test manually that it really does what we want, for
> > >> negative numbers and numbers close to the ends of the value range; I
> > >> didn't do that manually yet.
> > >
> > > I prefer just sticking to sxtw + lsl then. When we move to ptrdiff_t
> > > the sxtw will be gone anyway.
> >
> > This sounds like a very reasonable choice indeed, especially if it's
> > somewhat plausible that we'll get rid of it at some point in the future.
> >
> > >>> +        movi            v0.16b, #0
> > >>> +        mov             w3, #16
> > >>> +
> > >>> +1:
> > >>> +        ld1             {v1.16b}, [x0], x1
> > >>> +        ld1             {v2.16b}, [x2], x1
> > >>> +        subs            w3, w3, #2
> > >>> +        uadalp          v0.8h, v1.16b
> > >>> +        uadalp          v0.8h, v2.16b
> > >>> +        b.ne            1b
> > >>> +
> > >>> +        uaddlv          s0, v0.8h
> > >>> +        fmov            w0, s0
> > >>> +
> > >>> +        ret
> > >>> +endfunc
> > >>> +
> > >>> +function ff_pix_norm1_neon, export=1
> > >>> +// x0  const uint8_t *pix
> > >>> +// x1  int line_size
> > >>> +
> > >>> +        sxtw            x1, w1
> > >>> +        movi            v4.16b, #0
> > >>> +        movi            v5.16b, #0
> > >>> +        mov             w2, #16
> > >>> +
> > >>> +1:
> > >>> +        ld1             {v1.16b}, [x0], x1
> > >>> +        subs            w2, w2, #1
> > >>> +        umull           v2.8h, v1.8b, v1.8b
> > >>> +        umull2          v3.8h, v1.16b, v1.16b
> > >>> +        uadalp          v4.4s, v2.8h
> > >>> +        uadalp          v5.4s, v3.8h
> > >>
> > >> From my earlier testing on A53, it seemed (surprisingly) to be equally
> > >> fast to accumulate into the same register for both instructions - but I
> > >> only tested that on A53. So we could change that here, getting rid of the
> > >> add at the end (and one movi). Or if it does help on some other core,
> > >> perhaps we should do the same for the function above too?
> > >
> > > Indeed, it is equally fast to accumulate into the same register on the
> > > A55 and A76 as well.
> > >
> > > New patches attached (patch 3/7 has functional changes, but patch 4/7
> > > only changes the commit message to reflect the new test run).
> >
> > LGTM very much now, thanks! And thanks for your patience through all the
> > iterations on such trivial patches as these.
>
> And thank you for your patience through the reviews :). I'm slowly
> getting up to speed with aarch64 and neon.
>
> I'll apply the pix_sum and pix_norm1 patches, and I'll wait a few days
> for any comments on the draw_edges patches.

Applied.
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
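
For readers following the thread without the patches at hand, here is a rough scalar sketch in C of what the two quoted NEON routines appear to compute: a sum of pixel values and a sum of squared pixel values over a 16x16 block. The block size is inferred from the loop counters in the quoted assembly. The names pix_sum_ref and pix_norm1_ref are hypothetical and are not FFmpeg's actual C functions. The stride is taken as ptrdiff_t, which is an assumption reflecting the planned move mentioned above; with that type no sign extension (the sxtw in the assembly) is needed. Each function uses a single accumulator, loosely mirroring the "accumulate into the same register" variant discussed in the review.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: sum of all pixels in a 16x16 block.
 * Assumption: line_size is ptrdiff_t, so a negative stride works
 * without any explicit sign extension. */
static int pix_sum_ref(const uint8_t *pix, ptrdiff_t line_size)
{
    int sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += pix[x];
        pix += line_size;   /* advance one row; may step backwards */
    }
    return sum;
}

/* Hypothetical scalar reference: sum of squared pixels in a 16x16 block.
 * The maximum is 256 * 255 * 255 = 16646400, which fits in a 32-bit int.
 * One accumulator is used for every product. */
static int pix_norm1_ref(const uint8_t *pix, ptrdiff_t line_size)
{
    int sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += pix[x] * pix[x];
        pix += line_size;
    }
    return sum;
}
```

A reference like this could be handy for checking an assembly version by hand with negative strides and with pixel values at both ends of the 0..255 range, which is the kind of manual testing the review asks for.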