On Mon, 7 Mar 2022, Pop, Sebastian wrote:
Here are a few suggestions:
+add d18, d17, d18 // add to the end result register
[...]
+mov w0, v18.S[0] // copy result to general purpose register
I think you can use the 32-bit register s18 instead.
On Mon, 7 Mar 2022, Swinney, Jonathan wrote:
- ff_pix_abs16_neon
- ff_pix_abs16_xy2_neon
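Both functions compute a sum of absolute differences (SAD) over a 16-pixel-wide block. The xy2 variant compares against a reference block interpolated at the half-pel position in both x and y. The plain C sketch below illustrates that computation. It is not FFmpeg's code: the names `sad16`, `sad16_xy2` and `avg4` and their simplified signatures are hypothetical. The rounded four-pixel average `(a + b + c + d + 2) >> 2` is assumed to be the half-pel definition the NEON code has to reproduce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: SAD over a 16-pixel-wide block of h rows.
 * pix1 and pix2 point at the top-left pixel of each block and share
 * the same row stride. */
static int sad16(const uint8_t *pix1, const uint8_t *pix2,
                 ptrdiff_t stride, int h)
{
    assert(h >= 0);
    int sum = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(pix1[x] - pix2[x]);
        pix1 += stride;
        pix2 += stride;
    }
    return sum;
}

/* Rounded average of four neighbouring pixels (assumed half-pel rule). */
static int avg4(int a, int b, int c, int d)
{
    return (a + b + c + d + 2) >> 2;
}

/* Hypothetical helper: SAD against pix2 interpolated at (x + 1/2, y + 1/2).
 * Reads 17 columns and h + 1 rows of pix2. */
static int sad16_xy2(const uint8_t *pix1, const uint8_t *pix2,
                     ptrdiff_t stride, int h)
{
    assert(h >= 0);
    int sum = 0;
    const uint8_t *pix3 = pix2 + stride; /* row directly below pix2 */
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(pix1[x] - avg4(pix2[x], pix2[x + 1],
                                      pix3[x], pix3[x + 1]));
        pix1 += stride;
        pix2 += stride;
        pix3 += stride;
    }
    return sum;
}
```

The extra row and column read by the xy2 variant is what makes it more expensive than the plain version. A NEON implementation has to load the neighbouring pixels and perform the rounded average before the absolute-difference accumulation.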
In direct microbenchmarks of these ff functions versus their C implementations,
these functions performed as follows on AWS Graviton 2:
ff_pix_abs16_neon:
c: benchmark ran 10 iterations in 0.955383 s