> Thanks. I noticed that this stuff is simple enough that we can use
> port/simd.h (with a few added functions). This is especially nice because
> it takes care of x86, too. The performance gains look similar to what you
> reported for v6:

This looks good, much cleaner. One possible improvement would be to use a
vectorized table lookup instead of compare and add. I compared v6 and v7
Neon versions, and v6 is always faster. I’m not sure if SSE2 has a table
lookup similar to Neon.

arm - m7g.4xlarge
  buf  | v6-Neon | v7-Neon | % diff
-------+---------+---------+--------
    64 |    6.16 |    8.57 |  28.07
   128 |   11.37 |   15.77 |  27.87
   256 |   18.54 |   30.28 |  38.77
   512 |   33.98 |   62.15 |  45.33
  1024 |   64.46 |  117.55 |  45.16
  2048 |  124.28 |  254.86 |  51.24
  4096 |  243.47 |  509.23 |  52.19
  8192 |  487.34 |  953.81 |  48.91

-----
Chiranmoy
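
For reference, a minimal sketch of the two approaches compared above. It
assumes the operation being vectorized is mapping 4-bit values to lowercase
ASCII hex digits, which is not stated in this message. The function names
are hypothetical and are not taken from either patch version. The
compare-and-add variant is written as a scalar model of the per-lane
arithmetic. The lookup variant uses vqtbl1q_u8 on AArch64. On x86 the closest
equivalent, _mm_shuffle_epi8 (PSHUFB), needs SSSE3 and is not part of
baseline SSE2, so the sketch falls back to a scalar loop when neither is
available.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#elif defined(__SSSE3__)
#include <tmmintrin.h>
#endif

/* 16-entry table; only the first 16 bytes are ever loaded. */
static const char hextbl[] = "0123456789abcdef";

/*
 * Compare-and-add (hypothetical name): add '0' to every nibble, then add
 * an extra 39 ('a' - '0' - 10) to the lanes whose value is above 9.
 * Scalar model of what each vector lane would compute.
 */
static void
nibbles_to_hex_cmp_add(const uint8_t in[16], uint8_t out[16])
{
	for (int i = 0; i < 16; i++)
	{
		uint8_t		mask = (uint8_t) -(in[i] > 9);	/* 0xFF or 0x00 */

		out[i] = (uint8_t) (in[i] + '0' + (mask & ('a' - '0' - 10)));
	}
}

/*
 * Table lookup (hypothetical name): use each nibble as an index into the
 * 16-byte table with a single byte-shuffle instruction.
 * Input bytes are assumed to be in the range 0..15.
 */
static void
nibbles_to_hex_lookup(const uint8_t in[16], uint8_t out[16])
{
#if defined(__aarch64__)
	uint8x16_t	tbl = vld1q_u8((const uint8_t *) hextbl);
	uint8x16_t	idx = vld1q_u8(in);

	vst1q_u8(out, vqtbl1q_u8(tbl, idx));
#elif defined(__SSSE3__)
	__m128i		tbl = _mm_loadu_si128((const __m128i *) hextbl);
	__m128i		idx = _mm_loadu_si128((const __m128i *) in);

	_mm_storeu_si128((__m128i *) out, _mm_shuffle_epi8(tbl, idx));
#else
	/* Portable fallback when no byte-shuffle instruction is available. */
	for (int i = 0; i < 16; i++)
		out[i] = (uint8_t) hextbl[in[i] & 0x0F];
#endif
}
```

Both functions produce the same output for inputs in 0..15. The lookup
replaces the compare, mask, and second add with one shuffle, which may
account for part of the difference in the table above, though that has not
been measured here.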