On Mon, 2 Sep 2024 09:32:56 GMT, Maurizio Cimadamore <mcimadam...@openjdk.org> wrote:
>> If I run:
>>
>>     @Benchmark
>>     public long shift() {
>>         return ELEM_SIZE << 56 | ELEM_SIZE << 48 | ELEM_SIZE << 40 |
>>                ELEM_SIZE << 32 | ELEM_SIZE << 24 | ELEM_SIZE << 16 |
>>                ELEM_SIZE << 8 | ELEM_SIZE;
>>     }
>>
>>     @Benchmark
>>     public long mul() {
>>         return ELEM_SIZE * 0xFFFF_FFFF_FFFFL;
>>     }
>>
>> Then I get:
>>
>>     Benchmark       (ELEM_SIZE)  Mode  Cnt  Score   Error  Units
>>     TestFill.mul             31  avgt   30  0.586 ± 0.045  ns/op
>>     TestFill.shift           31  avgt   30  0.938 ± 0.017  ns/op
>>
>> On my M1 machine.
>
> I found similar small improvements to be had (I wrote about them offline)
> when replacing the bitwise-based tests (e.g. `foo & 4 != 0`) with a more
> explicit check for `remainingBytes >=4`. Seems like bitwise operations are
> not as optimized (or perhaps the assembly instructions for them is overall
> more convoluted - I haven't checked).

I've tried

    final long longValue = Byte.toUnsignedLong(value) * 0x0101010101010101L;

But it had the same performance as explicit bit shifting on M1.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20712#discussion_r1741664877
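
For reference, here is a minimal sketch of the two ways of replicating a byte across a long that are being compared: explicit shifts, and multiplication by 0x0101010101010101L. The class and method names (ByteBroadcast, broadcastShift, broadcastMul) are made up for illustration and are not taken from the PR. The sketch only checks that both forms produce the same bit pattern; it says nothing about their relative speed.

```java
public class ByteBroadcast {

    // Hypothetical helper: copies the unsigned byte into all eight
    // byte lanes of a long using explicit shifts and ORs.
    static long broadcastShift(byte value) {
        long b = Byte.toUnsignedLong(value);
        return b << 56 | b << 48 | b << 40 | b << 32
             | b << 24 | b << 16 | b << 8 | b;
    }

    // Hypothetical helper: same result via one multiplication. Since
    // b < 256, each partial product stays within its own byte lane,
    // so no carries cross between lanes.
    static long broadcastMul(byte value) {
        return Byte.toUnsignedLong(value) * 0x0101010101010101L;
    }

    public static void main(String[] args) {
        // Exhaustively compare both variants over every byte value.
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            byte v = (byte) i;
            if (broadcastShift(v) != broadcastMul(v)) {
                throw new AssertionError("mismatch for byte " + i);
            }
        }
        // prints 1f1f1f1f1f1f1f1f
        System.out.println(Long.toHexString(broadcastMul((byte) 31)));
    }
}
```

Since the two forms are bit-for-bit equivalent, it is at least plausible that a JIT ends up with similar code for both, which would be consistent with the identical timings on M1; that is a guess and has not been checked against the generated assembly.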