On Mon, 2 Sep 2024 08:56:47 GMT, Per Minborg <pminb...@openjdk.org> wrote:

>>> this can be u * 0xFFFFFFFFFFFFL if value != 0 and just 0L if not: not sure
>>> if fast(er), need to measure.
>>>
>>> Most of the time filling is happy with 0 since zeroing is the most common
>>> case
>>
>> It's a clever trick. However, I was looking at similar tricks and found that
>> the time spent here is irrelevant (e.g. I tried to always force `0` as the
>> value, and couldn't see any difference).
>
> If I run:
>
>
>     @Benchmark
>     public long shift() {
>         return ELEM_SIZE << 56 | ELEM_SIZE << 48 | ELEM_SIZE << 40 |
>                ELEM_SIZE << 32 | ELEM_SIZE << 24 | ELEM_SIZE << 16 | ELEM_SIZE << 8 |
>                ELEM_SIZE;
>     }
>
>     @Benchmark
>     public long mul() {
>         return ELEM_SIZE * 0xFFFF_FFFF_FFFFL;
>     }
>
> Then I get:
>
>     Benchmark       (ELEM_SIZE)  Mode  Cnt  Score   Error  Units
>     TestFill.mul             31  avgt   30  0.586 ± 0.045  ns/op
>     TestFill.shift           31  avgt   30  0.938 ± 0.017  ns/op
>
> On my M1 machine.

I found similar small improvements to be had (I wrote about them offline) when replacing the bitwise-based tests (e.g. `(foo & 4) != 0`) with a more explicit check for `remainingBytes >= 4`. It seems that bitwise operations are not as well optimized (or perhaps the assembly instructions generated for them are more convoluted overall - I haven't checked).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20712#discussion_r1740612559
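
For reference, a minimal sketch of the byte-broadcast trick discussed in the quoted thread, assuming the goal is to replicate one byte into all eight bytes of a `long`. The class and method names here are hypothetical and not taken from the PR. Note that the multiplier that matches the shift chain is `0x0101_0101_0101_0101L`; the `0xFFFF_FFFF_FFFFL` constant quoted in the thread yields a different value, so the two benchmark methods as quoted may not be computing the same thing.

```java
public final class ByteBroadcast {

    // Replicates the byte into all eight byte lanes using shifts and ORs,
    // mirroring the shape of the shift() benchmark quoted above.
    public static long viaShift(byte value) {
        long u = value & 0xFFL; // zero-extend so sign bits do not leak into upper lanes
        return u << 56 | u << 48 | u << 40 | u << 32
             | u << 24 | u << 16 | u << 8 | u;
    }

    // Same result with a single multiply: each set byte of the constant
    // places one copy of u, and no carries occur because u < 256.
    public static long viaMultiply(byte value) {
        long u = value & 0xFFL;
        return u * 0x0101_0101_0101_0101L;
    }

    public static void main(String[] args) {
        // Exhaustively check that both variants agree for every byte value.
        for (int i = 0; i < 256; i++) {
            byte b = (byte) i;
            if (viaShift(b) != viaMultiply(b)) {
                throw new AssertionError("mismatch for byte " + i);
            }
        }
        System.out.println(Long.toHexString(viaMultiply((byte) 31))); // prints 1f1f1f1f1f1f1f1f
    }
}
```

Whether the multiply form is actually faster inside `fill` would still need to be measured in context, as the thread itself points out.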