On Mon, 2 Sep 2024 08:56:47 GMT, Per Minborg <pminb...@openjdk.org> wrote:

>>> this can be u * 0xFFFFFFFFFFFFL if value != 0 and just 0L if not: not sure 
>>> if fast(er), need to measure.
>>> 
>>> Most of the time filling is happy with 0 since zeroing is the most common 
>>> case
>> 
>> It's a clever trick. However, I was looking at similar tricks and found that 
>> the time spent here is irrelevant (e.g. I tried to always force `0` as the 
>> value, and couldn't see any difference).
>
> If I run:
> 
> 
>     @Benchmark
>     public long shift() {
>         return ELEM_SIZE << 56 | ELEM_SIZE << 48 | ELEM_SIZE << 40
>              | ELEM_SIZE << 32 | ELEM_SIZE << 24 | ELEM_SIZE << 16
>              | ELEM_SIZE << 8 | ELEM_SIZE;
>     }
> 
>     @Benchmark
>     public long mul() {
>         return ELEM_SIZE * 0xFFFF_FFFF_FFFFL;
>     }
> 
> Then I get:
> 
> Benchmark       (ELEM_SIZE)  Mode  Cnt  Score   Error  Units
> TestFill.mul             31  avgt   30  0.586 ± 0.045  ns/op
> TestFill.shift           31  avgt   30  0.938 ± 0.017  ns/op
> 
> On my M1 machine.
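
As a side note, here is a minimal sketch of the byte-replication trick being 
compared, assuming the value is a `long` holding a single byte. The class and 
method names are made up for illustration. Multiplying by 
`0x0101_0101_0101_0101L` should give the same bits as the shift-and-OR chain. 
As far as I can tell, the `0xFFFF_FFFF_FFFFL` constant in the quoted `mul()` 
has no such per-byte pattern, so the two benchmark bodies may not be computing 
the same value.

```java
final class ByteBroadcast {
    // Replicates the low 8 bits of b into every byte of a long with shifts and ORs.
    static long broadcastShift(long b) {
        b &= 0xFF;
        return b << 56 | b << 48 | b << 40 | b << 32
             | b << 24 | b << 16 | b << 8 | b;
    }

    // Same result with one multiplication: the constant has a 1 in each byte,
    // and a value below 256 cannot carry from one byte into the next.
    static long broadcastMul(long b) {
        return (b & 0xFF) * 0x0101_0101_0101_0101L;
    }

    public static void main(String[] args) {
        // Check every possible byte value.
        for (int v = 0; v < 256; v++) {
            if (broadcastShift(v) != broadcastMul(v)) {
                throw new AssertionError("mismatch for " + v);
            }
        }
        System.out.println(Long.toHexString(broadcastMul(31))); // 1f1f1f1f1f1f1f1f
    }
}
```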

I found similar small improvements (I wrote about them offline) when replacing 
the bitwise tests (e.g. `(foo & 4) != 0`) with a more explicit check such as 
`remainingBytes >= 4`. It seems bitwise operations are not as well optimized 
(or perhaps the assembly instructions generated for them are more convoluted 
overall; I haven't checked).
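
To make the two styles concrete, here is a rough sketch of a tail fill for the 
last 0..7 bytes, written both ways. This is not the code from the PR: the 
`TailFill` class, its method names, and the use of `ByteBuffer` are assumptions 
made only for illustration. The bit-test form needs no running count, while the 
comparison form has to decrement `remaining` after each store.

```java
import java.nio.ByteBuffer;

final class TailFill {
    // Bit-test variant: each bit of remaining (assumed to be in 0..7) selects one store.
    static void fillTailBitwise(ByteBuffer dst, int offset, int remaining, byte value) {
        int v = value & 0xFF;
        if ((remaining & 4) != 0) {      // parentheses needed: != binds tighter than &
            dst.putInt(offset, v * 0x01010101);
            offset += 4;
        }
        if ((remaining & 2) != 0) {
            dst.putShort(offset, (short) (v * 0x0101));
            offset += 2;
        }
        if ((remaining & 1) != 0) {
            dst.put(offset, value);
        }
    }

    // Comparison variant: same stores, but the count is updated as bytes are written.
    static void fillTailCompare(ByteBuffer dst, int offset, int remaining, byte value) {
        int v = value & 0xFF;
        if (remaining >= 4) {
            dst.putInt(offset, v * 0x01010101);
            offset += 4;
            remaining -= 4;
        }
        if (remaining >= 2) {
            dst.putShort(offset, (short) (v * 0x0101));
            offset += 2;
            remaining -= 2;
        }
        if (remaining >= 1) {
            dst.put(offset, value);
        }
    }

    public static void main(String[] args) {
        ByteBuffer a = ByteBuffer.allocate(8);
        ByteBuffer b = ByteBuffer.allocate(8);
        fillTailBitwise(a, 0, 7, (byte) 0x5A);
        fillTailCompare(b, 0, 7, (byte) 0x5A);
        System.out.println(java.util.Arrays.equals(a.array(), b.array()));
    }
}
```

Both variants should write the same bytes for any count in 0..7; which one the 
JIT compiles to faster code would still need to be measured.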

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20712#discussion_r1740612559
