On Thu, 15 May 2025 16:03:44 GMT, Andrew Haley <a...@openjdk.org> wrote:
>> This intrinsic is generally faster than the current implementation for
>> Panama segment operations for all writes larger than about 8 bytes in
>> size, increasing to more than 2x the performance on larger memory blocks
>> on Graviton 2, between "panama" (C2 generated, what we use now) and
>> "unsafe" (this intrinsic).
>>
>> Benchmark                       (aligned)  (size)  Mode  Cnt     Score    Error  Units
>> MemorySegmentFillUnsafe.panama       true  262143  avgt   10  7295.638 ±  0.422  ns/op
>> MemorySegmentFillUnsafe.panama      false  262143  avgt   10  8345.300 ± 80.161  ns/op
>> MemorySegmentFillUnsafe.unsafe       true  262143  avgt   10  2930.594 ±  0.180  ns/op
>> MemorySegmentFillUnsafe.unsafe      false  262143  avgt   10  3136.828 ±  0.232  ns/op
>
> Andrew Haley has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Copyright format correction

> One thing that sometimes helps is a count-leading-zeroes instruction
> followed by a multiway switch at the start, or just before the tail, to
> get started at the right place in the tail (its log-size cascade), for
> very small inputs.
>
> This PR #25383 uses clz in that way.
>
> It also uses an overlapping-store technique to reduce an O(lg N) tail to
> an O(1) tail, which also depends on the clz step.
>
> My rough notes on the relative performance of overlapping loads and stores
> are here FWIW: https://cr.openjdk.org/~jrose/jvm/PartialMemoryWord.cpp

Mmm, interesting. I had a look at the M1 timings to see what might be going
on, and I think it's because the processor can execute 8 instructions per
clock but only one taken branch. It can, however, execute two not-taken
branches per clock. At present, if our block is 1 (mod 64) bytes long, we
have a run of 5 taken branches and 1 not-taken branch.

However, I couldn't see anything in the JMH results. On closer inspection I
realized that performance was very much limited by the C2-generated caller,
which does far more work than the intrinsic itself. I eventually tweaked the
benchmark to call the intrinsic 1000 times per iteration, trivially
converting µs to ns.

I'm not keen on the overlapping-store technique in this case: the code gets,
IMHO, unjustifiably complex, and we would also have different timing
behaviour for differently aligned fill operations. This seems to me a bit
much for something that should be fairly simple.

I did, however, implement the clz-optimized tail. It's great for very short
lengths (mod 64) but worse for the range 32...63 (mod 64). It also loses the
early exit from the log-size cascade, which short-circuits fills that are a
whole number of words long.

I tried another thing, which was to have _two_ cascades: one of
whole-word-sized stores, and one for the remaining 0 to 7 bytes. This was
better for fills that are a whole number of words long, and for some other
cases, but it had its own timing spikes in a few places (e.g. 36 bytes).

I measured the total time for arrays of size 1-128 bytes and took the
average throughput:

  A: this PR, as checked in:  6.5 cycles/op, 1.8 ns
  B: one clz-optimized tail:  6.9 cycles/op, 1.9 ns
  C: two clz-optimized tails: 7.2 cycles/op, 2.0 ns

In conclusion: there isn't much in it. We could do better by keeping this
code as short as possible, which would allow us to inline the whole thing
into its caller rather than go through the palaver of a C-ABI call to the
stub.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25147#issuecomment-2913030626
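
For concreteness, here is a rough C sketch of the two tail shapes compared
above: the plain log-size cascade and a clz-dispatched entry into it. All
names (st, tail_cascade, tail_clz) are illustrative, and __builtin_clzll is
the GCC/Clang builtin; the PR's actual stub is AArch64 assembly, so this is
a sketch of the idea, not the checked-in code.

```c
#include <stdint.h>
#include <string.h>

/* v holds the fill byte replicated into all 8 byte lanes, e.g.
   v = 0x0101010101010101ull * fill_byte, so any low slice of v is
   also all fill bytes.  memcpy of a constant small size compiles to
   a single plain store. */
static inline void st(uint8_t *p, uint64_t v, size_t nbytes) {
    memcpy(p, &v, nbytes);
}

/* Plain log-size cascade: one test per power of two.  A length of
   1 (mod 64) walks past five skipped sizes before the final
   1-byte store. */
static void tail_cascade(uint8_t *p, size_t n, uint64_t v) {
    if (n & 32) { st(p, v, 8); st(p + 8, v, 8);
                  st(p + 16, v, 8); st(p + 24, v, 8); p += 32; }
    if (n & 16) { st(p, v, 8); st(p + 8, v, 8); p += 16; }
    if (n & 8)  { st(p, v, 8); p += 8; }
    if (n & 4)  { st(p, v, 4); p += 4; }
    if (n & 2)  { st(p, v, 2); p += 2; }
    if (n & 1)  { st(p, v, 1); }
}

/* clz-dispatched variant: a multiway branch on the position of the
   highest set bit of n jumps straight to the first store that can
   fire, so very short tails skip the leading tests entirely. */
static void tail_clz(uint8_t *p, size_t n, uint64_t v) {
    if (n == 0) return;
    switch (63 - __builtin_clzll(n)) {  /* highest set bit, 0..5 */
    case 5: st(p, v, 8); st(p + 8, v, 8);       /* bit 5 known set */
            st(p + 16, v, 8); st(p + 24, v, 8); p += 32;
            /* fall through */
    case 4: if (n & 16) { st(p, v, 8); st(p + 8, v, 8); p += 16; }
            /* fall through */
    case 3: if (n & 8)  { st(p, v, 8); p += 8; }
            /* fall through */
    case 2: if (n & 4)  { st(p, v, 4); p += 4; }
            /* fall through */
    case 1: if (n & 2)  { st(p, v, 2); p += 2; }
            /* fall through */
    case 0: if (n & 1)  { st(p, v, 1); }
    }
}
```

With the dispatch, a 1-byte tail jumps straight to the final store instead
of skipping five tests; for lengths 32...63 it plausibly loses because the
clz and the indirect branch cost more than the tests they replace there.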
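
The overlapping-store technique mentioned in the quoted comment can be
sketched in the same illustrative style: dispatch on clz to pick a store
width covering at least half the tail, then issue one store at the start
and one flush with the end, letting the two overlap. This is a guess at the
shape, not the code in #25383.

```c
#include <stdint.h>
#include <string.h>

/* Same illustrative helper as before: v is the fill byte replicated
   into all 8 byte lanes. */
static inline void st(uint8_t *p, uint64_t v, size_t nbytes) {
    memcpy(p, &v, nbytes);
}

/* O(1) tail for n in 0..15 using at most two possibly-overlapping
   stores.  There is no cascade: a single multiway dispatch on clz(n)
   selects the width, and the two stores cover the whole tail.  With
   16- or 32-byte vector stores the same shape covers a 0..63 tail. */
static void tail_overlap(uint8_t *p, size_t n, uint64_t v) {
    if (n == 0) return;
    switch (63 - __builtin_clzll(n)) {  /* highest set bit of n */
    case 3:                             /* 8 <= n <= 15 */
        st(p, v, 8); st(p + n - 8, v, 8); break;
    case 2:                             /* 4 <= n <= 7 */
        st(p, v, 4); st(p + n - 4, v, 4); break;
    case 1:                             /* 2 <= n <= 3 */
        st(p, v, 2); st(p + n - 2, v, 2); break;
    case 0:                             /* n == 1 */
        st(p, v, 1); break;
    }
}
```

Note that the second store's address depends on n, which is presumably
where the alignment-dependent timing behaviour objected to above comes
from: the pair of overlapping stores hits the store queue and cache lines
differently for different lengths and alignments.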
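
The two-cascade experiment (variant C) is described only in outline above,
but it would look something like the following sketch, again with
illustrative names: a cascade of whole-word stores driven by n / 8, then a
separate 0..7-byte cascade for n % 8, so fills that are a whole number of
words exit before the byte cascade.

```c
#include <stdint.h>
#include <string.h>

/* Same illustrative helper: v is the fill byte replicated into all
   8 byte lanes. */
static inline void st(uint8_t *p, uint64_t v, size_t nbytes) {
    memcpy(p, &v, nbytes);
}

/* Two-cascade tail: whole words first, then trailing bytes.  A fill
   that is a whole number of words takes the early exit and never
   touches the byte cascade. */
static void tail_two_cascades(uint8_t *p, size_t n, uint64_t v) {
    size_t words = n >> 3;              /* 0..7 whole words */
    if (words & 4) { st(p, v, 8); st(p + 8, v, 8);
                     st(p + 16, v, 8); st(p + 24, v, 8); p += 32; }
    if (words & 2) { st(p, v, 8); st(p + 8, v, 8); p += 16; }
    if (words & 1) { st(p, v, 8); p += 8; }
    size_t rest = n & 7;                /* 0..7 trailing bytes */
    if (rest == 0) return;              /* early exit for word multiples */
    if (rest & 4) { st(p, v, 4); p += 4; }
    if (rest & 2) { st(p, v, 2); p += 2; }
    if (rest & 1) { st(p, v, 1); }
}
```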