On Fri, 30 Aug 2024 12:15:36 GMT, Per Minborg <pminb...@openjdk.org> wrote:
>> @minborg Hi! I didn't check the numbers with the benchmark I've written at
>> https://github.com/openjdk/jdk/pull/20712#discussion_r1732802685 which is
>> meant to stress the branch predictor (without enough `samples` i.e. past
>> 128K on my machine) - can you give it a shot with M1 🙏 ?
>
> @franz1981 Here is what I get if I run your performance test on my M1 Mac
> (unfortunately no -perf data):
>
> Base
> Benchmark                         (samples)  (shuffle)  Mode  Cnt        Score        Error  Units
> TestBranchFill.heap_segment_fill       1024      false  avgt   30    58597.625 ±   1871.313  ns/op
> TestBranchFill.heap_segment_fill       1024       true  avgt   30    64309.859 ±   1164.360  ns/op
> TestBranchFill.heap_segment_fill     128000      false  avgt   30  7136796.445 ± 152120.060  ns/op
> TestBranchFill.heap_segment_fill     128000       true  avgt   30  7908474.120 ±  49184.950  ns/op
>
> Patch
> Benchmark                         (samples)  (shuffle)  Mode  Cnt        Score        Error  Units
> TestBranchFill.heap_segment_fill       1024      false  avgt   30     3695.815 ±     24.615  ns/op
> TestBranchFill.heap_segment_fill       1024       true  avgt   30     3938.582 ±    124.510  ns/op
> TestBranchFill.heap_segment_fill     128000      false  avgt   30   420845.301 ±   1605.080  ns/op
> TestBranchFill.heap_segment_fill     128000       true  avgt   30  1778362.506 ±  39250.756  ns/op

Thanks @minborg for running it. So it seems that at 128K samples, despite the additional call (due to something not being inlined), nuking the pipeline of the M1 is a severe affair:

Patch
Benchmark                         (samples)  (shuffle)  Mode  Cnt        Score        Error  Units
TestBranchFill.heap_segment_fill     128000      false  avgt   30   420845.301 ±   1605.080  ns/op
TestBranchFill.heap_segment_fill     128000       true  avgt   30  1778362.506 ±  39250.756  ns/op   <----- HERE!

Now the interesting question: is there really some other non-branchy way to handle this? Is it worth it?
Sadly, I don't have many answers on this: when I did something similar in Netty at
https://github.com/netty/netty/pull/13693
I decided on a more drastic approach, see
https://github.com/netty/netty/pull/13693/files#diff-49ee8d7612d5ecfcc27b46c38a801ad32ebdb169f7d79f1577313a1de70b0fbbR639-R649

TL;DR:
- modern x86 is decent at filling unaligned data, but non-x86 is not so great: that led me to handle alignment (for off-heap memory, clearly)
- use 2 loops to reduce the branches (amortize them, let's say) and hope the 1-7 bytes will work decently with unrolling and placing bytes individually
- but basically leveraging `Unsafe` for both
- the cutoff value is much higher because of the suboptimal pre-JDK 21 memset implementation (now fixed in main by @asgibbons)
- the array/heap case was already handled by `Arrays::fill`

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20712#issuecomment-2321405596
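
To make the TL;DR above more concrete, here is a minimal sketch of the "handle alignment, then amortize the branches over wide stores" idea. It is not the Netty code from the linked PR: it uses `ByteBuffer` absolute puts instead of `Unsafe`, the class and method names (`AlignedFillSketch`, `fill`) are made up for illustration, and it assumes the buffer's base address is itself 8-byte aligned, so that alignment can be derived from the index alone.

```java
import java.nio.ByteBuffer;

public final class AlignedFillSketch {
    private AlignedFillSketch() {}

    /** Fills buf[offset, offset + length) with value (hypothetical helper). */
    public static void fill(ByteBuffer buf, int offset, int length, byte value) {
        int index = offset;
        final int end = offset + length;

        // Head: at most 7 single-byte stores until index is a multiple of 8.
        // Assumption: the buffer base is 8-byte aligned, so the index decides alignment.
        int head = Math.min(length, (8 - (index & 7)) & 7);
        for (int i = 0; i < head; i++) {
            buf.put(index++, value);
        }

        // Body: broadcast the byte into all eight lanes of a long,
        // then issue one store (and one loop branch) per 8 bytes.
        // Byte order is irrelevant because every lane holds the same byte.
        final long pattern = (value & 0xFFL) * 0x0101010101010101L;
        for (; end - index >= 8; index += 8) {
            buf.putLong(index, pattern);
        }

        // Tail: the remaining 0-7 bytes, placed one at a time.
        while (index < end) {
            buf.put(index++, value);
        }
    }
}
```

The head and tail loops run at most 7 iterations each, so for large fills almost all of the work is in the 8-byte loop; whether this actually beats a plain `memset`-style intrinsic depends on the JDK version and CPU, as the numbers above suggest.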