Re: RFR: 8338967: Improve performance for MemorySegment::fill [v5]

Francesco Nigro Fri, 30 Aug 2024 07:21:49 -0700

On Fri, 30 Aug 2024 12:15:36 GMT, Per Minborg <[email protected]> wrote:


>> @minborg Hi! I haven't checked the numbers with the benchmark I wrote at 
>> https://github.com/openjdk/jdk/pull/20712#discussion_r1732802685 which is 
>> meant to stress the branch predictor (without enough `samples` i.e. past 
>> 128K on my machine) - can you give it a shot on M1 🙏 ?
>
> @franz1981 Here is what I get if I run your performance test on my M1 Mac 
> (unfortunately no -perf data):
> 
> 
> Base
> Benchmark                         (samples)  (shuffle)  Mode  Cnt        Score        Error  Units
> TestBranchFill.heap_segment_fill       1024      false  avgt   30    58597.625 ±   1871.313  ns/op
> TestBranchFill.heap_segment_fill       1024       true  avgt   30    64309.859 ±   1164.360  ns/op
> TestBranchFill.heap_segment_fill     128000      false  avgt   30  7136796.445 ± 152120.060  ns/op
> TestBranchFill.heap_segment_fill     128000       true  avgt   30  7908474.120 ±  49184.950  ns/op
> 
> 
> Patch
> Benchmark                         (samples)  (shuffle)  Mode  Cnt        Score       Error  Units
> TestBranchFill.heap_segment_fill       1024      false  avgt   30     3695.815 ±    24.615  ns/op
> TestBranchFill.heap_segment_fill       1024       true  avgt   30     3938.582 ±   124.510  ns/op
> TestBranchFill.heap_segment_fill     128000      false  avgt   30   420845.301 ±  1605.080  ns/op
> TestBranchFill.heap_segment_fill     128000       true  avgt   30  1778362.506 ± 39250.756  ns/op

Thanks @minborg for running it. It seems that at 128K samples, despite the 
additional call (due to something not being inlined), nuking the M1 pipeline 
is a severe affair (the shuffled case is roughly 4.2x slower than the 
unshuffled one):

Patch
Benchmark                         (samples)  (shuffle)  Mode  Cnt        Score       Error  Units
TestBranchFill.heap_segment_fill     128000      false  avgt   30   420845.301 ±  1605.080  ns/op
TestBranchFill.heap_segment_fill     128000       true  avgt   30  1778362.506 ± 39250.756  ns/op <----- HERE!

Now the interesting question: is there really some other non-branchy way to 
handle this, and is it worth it?
Sadly I don't have many answers here: when I did something similar for 
Netty at https://github.com/netty/netty/pull/13693 I went for a more 
drastic approach, see 
https://github.com/netty/netty/pull/13693/files#diff-49ee8d7612d5ecfcc27b46c38a801ad32ebdb169f7d79f1577313a1de70b0fbbR639-R649

TLDR:
- modern x86 CPUs are decent at filling unaligned data, but non-x86 ones are 
not so great: that led me to handle alignment (for off-heap memory, clearly)
- use 2 loops to reduce the branches (amortize, let's say) and hope the 
remaining 1-7 bytes work decently with unrolling and writing the bytes one 
by one, basically leveraging `Unsafe` for both
- the cutoff value is much higher because of the suboptimal pre-JDK 21 memset 
implementation (now fixed in main by @asgibbons)
- the array/heap case was already handled by `Arrays::fill`
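
To make the first two bullets concrete, here is a minimal sketch of the 
"align, then two loops" idea. It is not the Netty or the JDK code: it uses a 
direct `ByteBuffer` instead of `Unsafe` so that it runs standalone, and the 
class and method names (`TwoLoopFill`, `fill`) are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Objects;

// Hypothetical illustration only; assumes a direct (off-heap) buffer.
final class TwoLoopFill {

    static void fill(ByteBuffer buf, int offset, int length, byte value) {
        Objects.checkFromIndexSize(offset, length, buf.limit());
        int i = offset;
        final int end = offset + length;

        // Head: write single bytes until the address is 8-byte aligned.
        // alignmentOffset returns (address + index) modulo the unit size.
        int misalign = buf.alignmentOffset(i, Long.BYTES);
        int head = Math.min((Long.BYTES - misalign) & (Long.BYTES - 1), length);
        for (int h = 0; h < head; h++) {
            buf.put(i++, value);
        }

        // Loop 1: aligned 8-byte stores. The byte is replicated into all
        // eight lanes, so the buffer's byte order does not matter.
        final long pattern = (value & 0xFFL) * 0x0101010101010101L;
        for (; i + Long.BYTES <= end; i += Long.BYTES) {
            buf.putLong(i, pattern);
        }

        // Loop 2: the remaining 0-7 tail bytes, written one by one.
        for (; i < end; i++) {
            buf.put(i, value);
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(64);
        fill(buf, 3, 42, (byte) 0x5A);
        System.out.println(buf.get(2) + " " + buf.get(3) + " " + buf.get(44) + " " + buf.get(45));
    }
}
```

The only data-dependent branches left are the loop exits, which is the 
amortization mentioned above; whether this beats a plain memset depends on 
the cutoff and on the platform.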

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20712#issuecomment-2321405596
