On Tue, 12 Nov 2024 10:17:45 GMT, Francesco Nigro <d...@openjdk.org> wrote:

>> @minborg sent me some logs from his machine, and I'm analyzing them now.
>> 
>> Basically, I'm trying to see why your Java code is a bit faster than the 
>> Loop code.
>> 
>> ----------------
>> 
>>   44.77%                c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
>>   24.43%                c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
>>   21.80%                c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
>> 
>> There seem to be 3 hot regions.
>> 
>> **main-loop** (region has 44.77%):
>> 
>>              ;; B33: #  out( B33 B34 ) <- in( B32 B33 ) Loop( B33-B33 inner main of N116 strip mined) Freq: 4.62951e+10
>>    0.50%  ?   0x00000001149a23c0:   sxtw        x20, w4
>>           ?   0x00000001149a23c4:   add x22, x16, x20
>>    0.02%  ?   0x00000001149a23c8:   str q16, [x22]
>>   16.33%  ?   0x00000001149a23cc:   str q16, [x22, #16]             ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
>>           ?                                                         ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
>>           ?                                                         ; - jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
>>           ?                                                         ; - java.lang.invoke.VarHandleSegmentAsBytes::set@38 (line 114)
>>           ?                                                         ; - java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic@20
>>           ?                                                         ; - java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke@37           ...
>
> @eme64  not an expert with ARM, but profile skid is rather common on modern 
> big pipelined out-of-order CPUs
> 
>> with a strange extra add that has some strange looking percentage (profile 
>> inaccuracy?):
> 
> you should check some instructions below it to find the real culprit
> 
> More info on this topic:
> - https://travisdowns.github.io/blog/2019/08/20/interrupts.html for x86
> - 
> https://easyperf.net/blog/2018/06/08/Advanced-profiling-topics-PEBS-and-LBR#processor-event-based-sampling-pebs
> - https://ieeexplore.ieee.org/document/10068807 - Intel and AMD PEBS/IBS paper
> 
> If you use Intel/AMD with PEBS/IBS (if supported by your CPU), you can run 
> perfasm using precise events via `perfasm:events=cycles:P` IIRC (or adding 
> more Ps? @shipilev likely knows), which should have far less skid and will 
> simplify these analyses.

@franz1981 right. That is what I thought. I usually work on x64, and am not 
used to all the skid on ARM.
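For reference, the precise-event suggestion above would look roughly like this as a JMH invocation. This is a sketch, not a command from the thread: the benchmark name is taken from the profile output, while the jar path and the single `:P` precision level are assumptions (more `P`s may be needed depending on the CPU and perf version):

```shell
# Hypothetical JMH run with the perfasm profiler using a precise cycles event.
# The :P modifier asks perf for precise (e.g. PEBS on Intel) sampling,
# which attributes samples to instructions with much less skid.
java -jar benchmarks.jar \
    org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill.heapSegmentFillLoop \
    -prof perfasm:events=cycles:P
```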

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22010#issuecomment-2470162785
