On Wed, 9 Apr 2025 12:48:10 GMT, Thomas Schatzl <tscha...@openjdk.org> wrote:

>> src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 101:
>> 
>>> 99: }
>>> 100: 
>>> 101: void G1BarrierSetAssembler::gen_write_ref_array_post_barrier(MacroAssembler* masm, DecoratorSet decorators,
>> 
>> Have you measured the performance impact of inlining this assembly code 
>> instead of resorting to a runtime call as done before? Is it worth the 
>> maintenance cost (for every platform), risk of introducing bugs, etc.?
>
> I remember significant impact in some microbenchmark. It's also inlined in 
> Parallel GC. I do not consider it a big issue wrt to maintenance - these 
> things never really change, and the method is small and contained.
> I will try to redo numbers.

From our microbenchmarks (higher is better for the thrpt rows, lower is better for the avgt rows):

Current code:

Benchmark                                    (size)   Mode  Cnt       Score      Error   Units
ArrayCopyObject.conjoint_micro                   31  thrpt   15  166136.959 ± 5517.157  ops/ms
ArrayCopyObject.conjoint_micro                   63  thrpt   15  108880.108 ± 4331.112  ops/ms
ArrayCopyObject.conjoint_micro                  127  thrpt   15   93159.977 ± 5025.458  ops/ms
ArrayCopyObject.conjoint_micro                 2047  thrpt   15   17234.842 ±  831.344  ops/ms
ArrayCopyObject.conjoint_micro                 4095  thrpt   15    9202.216 ±  292.612  ops/ms
ArrayCopyObject.conjoint_micro                 8191  thrpt   15    3565.705 ±  121.116  ops/ms
ArrayCopyObject.disjoint_micro                   31  thrpt   15  159106.245 ± 5965.576  ops/ms
ArrayCopyObject.disjoint_micro                   63  thrpt   15   95475.658 ± 5415.267  ops/ms
ArrayCopyObject.disjoint_micro                  127  thrpt   15   84249.979 ± 6313.007  ops/ms
ArrayCopyObject.disjoint_micro                 2047  thrpt   15   10682.650 ±  381.832  ops/ms
ArrayCopyObject.disjoint_micro                 4095  thrpt   15    4471.940 ±  216.439  ops/ms
ArrayCopyObject.disjoint_micro                 8191  thrpt   15    1378.296 ±   33.421  ops/ms
ArrayCopy.arrayCopyObject                       N/A   avgt   15      13.880 ±    0.517   ns/op
ArrayCopy.arrayCopyObjectNonConst               N/A   avgt   15      14.844 ±    0.751   ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward     N/A   avgt   15      11.080 ±    0.703   ns/op
ArrayCopy.arrayCopyObjectSameArraysForward      N/A   avgt   15      11.003 ±    0.135   ns/op

Runtime call:

Benchmark                                    (size)   Mode  Cnt       Score       Error   Units
ArrayCopyObject.conjoint_micro                   31  thrpt   15   73100.230 ± 11079.381  ops/ms
ArrayCopyObject.conjoint_micro                   63  thrpt   15   65039.431 ±  1996.832  ops/ms
ArrayCopyObject.conjoint_micro                  127  thrpt   15   58336.711 ±  2260.660  ops/ms
ArrayCopyObject.conjoint_micro                 2047  thrpt   15   17035.419 ±   524.445  ops/ms
ArrayCopyObject.conjoint_micro                 4095  thrpt   15    9207.661 ±   286.526  ops/ms
ArrayCopyObject.conjoint_micro                 8191  thrpt   15    3264.491 ±    73.848  ops/ms
ArrayCopyObject.disjoint_micro                   31  thrpt   15   84587.219 ±  3007.310  ops/ms
ArrayCopyObject.disjoint_micro                   63  thrpt   15   62815.254 ±  1214.310  ops/ms
ArrayCopyObject.disjoint_micro                  127  thrpt   15   58423.470 ±   285.670  ops/ms
ArrayCopyObject.disjoint_micro                 2047  thrpt   15   10720.462 ±   617.173  ops/ms
ArrayCopyObject.disjoint_micro                 4095  thrpt   15    4178.195 ±   178.942  ops/ms
ArrayCopyObject.disjoint_micro                 8191  thrpt   15    1374.268 ±    44.290  ops/ms
ArrayCopy.arrayCopyObject                       N/A   avgt   15      19.667 ±     0.740   ns/op
ArrayCopy.arrayCopyObjectNonConst               N/A   avgt   15      21.243 ±     1.891   ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward     N/A   avgt   15      16.645 ±     0.504   ns/op
ArrayCopy.arrayCopyObjectSameArraysForward      N/A   avgt   15      17.409 ±     0.705   ns/op

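Obviously, with larger arrays the impact diminishes (from 2047 elements on, 
most results are within the error margins), but for small arrays and the 
ArrayCopy benchmarks it is significant. I think the inlined code is worth the 
effort in this case.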
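
For anyone who wants to compare the two tables row by row, here is a small Python sketch that turns a pair of scores into a relative factor. It is not part of the PR; the names RESULTS and speedup are made up for illustration, and only a handful of rows are copied over from the tables above. It assumes thrpt scores are higher-is-better and avgt scores are lower-is-better, and it ignores the error margins.

```python
# Hypothetical helper, not part of the PR: compares a few rows copied from
# the "Current code" (inlined barrier) and "Runtime call" tables.
# Each entry: (score with inlined barrier, score with runtime call, JMH mode)
RESULTS = {
    "conjoint_micro/31": (166136.959, 73100.230, "thrpt"),
    "disjoint_micro/31": (159106.245, 84587.219, "thrpt"),
    "conjoint_micro/8191": (3565.705, 3264.491, "thrpt"),
    "disjoint_micro/8191": (1378.296, 1374.268, "thrpt"),
    "arrayCopyObject": (13.880, 19.667, "avgt"),
}


def speedup(inlined, runtime_call, mode):
    """Return how many times faster the inlined variant is (> 1 means faster).

    thrpt is operations per time unit (higher is better);
    avgt is time per operation (lower is better), so the ratio is flipped.
    """
    if mode == "thrpt":
        return inlined / runtime_call
    return runtime_call / inlined


if __name__ == "__main__":
    for name, (inl, rt, mode) in RESULTS.items():
        print(f"{name:22s} {speedup(inl, rt, mode):.2f}x")
```

The factor alone does not say whether a difference is significant; for the large-array rows the reported error margins need to be taken into account as well.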

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2037086410
