On Mon, 11 Nov 2024 14:51:06 GMT, Per Minborg <pminb...@openjdk.org> wrote:

>> Thanks @minborg for this :) Please remember to add the misprediction count
>> if you can and avoid the bulk methods by having a `nextMemorySegment()`
>> benchmark method which make a single fill call site to observe the different
>> segments (types).
>>
>> Having separate call-sites which observe always the same type(s) "could" be
>> too lucky (and gentle) for the runtime (and CHA) and would favour to have a
>> single address entry (or few ones, if we include any optimization for the
>> fill size) in the Branch Target Buffer of the cpu.
>
> I've added a "mixed" benchmark. I am not sure I understood all of your
> comments but given my changes, maybe you could elaborate a bit more?

@minborg sent me some logs from his machine, and I'm analyzing them now.

Basically, I'm trying to see why your Java code is a bit faster than the Loop code.

----------------

    44.77%  c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
    24.43%  c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
    21.80%  c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946

There seem to be 3 hot regions.

**main-loop** (region has 44.77%):

             ;; B33: #  out( B33 B34 ) <- in( B32 B33 ) Loop( B33-B33 inner main of N116 strip mined) Freq: 4.62951e+10
     0.50%   0x00000001149a23c0: sxtw x20, w4
             0x00000001149a23c4: add x22, x16, x20
     0.02%   0x00000001149a23c8: str q16, [x22]
    16.33%   0x00000001149a23cc: str q16, [x22, #16]  ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
                 ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
                 ; - jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
                 ; - java.lang.invoke.VarHandleSegmentAsBytes::set@38 (line 114)
                 ; - java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic@20
                 ; - java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke@37
                 ; - java.lang.invoke.VarHandleGuards::guard_LJI_V@134 (line 1017)
                 ; - jdk.internal.foreign.AbstractMemorySegmentImpl::set@10 (line 670)
                 ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@44 (line 101)
             0x00000001149a23d0: add w4, w4, #0x20
     0.06%   0x00000001149a23d4: cmp w4, w10
             0x00000001149a23d8: b.lt 0x00000001149a23c0  // b.tstop;*ifge {reexecute=0 rethrow=0 return_oop=0}

**post-loops**: the "vectorized post-loop" and the "single iteration post-loop" (region has 24.43%):

vectorized post-loop (inner post)

             ;; B14: #  out( B14 B15 ) <- in( B35 B14 ) Loop( B14-B14 inner post of N1915) Freq: 174420
     2.20%   0x00000001149a224c: sxtw x5, w4
     0.88%   0x00000001149a2250: str q16, [x16, x5]  ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
                 ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
                 ; - jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
                 ; - java.lang.invoke.VarHandleSegmentAsBytes::set@38 (line 114)
                 ; - java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic@20
                 ; - java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke@37
                 ; - java.lang.invoke.VarHandleGuards::guard_LJI_V@134 (line 1017)
                 ; - jdk.internal.foreign.AbstractMemorySegmentImpl::set@10 (line 670)
                 ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@44 (line 101)
             0x00000001149a2254: add w4, w4, #0x10
             0x00000001149a2258: cmp w4, w10
             0x00000001149a225c: b.lt 0x00000001149a224c  // b.tstop;*ifge {reexecute=0 rethrow=0 return_oop=0}
                 ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@33 (line 100)
             ;; B15: #  out( B16 ) <- in( B14 ) Freq: 87210.2
     0.34%   0x00000001149a2260: add x10, x19, x5
             0x00000001149a2264: add x22, x10, #0x10  ;*ladd {reexecute=0 rethrow=0 return_oop=0}
                 ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@52 (line 100)
             ;; B16: #  out( B20 B17 ) <- in( B39 B15 B36 ) top-of-loop Freq: 174421
     0.78%   0x00000001149a2268: cmp w4, w3
             0x00000001149a226c: b.ge 0x00000001149a2294  // b.tcont
             ;; B17: #  out( B42 B18 ) <- in( B16 ) Freq: 87210.3
             0x00000001149a2270: cmp w4, w2
             0x00000001149a2274: b.cs 0x00000001149a24a4  // b.hs, b.nlast
                 ;*aload {reexecute=0 rethrow=0 return_oop=0}
                 ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@36 (line 101)

scalar post loop:

             ;; B18: #  out( B18 B19 ) <- in( B17 B18 ) Loop( B18-B18 inner post of N1402) Freq: 174420
     0.56%   0x00000001149a2278: sxtw x10, w4
     5.47%   0x00000001149a227c: strb wzr, [x16, x10, lsl #0]  ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
                 ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
                 ; - jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
                 ; - java.lang.invoke.VarHandleSegmentAsBytes::set@38 (line 114)
                 ; - java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic@20
                 ; - java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke@37
                 ; - java.lang.invoke.VarHandleGuards::guard_LJI_V@134 (line 1017)
                 ; - jdk.internal.foreign.AbstractMemorySegmentImpl::set@10 (line 670)
                 ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@44 (line 101)
             0x00000001149a2280: add w4, w4, #0x1
             0x00000001149a2284: cmp w4, w3
             0x00000001149a2288: b.lt 0x00000001149a2278  // b.tstop

Not sure why we have this below... probably the check that leads to the post-loop?

             ;; B19: #  out( B20 ) <- in( B18 ) Freq: 87210.2
     8.88%   0x00000001149a228c: add x10, x10, x19
             0x00000001149a2290: add x22, x10, #0x1  ;*ifge {reexecute=0 rethrow=0 return_oop=0}
                 ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@33 (line 100)
             ;; B20: #  out( B2 B21 ) <- in( B23 B19 B16 ) Freq: 174760
     0.78%   0x00000001149a2294: cmp x22, x7
             0x00000001149a2298: b.ge 0x00000001149a219c  // b.tcont

**pre-loop** (region has 21.80%):

             ;; B27: #  out( B29 B28 ) <- in( B26 B28 ) Loop( B27-B28 inner pre of N1402) Freq: 348842
     0.10%   0x00000001149a2364: sxtw x22, w10
     6.01%   0x00000001149a2368: strb wzr, [x16, x22, lsl #0]  ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
                 ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
                 ; - jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
                 ; - java.lang.invoke.VarHandleSegmentAsBytes::set@38 (line 114)
                 ; - java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic@20
                 ; - java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke@37
                 ; - java.lang.invoke.VarHandleGuards::guard_LJI_V@134 (line 1017)
                 ; - jdk.internal.foreign.AbstractMemorySegmentImpl::set@10 (line 670)
                 ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@44 (line 101)
     0.08%   0x00000001149a236c: add w4, w10, #0x1
     0.56%   0x00000001149a2370: cmp w4, w20
     0.04%   0x00000001149a2374: b.ge 0x00000001149a2380  // b.tcont;*ifge {reexecute=0 rethrow=0 return_oop=0}
                 ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop@33 (line 100)
             ;; B28: #  out( B27 ) <- in( B27 ) Freq: 174421
     5.61%   0x00000001149a2378: mov w10, w4
             0x00000001149a237c: b 0x00000001149a2364

with a strange extra add that has some strange looking percentage (profile inaccuracy?):

     7.88%   0x00000001149a2380: add w10, w10, #0x20

**Summary**:
pre-loop: 22%, byte-store
main-loop: 40% 2x 16-byte-vector-store (profiling is a bit contradictory here - is it 16% or 44%?)
vectorized post-loop: 4% 1x 16-byte-vector-store (not super sure about profiling, but could be accurate)
post-loop: 12% byte-store

The numbers don't quite add up - but they are still somewhat telling - and I think probably accurate enough to see what happens.

Basically: we waste a lot of time in the pre and post-loop: getting alignment and then finishing off at the end.

-------------------

And to compare:

    58.00%  c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava, version 5, compile id 848
    29.83%  c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava, version 5, compile id 848

We have 2 hot regions.

**main** (58%):

             ;; B40: #  out( B40 B41 ) <- in( B39 B40 ) Loop( B40-B40 inner main of N140 strip mined) Freq: 2.13696e+08
     0.26%   0x000000011800f900: add x4, x1, w3, sxtw
             ;; merged str pair
             0x000000011800f904: stp xzr, xzr, [x4]
             0x000000011800f908: str xzr, [x4, #16]  ;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
                 ; - jdk.internal.misc.Unsafe::putLongUnaligned@10 (line 3677)
                 ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal@17 (line 2605)
                 ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned@8 (line 2593)
                 ; - jdk.internal.foreign.SegmentBulkOperations::fill@133 (line 78)
                 ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                 ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14 (line 83)
             0x000000011800f90c: add w3, w3, #0x20  ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                 ; - jdk.internal.foreign.SegmentBulkOperations::fill@136 (line 77)
                 ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                 ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14 (line 83)
    21.73%   0x000000011800f910: str xzr, [x4, #24]  ;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
                 ; - jdk.internal.misc.Unsafe::putLongUnaligned@10 (line 3677)
                 ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal@17 (line 2605)
                 ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned@8 (line 2593)
                 ; - jdk.internal.foreign.SegmentBulkOperations::fill@133 (line 78)
                 ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                 ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14 (line 83)
     0.17%   0x000000011800f914: cmp w3, w2
     2.58%   0x000000011800f918: b.lt 0x000000011800f900  // b.tstop;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                 ; - jdk.internal.foreign.SegmentBulkOperations::fill@98 (line 77)
                 ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                 ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14 (line 83)
             ;; B41: #  out( B39 B42 ) <- in( B40 ) Freq: 3.29583e+06
    26.13%   0x000000011800f91c: ldr x2, [x28, #48]  ; ImmutableOopMap {r12=Oop r14=Oop c_rarg1=Derived_oop_r14 r15=Oop r16=Oop }

**Rest**:

vectorized post-loop

             ;; B2: #  out( B2 B3 ) <- in( B42 B2 ) Loop( B2-B2 inner post of N1701) Freq: 50831.6
     3.01%   0x000000011800f728: str xzr, [x1, w3, sxtw]  ;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
                 ; - jdk.internal.misc.Unsafe::putLongUnaligned@10 (line 3677)
                 ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal@17 (line 2605)
                 ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned@8 (line 2593)
                 ; - jdk.internal.foreign.SegmentBulkOperations::fill@133 (line 78)
                 ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                 ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14 (line 83)
             0x000000011800f72c: add w3, w3, #0x8  ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                 ; - jdk.internal.foreign.SegmentBulkOperations::fill@136 (line 77)
                 ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                 ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14 (line 83)
             0x000000011800f730: cmp w3, w10
             0x000000011800f734: b.lt 0x000000011800f728  // b.tstop;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                 ; - jdk.internal.foreign.SegmentBulkOperations::fill@98 (line 77)
                 ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                 ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava@14 (line 83)
             ;; B3: #  out( B5 B4 ) <- in( B2 B43 B44 ) top-of-loop Freq: 51627.8

... and then the rest of the code I speculate is your **long-int-short-byte wind-down code**.

-----------------------

**Conclusion:**
Java: spends about 58% in well vectorized main-loop code (2x super-unrolled, i.e. 2x 16-byte-vectors)
Loop: only spends about 40% in main loop (also 2x 16-byte vectors) - the rest is spent in pre/post-loops

Hmm. This really makes me want to ditch the alignment-code - it may hurt more than we gain from it :thinking:

And we should also consider such "wind-down" code: going from 16-element vectors to 8, 4, 2, 1 elements. Of course that is extra code and extra compile time...

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22010#issuecomment-2470102192
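
For readers who want to picture the "long-int-short-byte wind-down" shape mentioned in the analysis, here is a minimal sketch. It is only an illustration under my own assumptions and is not the code of `SegmentBulkOperations::fill`: the class name `WindDownFill` and its `fill` method are hypothetical, and it works on a plain `java.nio.ByteBuffer` instead of a `MemorySegment` so that it also runs on older JDKs. The idea is a main loop of 8-byte stores with no alignment pre-loop, followed by at most one 4-byte, one 2-byte and one 1-byte store for the tail.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not the JDK implementation.
final class WindDownFill {

    // Fills bytes [0, buf.limit()) of buf with value, using absolute puts.
    static void fill(ByteBuffer buf, byte value) {
        int len = buf.limit();
        // Replicate the byte into all 8 lanes of a long. Since every lane is
        // identical, the buffer's byte order does not matter.
        long pattern = (value & 0xFFL) * 0x0101010101010101L;

        int i = 0;
        // Main loop: 8 bytes per store, no alignment pre-loop.
        for (; i + Long.BYTES <= len; i += Long.BYTES) {
            buf.putLong(i, pattern);
        }

        // Wind-down: 0..7 bytes remain, handled by at most three stores.
        int remaining = len - i;
        if ((remaining & 4) != 0) {
            buf.putInt(i, (int) pattern);
            i += Integer.BYTES;
        }
        if ((remaining & 2) != 0) {
            buf.putShort(i, (short) pattern);
            i += Short.BYTES;
        }
        if ((remaining & 1) != 0) {
            buf.put(i, (byte) pattern);
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(23); // 2 longs + int + short + byte
        fill(buf, (byte) 0x5A);
        boolean allSet = true;
        for (int i = 0; i < buf.limit(); i++) {
            allSet &= buf.get(i) == (byte) 0x5A;
        }
        System.out.println("all bytes set: " + allSet);
    }
}
```

In this shape the tail costs at most three stores regardless of size, whereas a byte-wise post-loop after a 16-byte vector loop can run up to 15 iterations. Whether C2 should emit something similar is exactly the open question above.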