On Mon, 26 Aug 2024 13:29:40 GMT, Per Minborg <pminb...@openjdk.org> wrote:
>> src/java.base/share/classes/jdk/internal/foreign/AbstractMemorySegmentImpl.java line 200:
>>
>>> 198:         switch ((int) length) {
>>> 199:             case 0 : checkReadOnly(false); checkValidState(); break; // Explicit tests
>>> 200:             case 1 : set(JAVA_BYTE, 0, value); break;
>>
>> Beware of using a switch here: if this code is too big to be inlined (or
>> we're unlucky), it will suffer from branch mispredictions when the different
>> "small fills" are unstable/unpredictable.
>> A test that feeds a different fill size on each iteration and counts
>> branch misses will reveal whether the improvement is worthwhile even in
>> such cases.
>
> It is true that this is a compromise: we give up inline space and
> code-cache space, and we add complexity, in exchange for the prospect of
> better small-size performance. Depending on the workload, this may or may not
> pay off. In the (presumably common) case where we allocate/fill small
> segments of constant sizes, this is likely a win. Writing a dynamic
> performance test sounds like a good idea.

Here is a benchmark that fills 16 heap segments of pseudo-random sizes (0 to 7 bytes):

    import java.lang.foreign.MemorySegment;
    import java.util.Random;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.IntStream;

    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.AverageTime)
    @Warmup(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
    @Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
    @State(Scope.Thread)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @Fork(value = 3)
    public class TestFill {

        // Number of segments filled per benchmark invocation
        private static final int SIZE = 16;

        // Fixed-seed pseudo-random segment sizes in the range [0, 8)
        private static final int[] INDICES = new Random(42).ints(0, 8)
                .limit(SIZE)
                .toArray();

        private MemorySegment[] segments;

        @Setup
        public void setup() {
            // One heap segment per size in INDICES
            segments = IntStream.of(INDICES)
                    .mapToObj(i -> MemorySegment.ofArray(new byte[i]))
                    .toArray(MemorySegment[]::new);
        }

        @Benchmark
        public void heap_segment_fill() {
            // Fill all segments so that consecutive fills see different sizes
            for (int i = 0; i < SIZE; i++) {
                segments[i].fill((byte) 0);
            }
        }

    }

This produces the following on my Mac M1:

    Benchmark                   Mode  Cnt   Score   Error  Units
    TestFill.heap_segment_fill  avgt   30  59.054 ± 3.723  ns/op

Each benchmark invocation performs 16 fills, so a single fill takes 59/16 ≈ 3.7 ns on average.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20712#discussion_r1731331461
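
For reference, the per-fill figure quoted above can be reproduced with a few lines of Java. This is only a sketch of the arithmetic: the class and method names (`PerFillCost`, `perFill`) are hypothetical and not part of the JDK or JMH, and dividing the reported error by the number of fills is a simplifying assumption about how the error scales.

```java
import java.util.Locale;

// Hypothetical helper, not part of the JDK or JMH: converts a JMH score for a
// benchmark method that performs several fills into a per-fill figure.
final class PerFillCost {

    // Divides a per-invocation time (ns/op) by the number of fills per invocation.
    static double perFill(double nsPerInvocation, int fillsPerInvocation) {
        if (fillsPerInvocation <= 0) {
            throw new IllegalArgumentException("fillsPerInvocation must be positive");
        }
        return nsPerInvocation / fillsPerInvocation;
    }

    public static void main(String[] args) {
        double score = 59.054; // ns/op, from the result table above
        double error = 3.723;  // ns/op, from the result table above
        int fills = 16;        // SIZE in the benchmark

        // Assumption: the error is scaled by the same factor as the score.
        System.out.println(String.format(Locale.ROOT, "%.2f +/- %.2f ns per fill",
                perFill(score, fills), perFill(error, fills)));
    }
}
```

Note that this average covers all 16 segment sizes together; it does not say how the cost is distributed between, for example, the zero-length and the 7-byte fills.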