Anton Vinogradov created IGNITE-28836:
-----------------------------------------

             Summary: DirectMessageWriter: reduce per-field overhead and 
per-message allocations on the message serialization hot path
                 Key: IGNITE-28836
                 URL: https://issues.apache.org/jira/browse/IGNITE-28836
             Project: Ignite
          Issue Type: Task
            Reporter: Anton Vinogradov
            Assignee: Anton Vinogradov


h3. Motivation

org.apache.ignite.internal.direct.DirectMessageWriter is on the critical path 
of every outgoing message: generated serializers call it field by field, and 
the NIO write loop re-enters it for every network buffer. Two inefficiencies 
show up there:

# Per-field stream resolution. Each of the ~32 writeXxx methods starts with 
{{DirectByteBufferStream stream = state.item().stream;}}, i.e. a stack[pos] 
lookup (array load + bounds check) followed by a field load, and this is 
re-evaluated on every primitive write. The current stream changes only in 
setBuffer / beforeNestedWrite / afterNestedWrite.
# Per-field allocations in the compressed path. writeCompressedMessage() 
allocates, for every compressed field, a fresh ByteBuffer.allocateDirect(10KB) 
plus a brand-new DirectMessageWriter (its own state stack + stream). The 
scratch buffer is only ever copied into a heap byte[] by 
CompressedMessage.compress() (via buf.get(...)) before deflating, so the direct 
allocation (native alloc + zeroing + Cleaner/GC reclamation) is pure overhead. 
Heavy exchange messages (GridDhtPartitionsSingleMessage / FullMessage) carry 
several compressed maps each, multiplying the cost during PME.
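
The second point can be illustrated with a minimal sketch. It assumes a compress step that copies the scratch buffer into a heap byte[] and deflates it with java.util.zip.Deflater; the class and method names are hypothetical and this is not the actual CompressedMessage code. Under that assumption, a direct and a heap scratch buffer produce identical compressed bytes:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.Deflater;

/** Hypothetical stand-in for the compressed path; not Ignite code. */
public class ScratchBufferSketch {
    /** Copies the buffer contents to a heap array, then deflates them. */
    static byte[] compress(ByteBuffer buf) {
        byte[] raw = new byte[buf.remaining()];
        buf.get(raw); // heap copy happens regardless of the buffer type

        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();

        byte[] out = new byte[raw.length + 64];
        int n = 0;
        while (!deflater.finished()) {
            if (n == out.length)
                out = Arrays.copyOf(out, out.length * 2);
            n += deflater.deflate(out, n, out.length - n);
        }
        deflater.end();
        return Arrays.copyOf(out, n);
    }

    /** Writes the same sample payload into any buffer and flips it. */
    static ByteBuffer fill(ByteBuffer buf) {
        for (int i = 0; i < 256; i++)
            buf.putInt(i % 7);
        buf.flip();
        return buf;
    }

    /** Compares compressed output of a direct and a heap scratch buffer. */
    static boolean sameOutput() {
        byte[] fromDirect = compress(fill(ByteBuffer.allocateDirect(10 * 1024)));
        byte[] fromHeap = compress(fill(ByteBuffer.allocate(10 * 1024)));
        return Arrays.equals(fromDirect, fromHeap);
    }
}
```

Since the bytes end up in a heap array either way, the buffer type affects only allocation and reclamation cost, not the output.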

h3. Proposed changes

* Cache the current state item's stream in a curStream field; refresh it only 
in setBuffer, beforeNestedWrite, afterNestedWrite. All writeXxx methods use 
curStream instead of re-resolving state.item().stream.
* In writeCompressedMessage():
** use ByteBuffer.allocate() (heap) for the scratch buffer instead of 
allocateDirect();
** reuse a lazily-created, thread-confined tmpWriter (reset() before each use) 
instead of allocating a new writer per field — mirroring how the main writer is 
already reused across messages;
** grow the scratch buffer without the intermediate byte[] copy.

No wire-format change, no public API change.
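
The stream-caching idea from the first bullet can be sketched roughly as follows. Stream, StateItem and Writer are simplified, hypothetical stand-ins rather than the real DirectByteBufferStream / DirectMessageState / DirectMessageWriter, and the way the nested state item receives its buffer is an assumption made to keep the example short:

```java
import java.nio.ByteBuffer;

/** Hypothetical, simplified model of the writer; not Ignite code. */
public class CachedStreamSketch {
    /** Minimal stream over the current network buffer. */
    static final class Stream {
        ByteBuffer buf;

        void writeInt(int v) { buf.putInt(v); }
    }

    /** One entry of the writer's state stack. */
    static final class StateItem {
        final Stream stream = new Stream();
    }

    static final class Writer {
        private final StateItem[] stack = { new StateItem(), new StateItem() };
        private int pos;

        /** Cached stack[pos].stream, refreshed only where it can change. */
        private Stream curStream = stack[0].stream;

        void setBuffer(ByteBuffer buf) {
            stack[pos].stream.buf = buf;
            curStream = stack[pos].stream;
        }

        void beforeNestedWrite() {
            ByteBuffer buf = curStream.buf;
            pos++;
            stack[pos].stream.buf = buf; // assumption: nested item shares the buffer
            curStream = stack[pos].stream;
        }

        void afterNestedWrite() {
            pos--;
            curStream = stack[pos].stream;
        }

        /** Hot path: one field load, no stack[pos] lookup. */
        void writeInt(int v) { curStream.writeInt(v); }
    }

    /** Writes three ints, the middle one at a nested level, and reads them back. */
    static int[] demo() {
        ByteBuffer buf = ByteBuffer.allocate(12);
        Writer w = new Writer();
        w.setBuffer(buf);
        w.writeInt(1);
        w.beforeNestedWrite();
        w.writeInt(2);
        w.afterNestedWrite();
        w.writeInt(3);
        buf.flip();
        return new int[] { buf.getInt(), buf.getInt(), buf.getInt() };
    }
}
```

The cached field must be refreshed in every method that moves pos or swaps the buffer; missing one of them would send writes to a stale stream.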

h3. Benchmark (JMH, JDK 17, throughput; A/B baseline vs patched)

|| Benchmark || Baseline || Patched || Delta ||
| hotPathPrimitiveFields (1792 write calls/op) | ~551K ops/s | ~682K ops/s | +24% |
| compressed scratch acquire (direct+new -> heap+reuse) | 1.21M ops/s | 3.68M ops/s | x3.0 |
| compressed scratch: GC time | 1006 ms | 130 ms | ~8x lower |
The compressed path trades a little cheap young-gen heap churn for the 
elimination of off-heap / Cleaner direct-buffer churn, cutting total GC time 
~8x.

h3. Testing

* A JMH benchmark JmhDirectMessageWriterBenchmark is added under 
modules/benchmarks.
* Correctness verified by byte-for-byte writer->reader round-trips, identical 
between baseline and patched:
** primitives/arrays/String/UUID, 5000 records, 32-byte write buffer (thousands 
of setBuffer cycles);
** compressed map (4000 entries -> exercises scratch-buffer doubling; 16-byte 
chunks; second marshal reusing the writer -> exercises the tmpWriter.reset() 
branch).
* Existing DirectMarshallingMessagesTest covers the nested / compressed 
serialization paths.
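
The shape of the compressed-map check (scratch-buffer doubling on the first marshal, writer reuse via reset() on the second) might look like the sketch below. TmpWriter and Marshaller are hypothetical illustrations, not the classes or the tests in the patch; the 16-byte initial capacity is chosen only so that doubling is exercised:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical model of a reusable scratch writer; not Ignite code. */
public class ScratchReuseSketch {
    /** Writer over a heap scratch buffer that doubles when full. */
    static final class TmpWriter {
        private ByteBuffer buf = ByteBuffer.allocate(16);

        /** Number of times the scratch buffer was doubled. */
        int grows;

        /** Rewinds the buffer for reuse; capacity is kept. */
        void reset() { buf.clear(); }

        void writeInt(int v) {
            if (buf.remaining() < Integer.BYTES) {
                // Buffer-to-buffer copy, no intermediate byte[].
                ByteBuffer bigger = ByteBuffer.allocate(buf.capacity() * 2);
                buf.flip();
                bigger.put(buf);
                buf = bigger;
                grows++;
            }
            buf.putInt(v);
        }

        byte[] toBytes() { return Arrays.copyOf(buf.array(), buf.position()); }
    }

    /** Owns one lazily created TmpWriter; meant to be used by a single thread. */
    static final class Marshaller {
        TmpWriter tmpWriter;

        byte[] marshal(int entries) {
            if (tmpWriter == null)
                tmpWriter = new TmpWriter();
            else
                tmpWriter.reset();

            for (int i = 0; i < entries; i++)
                tmpWriter.writeInt(i);

            return tmpWriter.toBytes();
        }
    }
}
```

Marshalling the same input twice through one Marshaller should yield identical bytes, with all buffer growth happening during the first call.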

h3. Compatibility / Risk

Behavior-preserving, no protocol change; safe to backport.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
