Anton Vinogradov created IGNITE-28836:
-----------------------------------------
Summary: DirectMessageWriter: reduce per-field overhead and
per-message allocations on the message serialization hot path
Key: IGNITE-28836
URL: https://issues.apache.org/jira/browse/IGNITE-28836
Project: Ignite
Issue Type: Task
Reporter: Anton Vinogradov
Assignee: Anton Vinogradov
h3. Motivation
org.apache.ignite.internal.direct.DirectMessageWriter is on the critical path
of every outgoing message: generated serializers call it field by field, and
the NIO write loop re-enters it for every network buffer. Two inefficiencies
show up there:
# Per-field stream resolution. Each of the ~32 writeXxx methods starts with
{{DirectByteBufferStream stream = state.item().stream;}}, i.e. stack[pos]
(array load + bounds check) followed by a field load — re-evaluated on every
primitive write. The current stream only changes on setBuffer /
beforeNestedWrite / afterNestedWrite.
# Per-field allocations in the compressed path. writeCompressedMessage()
allocates, for every compressed field, a fresh ByteBuffer.allocateDirect(10KB)
plus a brand-new DirectMessageWriter (its own state stack + stream). The
scratch buffer is only ever copied into a heap byte[] by
CompressedMessage.compress() (via buf.get(...)) before deflating, so the direct
allocation (native alloc + zeroing + Cleaner/GC reclamation) is pure overhead.
Heavy exchange messages (GridDhtPartitionsSingleMessage / FullMessage) carry
several compressed maps each, multiplying the cost during PME.
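
To illustrate the second point, here is a minimal, self-contained sketch. It does not use Ignite's code: ScratchBufferSketch, compress() and fill() are hypothetical stand-ins, and the only behavior assumed is the one described above, namely that the scratch buffer is copied into a heap byte[] via buf.get(...) before deflating. Under that assumption a heap buffer yields the same compressed bytes as a direct one, so the direct allocation adds cost without changing the result.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.Deflater;

/** Hypothetical stand-in for the compress step; not Ignite's CompressedMessage. */
final class ScratchBufferSketch {
    /** Copies the readable bytes of buf into a heap array, then deflates them. */
    static byte[] compress(ByteBuffer buf) {
        byte[] raw = new byte[buf.remaining()];
        buf.get(raw); // The same copy happens for direct and heap buffers.

        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Writes 1000 ints (4000 bytes) of sample payload and flips the buffer for reading. */
    static ByteBuffer fill(ByteBuffer buf) {
        for (int i = 0; i < 1000; i++)
            buf.putInt(i % 7);
        buf.flip();
        return buf;
    }
}
```

Passing ByteBuffer.allocateDirect(10 * 1024) or ByteBuffer.allocate(10 * 1024) through fill() and compress() produces identical output; only the allocation and reclamation cost differs.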
h3. Proposed changes
* Cache the current state item's stream in a curStream field; refresh it only
in setBuffer, beforeNestedWrite, afterNestedWrite. All writeXxx methods use
curStream instead of re-resolving state.item().stream.
* In writeCompressedMessage():
** use ByteBuffer.allocate() (heap) for the scratch buffer instead of
allocateDirect();
** reuse a lazily-created, thread-confined tmpWriter (reset() before each use)
instead of allocating a new writer per field — mirroring how the main writer is
already reused across messages;
** grow the scratch buffer without the intermediate byte[] copy.
No wire-format change, no public API change.
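
The curStream idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual DirectMessageWriter: the Stream and Item classes, the fixed three-level stack, and the way nested levels share the parent's buffer are all hypothetical. Only the names curStream, setBuffer, beforeNestedWrite and afterNestedWrite come from the description above.

```java
import java.nio.ByteBuffer;

/** Hypothetical, simplified writer showing the cached-stream pattern. */
final class CachedStreamWriterSketch {
    /** Stand-in for DirectByteBufferStream. */
    static final class Stream {
        ByteBuffer buf;

        void writeInt(int v) {
            buf.putInt(v);
        }
    }

    /** Stand-in for a state stack item. */
    static final class Item {
        final Stream stream = new Stream();
    }

    private final Item[] stack = {new Item(), new Item(), new Item()};
    private int pos;

    /** Cached stream of the current item; refreshed only where it can change. */
    private Stream curStream = stack[0].stream;

    void setBuffer(ByteBuffer buf) {
        stack[pos].stream.buf = buf;
        curStream = stack[pos].stream;
    }

    void beforeNestedWrite() {
        ByteBuffer shared = curStream.buf;
        pos++;
        curStream = stack[pos].stream;
        curStream.buf = shared; // Assumption: nested levels write to the same buffer.
    }

    void afterNestedWrite() {
        pos--;
        curStream = stack[pos].stream;
    }

    /** Hot path: a single field load instead of stack[pos] plus a field load. */
    void writeInt(int v) {
        curStream.writeInt(v);
    }
}
```

The design choice is that the three refresh points are rare compared with primitive writes, so the array load and bounds check move off the per-field path.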
h3. Benchmark (JMH, JDK 17, throughput; A/B baseline vs patched)
|| Benchmark || Baseline || Patched || Delta ||
| hotPathPrimitiveFields (1792 write calls/op) | ~551K ops/s | ~682K ops/s | +24% |
| compressed scratch acquire (direct+new -> heap+reuse) | 1.21M ops/s | 3.68M ops/s | x3.0 |
| compressed scratch: GC time | 1006 ms | 130 ms | x8 less |
The compressed path trades a little cheap young-gen heap churn for the
elimination of off-heap / Cleaner direct-buffer churn, cutting total GC time
~8x.
h3. Testing
* A JMH benchmark JmhDirectMessageWriterBenchmark is added under
modules/benchmarks.
* Correctness verified by byte-for-byte writer->reader round-trips, identical
between baseline and patched:
** primitives/arrays/String/UUID, 5000 records, 32-byte write buffer (thousands
of setBuffer cycles);
** compressed map (4000 entries -> exercises scratch-buffer doubling; 16-byte
chunks; second marshal reusing the writer -> exercises the tmpWriter.reset()
branch).
* Existing DirectMarshallingMessagesTest covers the nested / compressed
serialization paths.
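
The scratch-buffer doubling case from the round-trip checks can be illustrated in isolation. The sketch below is hypothetical (ScratchGrowthSketch, grow() and writeInts() are illustration-only names, not code from the patch) and assumes a heap scratch buffer that is doubled with a buffer-to-buffer copy, without an intermediate byte[].

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of doubling a heap scratch buffer while preserving its contents. */
final class ScratchGrowthSketch {
    /** Returns a heap buffer of twice the capacity holding everything written to old so far. */
    static ByteBuffer grow(ByteBuffer old) {
        ByteBuffer bigger = ByteBuffer.allocate(old.capacity() * 2);
        old.flip();      // Switch old to read mode: [0, bytes written).
        bigger.put(old); // Direct buffer-to-buffer copy, no temporary byte[].
        return bigger;
    }

    /** Writes count ints, growing on demand, and returns the buffer flipped for reading. */
    static ByteBuffer writeInts(int count, int initialCapacity) {
        if (initialCapacity <= 0)
            throw new IllegalArgumentException("initialCapacity must be positive");

        ByteBuffer buf = ByteBuffer.allocate(initialCapacity);
        for (int i = 0; i < count; i++) {
            while (buf.remaining() < Integer.BYTES)
                buf = grow(buf);
            buf.putInt(i);
        }
        buf.flip();
        return buf;
    }
}
```

Starting from a 16-byte buffer and writing 4000 ints forces repeated doubling; reading the values back in order checks that no bytes were lost or reordered across growth steps.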
h3. Compatibility / Risk
Behavior-preserving, no protocol change; safe to backport.