chickenchickenlove commented on PR #22199:
URL: https://github.com/apache/kafka/pull/22199#issuecomment-4375889096

   @lianetm 
   Thanks for taking the time to take a look! 🙇‍♂️
   
   I don't think this necessarily means that the background thread is slow 
compared to the application thread.
   
   With low `max.poll.records`, if there are enough records already available 
in the fetch buffer, the application thread can reach 
`ConsumerUtils.getResult(inflightPoll.reconciliationCheckFuture(), timeoutMs)` 
very quickly after enqueueing the `AsyncPollEvent`.
   
   On the other hand, adding an `AsyncPollEvent` does not mean that the 
background thread processes it immediately. The background thread runs 
independently from the application thread, and it may be blocked in network 
poll or processing another part of its run loop before it gets to the 
application event queue. So even if processing the `AsyncPollEvent` itself is 
cheap, the application thread can still observe the reconciliation check future 
as incomplete when it reaches `collectFetch()`.
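   
   To make that ordering concrete, here is a minimal, self-contained sketch (all names are stand-ins for illustration, not the actual consumer code): the "application thread" enqueues an event and reaches the wait almost immediately, while the "background thread" is still busy elsewhere, so the future is observably incomplete at the moment the wait begins.

```java
import java.util.concurrent.*;

public class PollOrderingSketch {

    // Returns true if the future was still incomplete when the application
    // thread reached the wait, i.e. the app thread "won the race".
    static boolean runScenario() {
        try {
            // Stand-in for the reconciliation-check future carried by an AsyncPollEvent.
            CompletableFuture<Void> reconciliationCheckFuture = new CompletableFuture<>();
            BlockingQueue<CompletableFuture<Void>> eventQueue = new LinkedBlockingQueue<>();

            // "Background thread": busy (e.g. blocked in network poll) before it
            // drains the event queue and completes the future.
            Thread background = new Thread(() -> {
                try {
                    Thread.sleep(50); // simulates time spent elsewhere in the run loop
                    eventQueue.take().complete(null); // markReconciliationCheckComplete()
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            background.start();

            // "Application thread": enqueue, then reach the wait right away
            // (records are already in the fetch buffer, so nothing slows it down).
            eventQueue.put(reconciliationCheckFuture);
            boolean incompleteWhenReached = !reconciliationCheckFuture.isDone();
            reconciliationCheckFuture.get(1, TimeUnit.SECONDS); // parks until the background thread completes it
            background.join();
            return incompleteWhenReached;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("future incomplete when app thread reached the wait: " + runScenario());
    }
}
```

   Even though completing the future is cheap, each such round trip goes through `parkNanos` on the waiting side and `unpark` on the completing side, which matches what the profile shows.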
   
   The profiling results seem consistent with this interpretation:
   
   - In `before` and `optimized`, `ConsumerUtils.getResult` is not visible in 
the `collectFetch` path, and `parkNanos` / `unpark` are much lower.
   - In `after`, `ConsumerUtils.getResult`, `parkNanos`, `unpark`, 
`AsyncPollEvent`, and `markReconciliationCheckComplete` all increase.
   
   So my interpretation is that the extra CPU usage is not from reconciliation 
work itself being expensive, but from repeatedly going through the 
wait/future-completion path: the application thread parks while waiting for 
`reconciliationCheckFuture`, and the background thread completes that future 
and wakes it up.
   
   This patch does not reduce the number of `AsyncPollEvent`s: it still 
enqueues an `AsyncPollEvent` on every poll, so I would not expect it to 
significantly reduce the cost of the `ConsumerNetworkThread` run loop itself. 
This is also consistent with the profile, where `ConsumerNetworkThread.runOnce` 
stays at roughly the same level across the three runs:
   
   | Metric | Before | After | Optimized |
   |---|---:|---:|---:|
   | `ConsumerNetworkThread.runOnce` | 50.62% | 51.30% | 50.94% |
   
   The main difference is that this patch avoids waiting on 
`reconciliationCheckFuture` in steady state. In other words, `AsyncPollEvent` 
is still enqueued and processed, but its reconciliation-check completion is no 
longer on the application thread's critical path unless the member is actually 
in `RECONCILING` state.
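   
   A rough sketch of the shape of that change (names like `MemberState` and `waitedForReconciliation` are hypothetical; the actual patch may be structured differently): the wait becomes conditional, so in steady state the application thread never touches the future at all.

```java
import java.util.concurrent.*;

public class GuardedWaitSketch {

    // Hypothetical member states; the real consumer state machine has more.
    enum MemberState { STABLE, RECONCILING }

    // Returns true if the call actually blocked on the future, mirroring the
    // idea that the reconciliation check is only on the critical path while
    // the member is reconciling.
    static boolean waitedForReconciliation(MemberState state, CompletableFuture<Void> future) {
        if (state == MemberState.RECONCILING) {
            try {
                future.get(1, TimeUnit.SECONDS); // the potentially-parking wait
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            return true;
        }
        return false; // steady state: proceed straight to collectFetch()
    }

    public static void main(String[] args) {
        CompletableFuture<Void> done = CompletableFuture.completedFuture(null);
        System.out.println("STABLE blocks: " + waitedForReconciliation(MemberState.STABLE, done));
        System.out.println("RECONCILING blocks: " + waitedForReconciliation(MemberState.RECONCILING, done));
    }
}
```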
   
   As a further optimization, we could consider enqueueing `AsyncPollEvent` 
only when it is actually needed, but given the current profiling results, I 
believe the resulting performance improvement would be minimal. 🤔 
   
   Also, to confirm these assumptions, I can add temporary instrumentation around 
`AsyncPollEvent` latency. For each sampled `AsyncPollEvent`, we can record when 
it is enqueued by the application thread, when the application thread starts 
waiting on `reconciliationCheckFuture`, when the background thread starts 
processing it, and when `markReconciliationCheckComplete()` is called.
   
   This should show whether the application thread is reaching `collectFetch()` 
before the background thread has had a chance to process the `AsyncPollEvent`, 
or whether the delay is actually inside the background processing path.
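   
   The four timestamps above could be captured in a small per-event record, roughly like this (entirely hypothetical; none of these names exist in the Kafka codebase):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-event timing record for the proposed temporary instrumentation.
public class AsyncPollEventTiming {
    final AtomicLong enqueuedNs = new AtomicLong();          // app thread enqueues the AsyncPollEvent
    final AtomicLong waitStartedNs = new AtomicLong();       // app thread starts waiting on reconciliationCheckFuture
    final AtomicLong processingStartedNs = new AtomicLong(); // background thread picks the event up
    final AtomicLong checkCompletedNs = new AtomicLong();    // markReconciliationCheckComplete() is called

    /** How long the event sat in the queue before the background thread touched it. */
    long queueDelayNs() {
        return processingStartedNs.get() - enqueuedNs.get();
    }

    /** How long the background thread spent before completing the reconciliation check. */
    long processingNs() {
        return checkCompletedNs.get() - processingStartedNs.get();
    }

    /** True if the app thread reached the wait before background processing began. */
    boolean appThreadArrivedFirst() {
        return waitStartedNs.get() < processingStartedNs.get();
    }

    public static void main(String[] args) {
        AsyncPollEventTiming t = new AsyncPollEventTiming();
        t.enqueuedNs.set(100); t.waitStartedNs.set(110);
        t.processingStartedNs.set(150); t.checkCompletedNs.set(160);
        System.out.println("queueDelayNs=" + t.queueDelayNs()
                + " processingNs=" + t.processingNs()
                + " appThreadArrivedFirst=" + t.appThreadArrivedFirst());
    }
}
```

   A large `queueDelayNs` with `appThreadArrivedFirst == true` would support the interpretation above, while a large `processingNs` would instead point at the background processing path.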
   
   What do you think? 🙇‍♂️ 

