chickenchickenlove commented on PR #22199: URL: https://github.com/apache/kafka/pull/22199#issuecomment-4375889096
@lianetm Thanks for taking the time to look at this! 🙇♂️

I don't think this necessarily means that the background thread is slow compared to the application thread. With a low `max.poll.records`, if enough records are already available in the fetch buffer, the application thread can reach `ConsumerUtils.getResult(inflightPoll.reconciliationCheckFuture(), timeoutMs)` very quickly after enqueueing the `AsyncPollEvent`. On the other hand, enqueueing an `AsyncPollEvent` does not mean the background thread processes it immediately: the background thread runs independently of the application thread, and it may be blocked in network poll or busy with another part of its run loop before it gets to the application event queue. So even if processing the `AsyncPollEvent` itself is cheap, the application thread can still observe the reconciliation check future as incomplete when it reaches `collectFetch()`.

The profiling results seem consistent with this interpretation:

- In `before` and `optimized`, `ConsumerUtils.getResult` is not visible in the `collectFetch` path, and `parkNanos` / `unpark` are much lower.
- In `after`, `ConsumerUtils.getResult`, `parkNanos`, `unpark`, `AsyncPollEvent`, and `markReconciliationCheckComplete` all increase.

So my interpretation is that the extra CPU usage does not come from the reconciliation work itself being expensive, but from repeatedly going through the wait/future-completion path: the application thread parks while waiting on `reconciliationCheckFuture`, and the background thread completes that future and wakes it up.

This patch does not reduce the number of `AsyncPollEvent`s. It still enqueues `AsyncPollEvent` as before, so I would not expect it to significantly change the `ConsumerNetworkThread` run loop itself.
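To make the timing argument concrete, here is a minimal sketch (not actual Kafka code; class and method names are illustrative) of the pattern described above: the application thread blocks on a future that only the background thread can complete, so the observed wait is roughly however long the background thread was busy elsewhere in its run loop:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class PollWaitSketch {
    // Simulates the application thread enqueueing an event and then
    // immediately waiting on its reconciliation-check future, while the
    // background thread is still busy elsewhere in its run loop.
    public static long waitForCheck(long backgroundBusyMs) throws Exception {
        CompletableFuture<Void> reconciliationCheckFuture = new CompletableFuture<>();

        Thread backgroundThread = new Thread(() -> {
            try {
                // The background thread may be in network poll or another part
                // of its run loop before it reaches the event queue.
                Thread.sleep(backgroundBusyMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // Completing the future unparks the waiting application thread.
            reconciliationCheckFuture.complete(null);
        });
        backgroundThread.start();

        long start = System.nanoTime();
        // The application thread parks here (inside get) until the future completes.
        reconciliationCheckFuture.get(1, TimeUnit.SECONDS);
        backgroundThread.join();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        long waited = waitForCheck(50);
        System.out.println("application thread waited ~" + waited + " ms");
    }
}
```

Even though completing the future is cheap, the application thread's wait (and the associated `parkNanos` / `unpark` activity) scales with how long the background thread takes to get around to it.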
This is also consistent with the profile, where `ConsumerNetworkThread.runOnce` stays at roughly the same level across the three runs:

| Metric | Before | After | Optimized |
|---|---:|---:|---:|
| `ConsumerNetworkThread.runOnce` | 50.62% | 51.30% | 50.94% |

The main difference is that this patch avoids waiting on `reconciliationCheckFuture` in steady state. In other words, `AsyncPollEvent` is still enqueued and processed, but its reconciliation-check completion is no longer on the application thread's critical path unless the member is actually in the `RECONCILING` state.

We could consider a further optimization that only enqueues `AsyncPollEvent` when necessary, but given the current profiling results, I believe the resulting performance improvement would be minimal. 🤔

Also, to confirm these assumptions, I can add temporary instrumentation around `AsyncPollEvent` latency. For each sampled `AsyncPollEvent`, we can record when it is enqueued by the application thread, when the application thread starts waiting on `reconciliationCheckFuture`, when the background thread starts processing it, and when `markReconciliationCheckComplete()` is called. This should show whether the application thread reaches `collectFetch()` before the background thread has had a chance to process the `AsyncPollEvent`, or whether the delay is actually inside the background processing path.

What do you think? Please let me know your thoughts. 🙇♂️
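The instrumentation described above could be sketched as follows. This is a hypothetical helper, not an existing Kafka class; the four record methods mark the timestamps listed in the comment, and the two derived durations separate queue delay from background processing time:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-event timing holder for temporary instrumentation.
// Field and method names are illustrative, not existing Kafka identifiers.
public class AsyncPollEventTimings {
    private final AtomicLong enqueuedNs = new AtomicLong();
    private final AtomicLong waitStartedNs = new AtomicLong();
    private final AtomicLong processingStartedNs = new AtomicLong();
    private final AtomicLong checkCompletedNs = new AtomicLong();

    // Application thread: just before adding the AsyncPollEvent to the queue.
    public void recordEnqueued() { enqueuedNs.set(System.nanoTime()); }

    // Application thread: just before blocking on reconciliationCheckFuture.
    public void recordWaitStarted() { waitStartedNs.set(System.nanoTime()); }

    // Background thread: when it dequeues the event and starts processing it.
    public void recordProcessingStarted() { processingStartedNs.set(System.nanoTime()); }

    // Background thread: inside markReconciliationCheckComplete().
    public void recordCheckCompleted() { checkCompletedNs.set(System.nanoTime()); }

    // Time between enqueueing and the background thread picking the event up.
    // A large value means the application thread out-raced the background thread.
    public long queueDelayMicros() {
        return TimeUnit.NANOSECONDS.toMicros(processingStartedNs.get() - enqueuedNs.get());
    }

    // Time the background thread spent before completing the check.
    // A large value means the delay is inside the background processing path.
    public long processingMicros() {
        return TimeUnit.NANOSECONDS.toMicros(checkCompletedNs.get() - processingStartedNs.get());
    }
}
```

Comparing `queueDelayMicros()` against `processingMicros()` over a sample of events would tell us which of the two hypotheses dominates.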
