Yida Wu has posted comments on this change. ( http://gerrit.cloudera.org:8080/23760 )
Change subject: IMPALA-14612: Add global metrics for admission state map size ...................................................................... Patch Set 4: (2 comments) http://gerrit.cloudera.org:8080/#/c/23760/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/23760/3//COMMIT_MSG@16 PS3, Line 16: the removal l > If the metric page is refreshed hundreds of time per second, FunctionGauge I understand that manual updates at every call site make the code explicit. However, that creates a secondary state (the counter) that we must keep in sync with the real state (the map). If we miss one transition due to a bug, the metric suffers from drift that never self corrects. Since we are adding this metric specifically to catch leaks, we should not rely on the secondary state being bug free to detect bugs. Reading the map size directly (FunctionGauge) would ensure the metric always reflects the source of truth. Regarding performance, the complexity in the callback is specifically to optimize performance. It uses a cheap metrics_lock_ to read the cached value. This path is fast, and only hits a bit heavier map Count() once more than 2 (ADMISSION_MAP_SIZE_METRIC_UPDATE_INTERVAL_MS) seconds. It uses a way like eventually consistency for better performance. I ran a stress test with python concurrent clients calling get_metric_value(), and at a workload of 100 ops/sec level, it showed no performance impact. If this metric later proves to be a performance hotspot, we can revisit the old approach. For now, using the source of truth is safer, and the performance impact looks minimal. http://gerrit.cloudera.org:8080/#/c/23760/1/be/src/scheduling/admission-control-service.cc File be/src/scheduling/admission-control-service.cc: http://gerrit.cloudera.org:8080/#/c/23760/1/be/src/scheduling/admission-control-service.cc@318 PS1, Line 318: AdmissionHeartbeatResponsePB* resp, kudu::rpc::RpcContext* rpc_context) { > I understand the concern about metric value getting out of sync with map si Some context is that admissiond is generally not a high frequency place, typical max queue sizes and max concurrent queries are hundreds, not millions. At this scale, sub-second, real-time accuracy is not critical. Since the main purpose of this metric is to detect long term leaks, not short term fluctuations. So prioritizing reliability over real-time precision is acceptable. FunctionGauge is pull-based. If the system is quiet, zero work is done. Even when the monitoring system pulls the metric periodically, the cost of Count() is negligible. The GenericShardedQueryMap uses shard spinlocks, so the lock is held for only nanoseconds. It won't impact normal operations like Get() in admissiond -- To view, visit http://gerrit.cloudera.org:8080/23760 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie803aabf8d91b6381c5d0d7534cd9c9fc2166a73 Gerrit-Change-Number: 23760 Gerrit-PatchSet: 4 Gerrit-Owner: Yida Wu <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Jason Fehr <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Reviewer: Yida Wu <[email protected]> Gerrit-Comment-Date: Tue, 16 Dec 2025 14:19:51 +0000 Gerrit-HasComments: Yes
