[Impala-ASF-CR] IMPALA-14612: Add global metrics for admission state map size

Yida Wu (Code Review) Tue, 16 Dec 2025 06:20:11 -0800

Yida Wu has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23760 )

Change subject: IMPALA-14612: Add global metrics for admission state map size
......................................................................

Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/23760/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/23760/3//COMMIT_MSG@16
PS3, Line 16: the removal l
> If the metric page is refreshed hundreds of time per second, FunctionGauge
I understand that manual updates at every call site make the code explicit.
However, that creates a secondary state (the counter) that we must keep in sync
with the real state (the map). If we miss one transition due to a bug, the
metric suffers from drift that never self corrects. Since we are adding this
metric specifically to catch leaks, we should not rely on the secondary state
being bug free to detect bugs. Reading the map size directly (FunctionGauge)
would ensure the metric always reflects the source of truth.

Regarding performance, the complexity in the callback is specifically to
optimize performance. It uses a cheap metrics_lock_ to read the cached value.
This path is fast, and only hits a bit heavier map Count() once more than 2
(ADMISSION_MAP_SIZE_METRIC_UPDATE_INTERVAL_MS) seconds. It uses a way like
eventually consistency for better performance. I ran a stress test with python
concurrent clients calling get_metric_value(), and at a workload of 100 ops/sec
level, it showed no performance impact.
If this metric later proves to be a performance hotspot, we can revisit the old
approach. For now, using the source of truth is safer, and the performance
impact looks minimal.

http://gerrit.cloudera.org:8080/#/c/23760/1/be/src/scheduling/admission-control-service.cc
File be/src/scheduling/admission-control-service.cc:

http://gerrit.cloudera.org:8080/#/c/23760/1/be/src/scheduling/admission-control-service.cc@318
PS1, Line 318: AdmissionHeartbeatResponsePB* resp, kudu::rpc::RpcContext*
rpc_context) {
> I understand the concern about metric value getting out of sync with map si
Some context is that admissiond is generally not a high frequency place,
typical max queue sizes and max concurrent queries are hundreds, not millions.
At this scale, sub-second, real-time accuracy is not critical.
Since the main purpose of this metric is to detect long term leaks, not short
term fluctuations. So prioritizing reliability over real-time precision is
acceptable.
FunctionGauge is pull-based. If the system is quiet, zero work is done. Even
when the monitoring system pulls the metric periodically, the cost of Count()
is negligible. The GenericShardedQueryMap uses shard spinlocks, so the lock is
held for only nanoseconds. It won't impact normal operations like Get() in
admissiond

--
To view, visit http://gerrit.cloudera.org:8080/23760
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie803aabf8d91b6381c5d0d7534cd9c9fc2166a73
Gerrit-Change-Number: 23760
Gerrit-PatchSet: 4
Gerrit-Owner: Yida Wu <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Jason Fehr <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
Gerrit-Comment-Date: Tue, 16 Dec 2025 14:19:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-14612: Add global metrics for admission state map size

Reply via email to