Yida Wu has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23760 )

Change subject: IMPALA-14612: Add global metrics for admission state map size
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/23760/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/23760/3//COMMIT_MSG@16
PS3, Line 16: the removal l
> If the metric page is refreshed hundreds of time per second, FunctionGauge
I understand that manual updates at every call site make the code explicit. 
However, that creates a secondary state (the counter) that we must keep in sync 
with the real state (the map). If we miss one transition due to a bug, the 
metric suffers from drift that never self corrects. Since we are adding this 
metric specifically to catch leaks, we should not rely on the secondary state 
being bug free to detect bugs. Reading the map size directly (FunctionGauge) 
would ensure the metric always reflects the source of truth.

Regarding performance, the complexity in the callback is specifically to 
optimize performance. It uses a cheap metrics_lock_ to read the cached value. 
This path is fast, and only hits a bit heavier map Count() once more than 2 
(ADMISSION_MAP_SIZE_METRIC_UPDATE_INTERVAL_MS) seconds. It uses a way like 
eventually consistency for better performance. I ran a stress test with python 
concurrent clients calling get_metric_value(), and at a workload of 100 ops/sec 
level, it showed no performance impact.
If this metric later proves to be a performance hotspot, we can revisit the old 
approach. For now, using the source of truth is safer, and the performance 
impact looks minimal.


http://gerrit.cloudera.org:8080/#/c/23760/1/be/src/scheduling/admission-control-service.cc
File be/src/scheduling/admission-control-service.cc:

http://gerrit.cloudera.org:8080/#/c/23760/1/be/src/scheduling/admission-control-service.cc@318
PS1, Line 318:     AdmissionHeartbeatResponsePB* resp, kudu::rpc::RpcContext* 
rpc_context) {
> I understand the concern about metric value getting out of sync with map si
Some context is that admissiond is generally not a high frequency place, 
typical max queue sizes and max concurrent queries are hundreds, not millions. 
At this scale, sub-second, real-time accuracy is not critical.
Since the main purpose of this metric is to detect long term leaks, not short 
term fluctuations. So prioritizing reliability over real-time precision is 
acceptable.
FunctionGauge is pull-based. If the system is quiet, zero work is done. Even 
when the monitoring system pulls the metric periodically, the cost of Count() 
is negligible. The GenericShardedQueryMap uses shard spinlocks, so the lock is 
held for only nanoseconds. It won't impact normal operations like Get() in 
admissiond



--
To view, visit http://gerrit.cloudera.org:8080/23760
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie803aabf8d91b6381c5d0d7534cd9c9fc2166a73
Gerrit-Change-Number: 23760
Gerrit-PatchSet: 4
Gerrit-Owner: Yida Wu <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Jason Fehr <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
Gerrit-Comment-Date: Tue, 16 Dec 2025 14:19:51 +0000
Gerrit-HasComments: Yes

Reply via email to