Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22926 )

Change subject: IMPALA-14574: Lower memory estimate by analyzing 
PipelineMembership
......................................................................


Patch Set 15:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/22926/14//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22926/14//COMMIT_MSG@92
PS14, Line 92:   scale 10. No major regression is observed.
> Were there any improvements to note? I guess perf tests run one at a time,
I thought no spill happened before or after the patch. But I checked again just
now and found that queries 28 and 78 do spill after the patch.

Query 28 reduces Cluster Memory Admitted (across all executors) from 98.04GB to
34.04GB, and its latency still improves by 6.06% (1575ms faster).
Query 78 reduces Cluster Memory Admitted from 468.81GB to 289.46GB, but its
latency regresses by 8.75% (12261ms slower).

None of the other queries spill after this patch.

I have not run concurrency tests. However, if enough concurrent queries spill
to disk, we risk running out of scratch space. TOTAL estimation mode should be
more resilient against scratch space exhaustion, since its higher estimates
cause admission control to stop admitting new queries earlier.


http://gerrit.cloudera.org:8080/#/c/22926/14/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/22926/14/be/src/service/query-options.cc@1437
PS14, Line 1437:         RETURN_IF_ERROR(GetThriftEnum(value, "Memory estimate 
mode",
> nit: it might be better to stick with "Memory estimate mode" here. Consiste
Done


http://gerrit.cloudera.org:8080/#/c/22926/14/fe/src/main/java/org/apache/impala/planner/PipelineBasedMemEstimator.java
File fe/src/main/java/org/apache/impala/planner/PipelineBasedMemEstimator.java:

http://gerrit.cloudera.org:8080/#/c/22926/14/fe/src/main/java/org/apache/impala/planner/PipelineBasedMemEstimator.java@54
PS14, Line 54:     totalMaxAdjacentOf_ = initializedMemMap(allFragments);
> typo: getPerInstanceMinMemOverestimation?
Done


http://gerrit.cloudera.org:8080/#/c/22926/14/fe/src/main/java/org/apache/impala/planner/PipelineBasedMemEstimator.java@60
PS14, Line 60:
> Why does the order of allFragments matter?
The map value, a PipeMem object, memorizes the originating fragment where the
pipeline starts. This is why we traverse bottom-up.
Added a comment here.
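
To make the dependency on ordering concrete, here is a minimal self-contained
sketch using stand-in types rather than the real planner classes (the actual
code works on PlanFragment / PipelineMembership / PipeMem):

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Sketch only: stand-in types, not the actual planner classes. Each entry in
  // fragmentsBottomUp pairs a fragment name with the pipeline ids active in it.
  class PipelineOriginSketch {
    static Map<Integer, String> originFragmentOf(
        List<Map.Entry<String, List<Integer>>> fragmentsBottomUp) {
      Map<Integer, String> origin = new HashMap<>();
      for (Map.Entry<String, List<Integer>> frag : fragmentsBottomUp) {
        for (int pipelineId : frag.getValue()) {
          // putIfAbsent keeps the first (bottom-most) fragment that sees the
          // pipeline, which is why the traversal order must be bottom-up.
          origin.putIfAbsent(pipelineId, frag.getKey());
        }
      }
      return origin;
    }
  }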


http://gerrit.cloudera.org:8080/#/c/22926/14/fe/src/main/java/org/apache/impala/planner/PipelineBasedMemEstimator.java@104
PS14, Line 104:    * 2). Sum of all maximum memory growth estimate rooted at 
OPEN pipelines adjacent to X.
> I'd add trace logging before the loop that indicates something about the pl
Added it in the constructor.


http://gerrit.cloudera.org:8080/#/c/22926/14/fe/src/main/java/org/apache/impala/planner/PipelineBasedMemEstimator.java@167
PS14, Line 167:             openPipelines.add(pm);
> Are GETNEXT and OPEN really the only two options here? What guarantees that
The PipelineMembership class is always initialized with either the GETNEXT or
the OPEN phase.
Added a Precondition in the PipelineMembership constructor to ensure this.
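
For reference, the guard is along these lines (a sketch; treat the exact
constructor signature and field names as assumptions about the patch, though
Guava's Preconditions.checkArgument is the real utility):

  import com.google.common.base.Preconditions;
  import org.apache.impala.thrift.TExecNodePhase;

  // Sketch of the constructor guard: a pipeline membership only makes sense in
  // the OPEN or GETNEXT phase, so anything else is rejected up front.
  public PipelineMembership(PlanNodeId id, int index, TExecNodePhase phase) {
    Preconditions.checkArgument(
        phase == TExecNodePhase.GETNEXT || phase == TExecNodePhase.OPEN,
        "Unexpected pipeline phase: %s", phase);
    id_ = id;
    index_ = index;
    phase_ = phase;
  }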


http://gerrit.cloudera.org:8080/#/c/22926/14/fe/src/main/java/org/apache/impala/planner/PipelineBasedMemEstimator.java@219
PS14, Line 219:             PlanNodeId openPipeId = openPipe.getId();
> This is pretty deeply nested. Consider breaking some of this function into
Done


http://gerrit.cloudera.org:8080/#/c/22926/14/fe/src/main/java/org/apache/impala/planner/PlanFragment.java
File fe/src/main/java/org/apache/impala/planner/PlanFragment.java:

http://gerrit.cloudera.org:8080/#/c/22926/14/fe/src/main/java/org/apache/impala/planner/PlanFragment.java@470
PS14, Line 470:     // There are several PlanNodes that do not reserve initial 
memory (0), but set
> We could also fix this by ensuring every node sets an initial memory reserv
Yes. It is done this way to avoid underestimation in the memory overlapping
loop.
Some PlanNodes / DataSinks do require initial memory just to start, such as the
4MB buffer in PlanRootSink that enables result spooling. Other PlanNodes do not
really consume memory right away when they start and thus do not declare any
initial memory.
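
To illustrate the distinction (a sketch from memory of how nodes declare their
resources via ResourceProfileBuilder; take the exact calls and values as
assumptions rather than quotes from the code):

  // Sketch: a sink that needs memory just to start declares a non-zero minimum
  // reservation (e.g. PlanRootSink's result spooling buffer), while a node
  // whose memory only grows once rows start flowing leaves the reservation at
  // 0 and declares an estimate only.
  ResourceProfile needsMemoryToStart = new ResourceProfileBuilder()
      .setMemEstimateBytes(memEstimateBytes)
      .setMinMemReservationBytes(4L * 1024 * 1024)  // e.g. 4MB spooling buffer
      .build();
  ResourceProfile growsOnlyWhileRunning = new ResourceProfileBuilder()
      .setMemEstimateBytes(memEstimateBytes)  // min reservation stays 0
      .build();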

Specifically for ExchangeNode, there are open Jiras that describe its current
memory estimation issues:
https://issues.apache.org/jira/browse/IMPALA-12594
https://issues.apache.org/jira/browse/IMPALA-11510


http://gerrit.cloudera.org:8080/#/c/22926/14/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

http://gerrit.cloudera.org:8080/#/c/22926/14/fe/src/main/java/org/apache/impala/planner/Planner.java@714
PS14, Line 714:       long reduction =
> What does this look like in-context? It might be nice to provide a human-re
I changed this log line slightly. It will show up like this:

I20260105 22:23:44.934326 2029861 Planner.java:716] Using MAX_PIPELINE 
estimate: maxPerHostEstimateBytes=43.31GB, originalMemEstimateBytes=68.61GB, 
minMemReservationBytes=6.10GB, reduction=25.30GB


http://gerrit.cloudera.org:8080/#/c/22926/14/fe/src/main/java/org/apache/impala/planner/Planner.java@1097
PS14, Line 1097:
> We provide a warning about incomplete stats elsewhere. I'm surprised this h
That code exists in Frontend.java:
https://github.com/apache/impala/blob/e2b0a027b7a9c5ccb127ca318f273ea1f18f127c/fe/src/main/java/org/apache/impala/service/Frontend.java#L2014-L2015
It runs in a later phase of query compilation.

The check here is invoked earlier so that we avoid using pipeline-based memory
estimation when accurate estimation cannot be done due to missing stats.
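
Conceptually it is just an early fallback; a minimal sketch with made-up names
(effectiveMode / tablesMissingStats are illustrations, only the TOTAL and
MAX_PIPELINE mode names come from the patch):

  // Sketch only: decide during planning whether the pipeline-based estimate
  // can be trusted.
  class MemEstimateFallbackSketch {
    enum MemEstimateMode { TOTAL, MAX_PIPELINE }

    static MemEstimateMode effectiveMode(
        MemEstimateMode requested, boolean tablesMissingStats) {
      // Missing table stats make per-node estimates unreliable, so fall back
      // to the conservative TOTAL behavior instead of applying MAX_PIPELINE.
      return tablesMissingStats ? MemEstimateMode.TOTAL : requested;
    }
  }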


http://gerrit.cloudera.org:8080/#/c/22926/14/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q23a.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q23a.test:

http://gerrit.cloudera.org:8080/#/c/22926/14/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q23a.test@513
PS14, Line 513: Per-Host Resource Estimates: Memory=41.59GB
> This is a substantial change. Nice sign that it could be beneficial in some
The max Per Node Peak Memory Usage before and after the patch is around 24 GB.
These are the actual profiles from a 10-node, MT_DOP=12, 3TB setup:

Baseline (before patch)
Max Per-Host Resource Reservation: Memory=8.99GB Threads=279
Per-Host Resource Estimates: Memory=120.90GB

    Per Node Peak Memory Usage:
      coordinator-0.coordinator-int.impala-1764109762-62np.svc.cluster.local:27000(2.01 MB)
      impala-executor-000-5.impala-executor.impala-1764109762-62np.svc.cluster.local:27010(22.53 GB)
      impala-executor-000-3.impala-executor.impala-1764109762-62np.svc.cluster.local:27010(23.31 GB)
      impala-executor-000-8.impala-executor.impala-1764109762-62np.svc.cluster.local:27010(23.56 GB)
      impala-executor-000-4.impala-executor.impala-1764109762-62np.svc.cluster.local:27010(23.95 GB)
      impala-executor-000-7.impala-executor.impala-1764109762-62np.svc.cluster.local:27010(22.74 GB)
      impala-executor-000-0.impala-executor.impala-1764109762-62np.svc.cluster.local:27010(23.38 GB)
      impala-executor-000-9.impala-executor.impala-1764109762-62np.svc.cluster.local:27010(23.26 GB)
      impala-executor-000-6.impala-executor.impala-1764109762-62np.svc.cluster.local:27010(23.15 GB)
      impala-executor-000-2.impala-executor.impala-1764109762-62np.svc.cluster.local:27010(23.49 GB)
      impala-executor-000-1.impala-executor.impala-1764109762-62np.svc.cluster.local:27010(23.77 GB)


After patch:
Max Per-Host Resource Reservation: Memory=8.99GB Threads=279
Per-Host Resource Estimates: Memory=43.66GB

    Per Node Peak Memory Usage:
      coordinator-0.coordinator-int.impala-1761163328-fw8c.svc.cluster.local:27000(4.01 MB)
      impala-executor-000-3.impala-executor.impala-1761163328-fw8c.svc.cluster.local:27010(23.72 GB)
      impala-executor-000-1.impala-executor.impala-1761163328-fw8c.svc.cluster.local:27010(23.41 GB)
      impala-executor-000-4.impala-executor.impala-1761163328-fw8c.svc.cluster.local:27010(23.53 GB)
      impala-executor-000-8.impala-executor.impala-1761163328-fw8c.svc.cluster.local:27010(23.40 GB)
      impala-executor-000-0.impala-executor.impala-1761163328-fw8c.svc.cluster.local:27010(23.65 GB)
      impala-executor-000-9.impala-executor.impala-1761163328-fw8c.svc.cluster.local:27010(22.79 GB)
      impala-executor-000-5.impala-executor.impala-1761163328-fw8c.svc.cluster.local:27010(23.54 GB)
      impala-executor-000-6.impala-executor.impala-1761163328-fw8c.svc.cluster.local:27010(23.68 GB)
      impala-executor-000-7.impala-executor.impala-1761163328-fw8c.svc.cluster.local:27010(22.66 GB)
      impala-executor-000-2.impala-executor.impala-1761163328-fw8c.svc.cluster.local:27010(24.11 GB)



--
To view, visit http://gerrit.cloudera.org:8080/22926
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iba4f40993d8360bfdc1eb592a626e285dfed1627
Gerrit-Change-Number: 22926
Gerrit-PatchSet: 15
Gerrit-Owner: Riza Suminto <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Comment-Date: Tue, 06 Jan 2026 06:52:56 +0000
Gerrit-HasComments: Yes
