[jira] [Commented] (IMPALA-13644) Generalize and move getPerInstanceNdvForCpuCosting into AggregationNode.

ASF subversion and git services (Jira) Sat, 18 Jan 2025 11:45:05 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17914370#comment-17914370
 ]


ASF subversion and git services commented on IMPALA-13644:
----------------------------------------------------------

Commit c298c542621cb58ffe0772bf29ebdf7316cb77d1 in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=c298c5426 ]

IMPALA-13644: Generalize and move getPerInstanceNdvForCpuCosting

getPerInstanceNdvForCpuCosting is a method that estimates the number of
distinct values of exprs per fragment instance while accounting for the
likelihood of duplicate keys across fragment instances. It borrows the
probabilistic model described in IMPALA-2945. This method is used
exclusively by AggregationNode.

getPerInstanceNdvForCpuCosting runs the probabilistic formula
individually for each grouping expression and then multiplies the
results together. That matches how we estimated group NDV in the past,
where we simply multiplied the NDVs of the grouping expressions.
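
The per-expression approach described above can be sketched as follows.
This is an illustrative Python sketch, not the Java code from the patch.
It assumes the probabilistic model is the classic expected number of
distinct values seen when drawing n rows uniformly from d possible
values, d * (1 - (1 - 1/d)^n); the exact formula in IMPALA-2945 may
differ. The names expected_distinct and per_instance_ndv_product are
hypothetical.

```python
def expected_distinct(ndv: float, rows: float) -> float:
    """Expected distinct values seen after drawing `rows` values
    uniformly at random from `ndv` possible values (assumed model)."""
    if ndv <= 0 or rows <= 0:
        return 0.0
    return ndv * (1.0 - (1.0 - 1.0 / ndv) ** rows)


def per_instance_ndv_product(expr_ndvs, input_cardinality, num_instances):
    """Old-style estimate: apply the formula to each grouping
    expression separately, then multiply the results together."""
    rows_per_instance = input_cardinality / num_instances
    result = 1.0
    for ndv in expr_ndvs:
        result *= expected_distinct(ndv, rows_per_instance)
    return result


# Two grouping expressions with NDV 10 each, 1000 input rows, 10 instances.
estimate = per_instance_ndv_product([10, 10], 1000, 10)
print(estimate)
```

With 100 rows per instance, each expression is expected to show nearly
all 10 of its values, so the product lands just under 10 * 10 = 100.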

Recently, we added tuple-based analysis to lower the cardinality
estimate for all kinds of aggregation nodes (IMPALA-13045, IMPALA-13465,
IMPALA-13086). All of the bounding happens in
AggregationNode.computeStats(), where we call the estimateNumGroups()
function, which returns the globalNdv estimate for a specific
aggregation class.

To take advantage of that more precise globalNdv, this patch replaces
getPerInstanceNdvForCpuCosting() with estimatePreaggCardinality(), which
applies the probabilistic formula to this single globalNdv number. The
old approach often returned an overestimate because it multiplied
per-expression NDVs. The new method is still used only to calculate
ProcessingCost. Using it for preagg output cardinality will be done by
IMPALA-2945.
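
To illustrate why a single bounded globalNdv helps, the Python sketch
below compares the two approaches. It is not the body of
estimatePreaggCardinality(). It assumes the probabilistic model is
d * (1 - (1 - 1/d)^n) for d distinct values and n rows per instance,
which may differ from the formula in IMPALA-2945. All function names
and input numbers are made up for the example.

```python
def expected_distinct(ndv: float, rows: float) -> float:
    """Expected distinct values seen after drawing `rows` values
    uniformly at random from `ndv` possible values (assumed model)."""
    if ndv <= 0 or rows <= 0:
        return 0.0
    return ndv * (1.0 - (1.0 - 1.0 / ndv) ** rows)


def product_estimate(expr_ndvs, rows_per_instance):
    """Old way: formula per grouping expression, then multiply."""
    result = 1.0
    for ndv in expr_ndvs:
        result *= expected_distinct(ndv, rows_per_instance)
    return result


def single_ndv_estimate(global_ndv, rows_per_instance):
    """New way: formula applied once to an already-bounded global NDV."""
    return expected_distinct(global_ndv, rows_per_instance)


rows_per_instance = 40000 / 4   # hypothetical: 40000 rows over 4 instances
expr_ndvs = [1000, 1000]        # hypothetical per-expression NDVs
global_ndv = 5000               # hypothetical bound from computeStats()

old = product_estimate(expr_ndvs, rows_per_instance)
new = single_ndv_estimate(global_ndv, rows_per_instance)
print(old, new)
```

In this example the product estimate is close to 1000 * 1000, far more
than the 10000 rows an instance actually receives, while the single-NDV
estimate stays below both the row count and the global NDV.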

estimatePreaggCardinality is skipped if the data partition of the input
is a subset of the grouping expressions.
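
The reasoning behind that skip can be checked with a toy simulation:
when rows are partitioned on a subset of the grouping columns, every
row of a given group lands in the same instance, so no grouping key is
duplicated across instances. The Python sketch below is only an
illustration. The helper names are hypothetical, and sum(key) modulo
the instance count stands in for a real hash partitioner.

```python
from collections import defaultdict


def partition_rows(rows, partition_cols, num_instances):
    """Assign each row to an instance from its partition columns.
    sum(key) % num_instances is a toy stand-in for a hash function."""
    instances = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in partition_cols)
        instances[sum(key) % num_instances].append(row)
    return instances


def keys_shared_across_instances(instances, grouping_cols):
    """Return the grouping keys that appear in more than one instance."""
    seen = defaultdict(set)
    for inst_id, inst_rows in instances.items():
        for row in inst_rows:
            seen[tuple(row[c] for c in grouping_cols)].add(inst_id)
    return {key for key, ids in seen.items() if len(ids) > 1}


rows = [{"a": i % 7, "b": i % 5, "c": i} for i in range(1000)]
grouping = ["a", "b"]

# Partitioned on "a", a subset of the grouping columns.
by_a = partition_rows(rows, ["a"], 4)
# Partitioned on "c", which is not a grouping column.
by_c = partition_rows(rows, ["c"], 4)

print(len(keys_shared_across_instances(by_a, grouping)))
print(len(keys_shared_across_instances(by_c, grouping)))
```

Partitioning on "a" leaves no grouping key shared between instances,
whereas partitioning on "c" spreads the same (a, b) key over several
instances, which is the duplication the probabilistic formula is meant
to account for.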

Testing:
- Run and pass the PlannerTest cases that set
  COMPUTE_PROCESSING_COST=True. ProcessingCost changes, but all
  cardinality numbers stay the same.
- Add CardinalityTest#testEstimatePreaggCardinality.
- Update test_executor_groups.py. Enable v2 profile as well for easier
  runtime profile debugging.

Change-Id: Iddf75833981558fe0188ea7475b8d996d66983c1
Reviewed-on: http://gerrit.cloudera.org:8080/22320
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Generalize and move getPerInstanceNdvForCpuCosting into AggregationNode.
> ------------------------------------------------------------------------
>
>                 Key: IMPALA-13644
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13644
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 4.4.0
>            Reporter: Riza Suminto
>            Assignee: Riza Suminto
>            Priority: Major
>             Fix For: Impala 4.5.0
>
>
> getPerInstanceNdvForCpuCosting is a method that estimates the number of distinct 
> values of exprs per fragment instance while accounting for the likelihood of 
> duplicate keys across fragment instances. It borrows the probabilistic model 
> from the formula described in IMPALA-2945. This method is used exclusively by 
> AggregationNode.
> [https://github.com/apache/impala/blob/99529db6ad62ddc34cbfd924d7e41b1fce5b60a2/fe/src/main/java/org/apache/impala/planner/PlanFragment.java#L634-L642]
>  
> We should move this method to AggregationNode and generalize it to accept the 
> NDV estimate calculated in AggregationNode.computeStats() as input. The number 
> from computeStats should be more precise now, after the improvements from 
> IMPALA-13405, IMPALA-13526, and IMPALA-13622.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

