[jira] [Commented] (IMPALA-12657) Improve ProcessingCost of ScanNode and NonGroupingAggregator

ASF subversion and git services (Jira) Sat, 18 Jan 2025 11:45:12 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-12657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17914376#comment-17914376
 ]


ASF subversion and git services commented on IMPALA-12657:
----------------------------------------------------------

Commit 3118e41c26f730a06d42994e678cab694c787649 in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=3118e41c2 ]

IMPALA-2945: Account for duplicate keys on multiple nodes preAgg

AggregationNode.computeStats() estimate cardinality under single node
assumption. This can be an underestimation in preaggregation node case
because same grouping key may exist in multiple nodes during
preaggreation.

This patch adjust the cardinality estimate using following model for the
number of distinct values in a random sample of k rows, previously used
to calculate ProcessingCost model by IMPALA-12657 and IMPALA-13644.

Assumes we are picking k rows from an infinite sample with ndv distinct
values, with the value uniformly distributed. The probability of a given
value not appearing in a sample, in that case is

((NDV - 1) / NDV) ^ k

This is because we are making k choices, and each of them has
(ndv - 1) / ndv chance of not being our value. Therefore the
probability of a given value appearing in the sample is:

1 - ((NDV - 1) / NDV) ^ k

And the number of distinct values in the sample is:

(1 - ((NDV - 1) / NDV) ^ k) * NDV

Query option ESTIMATE_DUPLICATE_IN_PREAGG is added to control whether to
use the new estimation logic or not.

Testing:
- Pass core tests.

Change-Id: I04c563e59421928875b340cb91654b9d4bc80b55
Reviewed-on: http://gerrit.cloudera.org:8080/22047
Reviewed-by: Riza Suminto <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Improve ProcessingCost of ScanNode and NonGroupingAggregator
> ------------------------------------------------------------
>
>                 Key: IMPALA-12657
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12657
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 4.3.0
>            Reporter: Riza Suminto
>            Assignee: David Rorke
>            Priority: Major
>             Fix For: Impala 4.4.0
>
>         Attachments: profile_1f4d7a679a3e12d5_4223115700000000.txt
>
>
> Several benchmark run measuring Impala scan performance indicates some 
> costing improvement opportunity around ScanNode and NonGroupingAggregator.
> [^profile_1f4d7a679a3e12d5_4223115700000000.txt] shows an example of simple 
> count query.
> Key takeaway:
>  # There is a strong correlation between total materialized bytes (row-size * 
> cardinality) with total materialized tuple time per fragment. Row 
> materialization cost should be adjusted to be based on this row-sized instead 
> of equal cost per scan range.
>  # NonGroupingAggregator should have much lower cost that GroupingAggregator. 
> In example above, the cost of NonGroupingAggregator dominates the scan 
> fragment even though it only does simple counting instead of hash table 
> operation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (IMPALA-12657) Improve ProcessingCost of ScanNode and NonGroupingAggregator

Reply via email to