[
https://issues.apache.org/jira/browse/IMPALA-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18046716#comment-18046716
]
ASF subversion and git services commented on IMPALA-13756:
----------------------------------------------------------
Commit 85d77b908b12ae3d3f48ed5d49f38fb3832edc4e in impala's branch
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=85d77b908 ]
IMPALA-13756: Fix Iceberg V2 count(*) optimization for complex queries
We optimize plain count(*) queries on Iceberg tables the following way:
AGGREGATE
COUNT(*)
|
UNION ALL
/ \
/ \
/ \
SCAN all ANTI JOIN
datafiles / \
without / \
deletes SCAN SCAN
datafiles deletes
||
rewrite
||
\/
ArithmethicExpr: LHS + RHS
/ \
/ \
/ \
record_count AGGREGATE
of all COUNT(*)
datafiles |
without ANTI JOIN
deletes / \
/ \
SCAN SCAN
datafiles deletes
This optimization consists of two parts:
1 Rewriting count(*) expression to count(*) + "record_count" (of data
files without deletes)
2 In IcebergScanPlanner we only need to consruct the right side of
the original UNION ALL operator, i.e.:
ANTI JOIN
/ \
/ \
SCAN SCAN
datafiles deletes
SelectStmt decides whether we can do the count(*) optimization, and if
so, does the following:
1: SelectStmt sets 'TotalRecordsNumV2' in the analyzer, then during the
expression rewrite phase the CountStarToConstRule rewrites the
count(*) to count(*) + record_count
2: SelectStmt sets "OptimizeCountStarForIcebergV2" in the query context
then IcebergScanPlanner creates plan accordingly.
This mechanism works for simple queries, but can turn on count(*)
optimization in IcebergScanPlanner for all Iceberg V2 tables in complex
queries. Even if only one subquery enables count(*) optimization during
analysis.
With this patch the followings change:
1: We introduce IcebergV2CountStarAccumulator which we use instead of
the ArithmethicExpr. So after rewrite we still know if count(*)
optimization should be enabled for the planner.
2: Instead of using the query context, we pass the information to the
IcebergScanPlanner via the TableRef object.
Testing
* e2e tests
Change-Id: I1940031298eb634aa82c3d32bbbf16bce8eaf874
Reviewed-on: http://gerrit.cloudera.org:8080/23705
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Zoltan Borok-Nagy <[email protected]>
> Impala query returning wrong results
> ------------------------------------
>
> Key: IMPALA-13756
> URL: https://issues.apache.org/jira/browse/IMPALA-13756
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 5.0.0
> Reporter: vamshi kolanu
> Assignee: Zoltán Borók-Nagy
> Priority: Blocker
> Labels: correctness
> Attachments: repro.sql
>
>
> How to reproduce this issue.
> {code:java}
> create table merge_exclude_columns
> STORED BY ICEBERG
> as
> select CAST(1 AS INT) as id, 'hello' as msg, 'blue' as color
> union all
> select CAST(2 AS INT) as id, 'goodbye' as msg, 'green' as color
> union all
> select CAST(3 AS INT) as id, 'anyway' as msg, 'purple' as color;{code}
>
> {code:java}
> create table expected_merge_exclude_columns as select * from
> merge_exclude_columns; {code}
> This query is supposed to find the row count difference and records
> difference between both the tables. Since both the tables are same, we expect
> row_count_difference and num_mismatched to be 0 but currently num_mismatched
> is being computed as 3.
> {code:java}
> with diff_count as (
> SELECT
> 1 as id,
> COUNT(*) as num_missing FROM (
> (SELECT color, id, msg FROM expected_merge_exclude_columns EXCEPT
> SELECT color, id, msg FROM merge_exclude_columns)
> UNION ALL
> (SELECT color, id, msg FROM merge_exclude_columns EXCEPT
> SELECT color, id, msg FROM expected_merge_exclude_columns)
> ) as a
> ), table_a as (
> SELECT COUNT(*) as num_rows FROM expected_merge_exclude_columns
> ), table_b as (
> SELECT COUNT(*) as num_rows FROM merge_exclude_columns
> ), row_count_diff as (
> select
> 1 as id,
> table_a.num_rows - table_b.num_rows as difference
> from table_a, table_b
> )
> select
> row_count_diff.difference as row_count_difference,
> diff_count.num_missing as num_mismatched
> from row_count_diff
> join diff_count using (id);{code}
> Any help to resolve this issue is appreciated.
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]