[
https://issues.apache.org/jira/browse/IMPALA-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Rozsa updated IMPALA-12861:
---------------------------------
Fix Version/s: Impala 4.5.0
> File formats are confused when Iceberg tables has mixed formats
> ---------------------------------------------------------------
>
> Key: IMPALA-12861
> URL: https://issues.apache.org/jira/browse/IMPALA-12861
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 4.3.0
> Reporter: Gabor Kaszab
> Assignee: Peter Rozsa
> Priority: Major
> Labels: impala-iceberg
> Fix For: Impala 4.5.0
>
> Attachments: multi_file_table_crash
>
>
> *Repro steps:*
> create table mixed_ice (i int, year int) partitioned by spec (year) stored as
> iceberg tblproperties('format-version'='2');
>
> 1) populate one partition with Impala (parquet)
> insert into mixed_ice values (1, 2024), (2, 2024);
>
> 2) change the write format:
> alter table mixed_ice set tblproperties ('write.format.default'='orc');
>
> 3) populate another partition with Hive (orc)
> insert into mixed_ice values (1, 2025), (2, 2025), (3, 2025);
>
> 4) then query just the parquet partition:
> explain select * from mixed_ice where year = 2024;
> {code:java}
> | F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> | Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB thread-reservation=1
> | PLAN-ROOT SINK
> | | output exprs: default.mixed_ice.i, default.mixed_ice.year
> | | mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB thread-reservation=0
> | |
> | 01:EXCHANGE [UNPARTITIONED]
> | mem-estimate=16.00KB mem-reservation=0B thread-reservation=0
> | tuple-ids=0 row-size=8B cardinality=2
> | in pipelines: 00(GETNEXT)
> |
> | F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
> | Per-Host Resources: mem-estimate=64.05MB mem-reservation=32.00KB thread-reservation=2
> | DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED]
> | | mem-estimate=48.00KB mem-reservation=0B thread-reservation=0
> | 00:SCAN HDFS [default.mixed_ice, RANDOM]
> | HDFS partitions=1/1 files=1 size=602B
> | Iceberg snapshot id: 4964066258730898133
> | skipped Iceberg predicates: `year` = CAST(2024 AS INT)
> | stored statistics:
> | table: rows=5 size=945B
> | columns: unavailable
> | extrapolated-rows=disabled max-scan-range-rows=5
> | file formats: [ORC, PARQUET]
> | mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1
> | tuple-ids=0 row-size=8B cardinality=2
> | in pipelines: 00(GETNEXT)
> +------------------------------------------------------------------------------------------+
> {code}
> Note the "file formats: [ORC, PARQUET]" line, even though this query reads
> only a Parquet file.
>
> *Some analysis:*
> When IcebergScanNode [is
> created|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java#L129]
> it holds the correct information about file formats (Parquet).
> Later on, the parent class, HdfsScanNode, also tries to populate the file
> formats [here|#L513].
>
> It uses what
> [getSampledOrRawPartitions()|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L431]
> returns. In this case 'sampledPartitions_' is null, so it returns
> 'partitions_'.
>
> Apparently, this 'partitions_' member holds the partition with the ORC file,
> so ORC is added to fileFormats_. Unfortunately, getSampledOrRawPartitions()
> is called in multiple locations within HdfsScanNode, where it returns the
> wrong partition.
> *Next steps:*
> Check what other issues getSampledOrRawPartitions() can cause with tables
> that contain multiple file formats. Also check whether 'partitions_' can be
> populated properly.
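
The analysis above can be illustrated with a small, self-contained Java sketch. It is not Impala code: MixedFormatSketch, DataFile, selectFiles and formatsOf are hypothetical names, and the model assumes a single table-level file listing that is filtered by the partition predicate. It only shows why deriving the format list from the full listing reports ORC as well, while deriving it from the files selected for the scan reports Parquet alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model; none of these types are Impala's real classes.
final class MixedFormatSketch {
  enum FileFormat { ORC, PARQUET }

  // One data file of the table, tagged with its partition value and format.
  static final class DataFile {
    final String path;
    final int year;
    final FileFormat format;

    DataFile(String path, int year, FileFormat format) {
      this.path = path;
      this.year = year;
      this.format = format;
    }
  }

  // Keeps only the files whose partition value matches the predicate.
  static List<DataFile> selectFiles(List<DataFile> allFiles, int year) {
    List<DataFile> selected = new ArrayList<>();
    for (DataFile f : allFiles) {
      if (f.year == year) selected.add(f);
    }
    return selected;
  }

  // Collects the distinct formats of the given files.
  static Set<FileFormat> formatsOf(List<DataFile> files) {
    Set<FileFormat> formats = EnumSet.noneOf(FileFormat.class);
    for (DataFile f : files) formats.add(f.format);
    return formats;
  }

  public static void main(String[] args) {
    // Assumed layout mirroring the repro: one Parquet file for 2024, one ORC file for 2025.
    List<DataFile> tableFiles = Arrays.asList(
        new DataFile("year=2024/a.parq", 2024, FileFormat.PARQUET),
        new DataFile("year=2025/b.orc", 2025, FileFormat.ORC));

    // Derived from everything the table-level listing holds.
    System.out.println(formatsOf(tableFiles));
    // Derived only from the files the scan will actually read.
    System.out.println(formatsOf(selectFiles(tableFiles, 2024)));
  }
}
```

Run as written, main prints [ORC, PARQUET] followed by [PARQUET], which matches the discrepancy between the reported plan and the files the query actually touches.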