[ 
https://issues.apache.org/jira/browse/IMPALA-13996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17947606#comment-17947606
 ] 

Quanlong Huang edited comment on IMPALA-13996 at 4/27/25 10:17 AM:
-------------------------------------------------------------------

Double-checked: when built with TARGET_FILESYSTEM=hdfs and ERASURE_CODING=true, 
the input data file /test-warehouse/tpch.lineitem/lineitem.tbl has two block 
groups:
{noformat}
$ hadoop fsck /test-warehouse/tpch.lineitem/lineitem.tbl
Erasure Coded Block Groups:
 Total size:    753862072 B
 Total files:   1
 Total block groups (validated):        2 (avg. block group size 376931036 B)
 Minimally erasure-coded block groups:  2 (100.0 %)
 Over-erasure-coded block groups:       0 (0.0 %)
 Under-erasure-coded block groups:      0 (0.0 %)
 Unsatisfactory placement block groups: 0 (0.0 %)
 Average block group size:      5.0
 Missing block groups:          0
 Corrupt block groups:          0
 Missing internal blocks:       0 (0.0 %)
 Blocks queued for replication: 0
FSCK ended at Sun Apr 27 02:01:07 PDT 2025 in 1 milliseconds{noformat}
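As a rough cross-check of the block group count, here is a back-of-the-envelope sketch. It assumes an erasure coding policy with 3 data blocks per block group and a 128 MiB HDFS block size. Neither value is stated in the fsck output above, and the function name is only illustrative.

```python
def num_block_groups(file_size, data_blocks_per_group, block_size):
    """Return how many EC block groups are needed to hold file_size bytes.

    A block group stores up to data_blocks_per_group * block_size bytes of
    user data; parity blocks add redundancy, not capacity.
    """
    group_capacity = data_blocks_per_group * block_size
    return -(-file_size // group_capacity)  # ceiling division


FILE_SIZE = 753862072            # "Total size" reported by fsck
BLOCK_SIZE = 128 * 1024 * 1024   # assumption: 128 MiB block size

# Assumption: 3 data blocks per group (hypothetical policy for this build).
print(num_block_groups(FILE_SIZE, 3, BLOCK_SIZE))
# For comparison: a policy with 6 data blocks per group.
print(num_block_groups(FILE_SIZE, 6, BLOCK_SIZE))
```

Under these assumptions the file needs two block groups, which matches the fsck output. With 6 data blocks per group it would fit in a single block group, so the count depends on the policy in use.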
Note that this file is copied by a LOAD DATA statement in Hive:
{code:sql}
LOAD DATA LOCAL INPATH 
'{impala_home}/testdata/impala-data/{db_name}/{table_name}'
OVERWRITE INTO TABLE {db_name}{db_suffix}.{table_name};{code}
https://github.com/apache/impala/blob/74bd0832ed20aa0c2d1ef35428b2337b973cbcf4/testdata/datasets/tpch/tpch_schema_template.sql#L67-L68

So when reading from tpch.lineitem, there are only two fragment instances, and 
thus tpch_parquet.lineitem has only two files in erasure coding builds, not the 
3 files that the expected error message in the test assumes.

We shouldn't rely on the number of files of tpch_parquet.lineitem in this test.
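
One possible way to make the check independent of the file count is to match the error message with a pattern instead of a fixed string. This is only a sketch, not the actual fix: the helper name and the regex are hypothetical, and only the message text is taken from the assertion in the issue description.

```python
import re

# Hypothetical pattern: same message as in the failing assertion, but the
# file count is captured instead of being hardcoded to 3.
TOO_MANY_FILES_RE = re.compile(
    r"Too many files to collect in table tpch_parquet\.lineitem: (\d+)\. "
    r"Current limit is 1 configured by startup flag "
    r"'catalog_partial_fetch_max_files'\. "
    r"Consider compacting files of the table\.")


def assert_too_many_files(error_text, limit=1):
    """Check the error is reported and return the file count it mentions."""
    match = TOO_MANY_FILES_RE.search(error_text)
    assert match is not None, error_text
    num_files = int(match.group(1))
    # The error only makes sense if the table has more files than the limit.
    assert num_files > limit
    return num_files
```

With a helper like this, the test would pass whether the table has two files (erasure coding builds) or three, as long as the count exceeds the configured limit.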


> TestAllowIncompleteData.test_too_many_files fails erasure coding builds
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-13996
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13996
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Surya Hebbar
>            Assignee: Quanlong Huang
>            Priority: Major
>
> TestAllowIncompleteData.test_too_many_files fails erasure coding builds -
> Error -
> {code}
> assert "Too many files to collect in table tpch_parquet.lineitem: 3. Current 
> limit is 1 configured by startup flag 'catalog_partial_fetch_max_files'. 
> Consider compacting files of the table." in "Query 
> f74919e60b835567:da9967a400000000 failed:\nLocalCatalogException: Could not 
> load partitions for table tpch_parq...t limit is 1 configured by startup flag 
> 'catalog_partial_fetch_max_files'. Consider compacting files of the 
> table.\n\n" + where "Query f74919e60b835567:da9967a400000000 
> failed:\nLocalCatalogException: Could not load partitions for table 
> tpch_parq...t limit is 1 configured by startup flag 
> 'catalog_partial_fetch_max_files'. Consider compacting files of the 
> table.\n\n" = str(ImpalaBeeswaxException()){code}
>  
> Stacktrace -
> {code}
> custom_cluster/test_local_catalog.py:721: in test_too_many_files
>     assert err in str(exception)
> E   assert "Too many files to collect in table tpch_parquet.lineitem: 3. 
> Current limit is 1 configured by startup flag 
> 'catalog_partial_fetch_max_files'. Consider compacting files of the table." 
> in "Query f74919e60b835567:da9967a400000000 failed:\nLocalCatalogException: 
> Could not load partitions for table tpch_parq...t limit is 1 configured by 
> startup flag 'catalog_partial_fetch_max_files'. Consider compacting files of 
> the table.\n\n"
> E    +  where "Query f74919e60b835567:da9967a400000000 
> failed:\nLocalCatalogException: Could not load partitions for table 
> tpch_parq...t limit is 1 configured by startup flag 
> 'catalog_partial_fetch_max_files'. Consider compacting files of the 
> table.\n\n" = str(ImpalaBeeswaxException())
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
