[ 
https://issues.apache.org/jira/browse/IMPALA-13122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18062790#comment-18062790
 ] 

Michael Smith edited comment on IMPALA-13122 at 3/4/26 6:25 PM:
----------------------------------------------------------------

[~arnabk1108] I see 
custom_cluster.test_file_metadata_stats.TestFileMetadataStats.test_file_metadata_stats_host_disk_pairs
 failing when run with erasure coding enabled. Please take a look. To reproduce, 
rebuild HDFS with erasure coding enabled and load the necessary test data:
{code:bash}
export ERASURE_CODING=true
./buildall.sh -notests -format -start_minicluster -start_impala_cluster
create_testdata.sh
load-data.py --workloads functional-query --table_format text/none --table_name alltypessmall
{code}
h3. Error Message
{code:java}
AssertionError: Expected at least one line in file /data0/jenkins/workspace/impala-asf-master-core-erasure-coding/repos/Impala/logs/custom_cluster_tests/TestFileMetadataStats/catalogd.impala-ec2-redhat86-m6i-4xlarge-ondemand-17eb.vpc.cloudera.com.jenkins.log.INFO.20260303-213812.1092086 matching regex 'Hosts: \d+', but found none.
{code}
h3. Stacktrace
{code:java}
custom_cluster/test_file_metadata_stats.py:130: in test_file_metadata_stats_host_disk_pairs
    self.assert_catalogd_log_contains("INFO", hosts_regex, expected_count=-1,
        hosts_regex = 'Hosts: \\d+'
        self       = <tests.custom_cluster.test_file_metadata_stats.TestFileMetadataStats object at 0x7f3c14167710>
        tbl_name   = 'functional.alltypessmall'
common/impala_test_suite.py:1724: in assert_catalogd_log_contains
    return self.assert_log_contains(
        daemon     = 'catalogd'
        dry_run    = False
        expected_count = -1
        level      = 'INFO'
        line_regex = 'Hosts: \\d+'
        node_index = 0
        self       = <tests.custom_cluster.test_file_metadata_stats.TestFileMetadataStats object at 0x7f3c14167710>
        timeout_s  = 15
common/impala_test_suite.py:1802: in assert_log_contains
    assert found > 0, "Expected at least one line in file %s matching regex '%s'"\
E   AssertionError: Expected at least one line in file /data0/jenkins/workspace/impala-asf-master-core-erasure-coding/repos/Impala/logs/custom_cluster_tests/TestFileMetadataStats/catalogd.impala-ec2-redhat86-m6i-4xlarge-ondemand-17eb.vpc.cloudera.com.jenkins.log.INFO.20260303-213812.1092086 matching regex 'Hosts: \d+', but found none.
        daemon     = 'catalogd'
        dry_run    = False
        expected_count = -1
        found      = 0
        last_re_result = None
        level      = 'INFO'
        line       = 'I20260303 21:38:19.042086 1092539 catalog-server.cc:790] A catalog update with 6 entries is assembled. Catalog version: 2140 Last sent catalog version: 2139\n'
        line_regex = 'Hosts: \\d+'
        log_file   = <_io.BufferedReader name='/data0/jenkins/workspace/impala-asf-master-core-erasure-coding/repos/Impala/logs/custom_clus...tats/catalogd.impala-ec2-redhat86-m6i-4xlarge-ondemand-17eb.vpc.cloudera.com.jenkins.log.INFO.20260303-213812.1092086'>
        log_file_path = '/data0/jenkins/workspace/impala-asf-master-core-erasure-coding/repos/Impala/logs/custom_cluster_tests/TestFileMetadataStats/catalogd.impala-ec2-redhat86-m6i-4xlarge-ondemand-17eb.vpc.cloudera.com.jenkins.log.INFO.20260303-213812.1092086'
        pattern    = re.compile('Hosts: \\d+')
        re_result  = None
        self       = <tests.custom_cluster.test_file_metadata_stats.TestFileMetadataStats object at 0x7f3c14167710>
        start_time = 1772602699.0046756
        timeout_s  = 15
{code}
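
For anyone reading the trace without the test source at hand, here is a minimal sketch of what a log assertion of this kind checks. It is not Impala's actual {{assert_log_contains}}: the helper name {{count_matching_lines}} is hypothetical, the use of {{re.search}} is an assumption, and the second sample line is made up purely to show what the regex accepts. Only the regex and the catalog-update line are taken from the trace above.

```python
import re


def count_matching_lines(lines, line_regex):
    """Hypothetical helper: count the lines that contain a match for line_regex."""
    pattern = re.compile(line_regex)
    return sum(1 for line in lines if pattern.search(line))


# Last line the test read, copied from the trace. It has no "Hosts: <n>" text.
observed = [
    "I20260303 21:38:19.042086 1092539 catalog-server.cc:790] "
    "A catalog update with 6 entries is assembled. Catalog version: 2140 "
    "Last sent catalog version: 2139\n",
]

# Made-up line (not real Impala output), only to show what would match.
made_up = ["example: Files: 4, Hosts: 3, Disks: 2\n"]

found_observed = count_matching_lines(observed, r"Hosts: \d+")
found_made_up = count_matching_lines(made_up, r"Hosts: \d+")
```

Under these assumptions {{found_observed}} is 0, which corresponds to {{found = 0}} in the trace: the assertion fails because no line in the catalogd INFO log carries a {{Hosts: <n>}} entry.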



> Show file stats in table loading logs
> -------------------------------------
>
>                 Key: IMPALA-13122
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13122
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: Quanlong Huang
>            Assignee: Arnab Karmakar
>            Priority: Major
>              Labels: ramp-up
>             Fix For: Impala 5.0.0
>
>
> Here is an example of the loading logs for a table:
> {noformat}
> I0603 08:46:05.555567 24417 HdfsTable.java:1255] Loading metadata for table 
> definition and all partition(s) of tpcds.store_sales (needed by coordinator)
> I0603 08:46:05.642702 24417 HdfsTable.java:1896] Loaded 23 columns from HMS. 
> Actual columns: 23
> I0603 08:46:05.767457 24417 HdfsTable.java:3114] Load Valid Write Id List 
> Done. Time taken: 26.699us
> I0603 08:46:05.767549 24417 HdfsTable.java:1297] Fetching partition metadata 
> from the Metastore: tpcds.store_sales
> I0603 08:46:05.806337 24417 MetaStoreUtil.java:190] Fetching 1824 partitions 
> for: tpcds.store_sales using partition batch size: 1000 
> I0603 08:46:07.336064 24417 MetaStoreUtil.java:208] Fetched 1000/1824 
> partitions for table tpcds.store_sales
> I0603 08:46:07.915474 24417 MetaStoreUtil.java:208] Fetched 1824/1824 
> partitions for table tpcds.store_sales
> I0603 08:46:07.915519 24417 HdfsTable.java:1304] Fetched partition metadata 
> from the Metastore: tpcds.store_sales
> I0603 08:46:08.840034 24417 ParallelFileMetadataLoader.java:224] Loading file 
> and block metadata for 1824 paths for table tpcds.store_sales using a thread 
> pool of size 5
> I0603 08:46:09.383904 24417 HdfsTable.java:836] Loaded file and block 
> metadata for tpcds.store_sales partitions: ss_sold_date_sk=2450816, 
> ss_sold_date_sk=2450817, ss_sold_date_sk=2450818, and 1821 others. Time 
> taken: 569.107ms
> I0603 08:46:09.420702 24417 Table.java:1117] last refreshed event id for 
> table: tpcds.store_sales set to: -1
> I0603 08:46:09.420794 24417 TableLoader.java:177] Loaded metadata for: 
> tpcds.store_sales (4026ms){noformat}
> From the logs, we know the table has 23 columns and 1824 partitions. The time 
> spent loading the table schema and the file metadata is also shown.
> However, the logs do not reveal whether the partitions suffer from a 
> small-files problem. The underlying storage could also be slow (e.g. S3), 
> which makes loading file metadata take a long time.
> It'd be helpful to add these to the logs:
>  * number of files loaded
>  * min/avg/max of file sizes
>  * total file size
>  * number of files
>  * number of blocks (HDFS only)
>  * number of hosts, disks (HDFS/Ozone only)
>  * Stats of accessTime and lastModifiedTime
> These can be aggregated in FileMetadataLoader#loadInternal() and logged in 
> ParallelFileMetadataLoader#load() or 
> HdfsTable#loadFileMetadataForPartitions().
> [https://github.com/apache/impala/blob/9011b81afa33ef7e4b0ec8a367b2713be8917213/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java#L177]
> [https://github.com/apache/impala/blob/9011b81afa33ef7e4b0ec8a367b2713be8917213/fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java#L172]
> [https://github.com/apache/impala/blob/ee21427d26620b40d38c706b4944d2831f84f6f5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L836]
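
The aggregation proposed in the description could look roughly like the sketch below. It is written in Python for brevity, whereas the loaders referenced above are Java, and it is not the implementation that went into Impala. The class name {{FileStats}}, its fields, and the summary format are all hypothetical; in particular the summary string is illustrative and does not reproduce Impala's real log line.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass
class FileStats:
    """Hypothetical running summary of the files seen while loading a table."""
    num_files: int = 0
    total_bytes: int = 0
    min_bytes: Optional[int] = None
    max_bytes: Optional[int] = None
    hosts: Set[str] = field(default_factory=set)

    def add(self, size_bytes: int, block_hosts: Iterable[str] = ()) -> None:
        """Record one file and the hosts that store its blocks (if known)."""
        self.num_files += 1
        self.total_bytes += size_bytes
        if self.min_bytes is None or size_bytes < self.min_bytes:
            self.min_bytes = size_bytes
        if self.max_bytes is None or size_bytes > self.max_bytes:
            self.max_bytes = size_bytes
        self.hosts.update(block_hosts)

    def summary(self) -> str:
        """Format the stats as one line; the layout is illustrative only."""
        avg = self.total_bytes // self.num_files if self.num_files else 0
        return "Files: %d, Total size: %d, Min/Avg/Max: %d/%d/%d, Hosts: %d" % (
            self.num_files, self.total_bytes,
            self.min_bytes or 0, avg, self.max_bytes or 0, len(self.hosts))


stats = FileStats()
stats.add(100, ["host-a", "host-b"])
stats.add(300, ["host-b"])
stats.add(200)  # a file with no block-location information
line = stats.summary()
```

The third file is added without any hosts to show that such a file still counts toward the size statistics while leaving the host count unchanged. Whether files can actually lack host information in a given storage setup is not something this sketch asserts.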


