[ https://issues.apache.org/jira/browse/IMPALA-13122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18062539#comment-18062539 ]
ASF subversion and git services commented on IMPALA-13122:
----------------------------------------------------------
Commit 31769a7fb50ae1d6b6d69d366a776df441e00e3a in impala's branch
refs/heads/master from Arnab Karmakar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=31769a7fb ]
IMPALA-13122: Add detailed file metadata statistics to table loading logs
This patch enhances table loading logs to include comprehensive file
metadata statistics, making it easier to identify small-file issues
and to diagnose slow storage performance.
The following statistics are now logged when loading file metadata:
- Number of files and blocks
- File sizes (min/avg/max)
- Total file size
- Modification times (min/max)
- Access times (min/max)
- Number of host:disk pairs (HDFS/Ozone only)
Example log output:
Loaded file and block metadata for functional.alltypes partitions:
year=2009/month=1, year=2009/month=10, year=2009/month=11, and 21
others. Time taken: 13.474ms. Files: 24, Blocks: 24, Total size:
478.45KB, File sizes (min/avg/max): 18.12KB/19.93KB/20.36KB,
Modification times (min/max): 2026-02-17 01:28:17/2026-02-17 01:28:21,
Access times (min/max): 2026-02-24 00:58:39/2026-02-24 00:58:39,
Hosts: 3, Host:Disk pairs: 3
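To make the statistics listed above concrete, here is a minimal Java sketch of
how such counters could be accumulated and rendered. It is not the code from
the commit: the class name FileStatsSketch, its methods, the unit formatting,
and the UTC timestamps are all assumptions made for illustration, and access
times and host:disk pairs are left out for brevity.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical accumulator for file metadata statistics (not Impala's class). */
final class FileStatsSketch {
    // Assumption: timestamps are rendered in UTC; the real log may use local time.
    private static final DateTimeFormatter TIME_FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    private long numFiles;
    private long numBlocks;
    private long totalSize;
    private long minSize = Long.MAX_VALUE;
    private long maxSize = Long.MIN_VALUE;
    private long minModTime = Long.MAX_VALUE;
    private long maxModTime = Long.MIN_VALUE;

    /** Records one file: its size in bytes, block count, and mtime in epoch millis. */
    void addFile(long sizeBytes, int blocks, long modTimeMillis) {
        numFiles++;
        numBlocks += blocks;
        totalSize += sizeBytes;
        minSize = Math.min(minSize, sizeBytes);
        maxSize = Math.max(maxSize, sizeBytes);
        minModTime = Math.min(minModTime, modTimeMillis);
        maxModTime = Math.max(maxModTime, modTimeMillis);
    }

    /** Renders a byte count with binary units and two decimals, e.g. 1536 -> 1.50KB. */
    static String prettyBytes(long bytes) {
        if (bytes < 1024) return bytes + "B";
        String[] units = {"KB", "MB", "GB", "TB"};
        double value = bytes;
        int unit = -1;
        do {
            value /= 1024.0;
            unit++;
        } while (value >= 1024.0 && unit < units.length - 1);
        return String.format(Locale.ROOT, "%.2f%s", value, units[unit]);
    }

    /** Builds the statistics suffix of a log line; empty input yields "Files: 0". */
    String summary() {
        if (numFiles == 0) return "Files: 0";
        long avgSize = totalSize / numFiles;
        return String.format(Locale.ROOT,
            "Files: %d, Blocks: %d, Total size: %s, File sizes (min/avg/max): %s/%s/%s, "
                + "Modification times (min/max): %s/%s",
            numFiles, numBlocks, prettyBytes(totalSize),
            prettyBytes(minSize), prettyBytes(avgSize), prettyBytes(maxSize),
            TIME_FMT.format(Instant.ofEpochMilli(minModTime)),
            TIME_FMT.format(Instant.ofEpochMilli(maxModTime)));
    }

    public static void main(String[] args) {
        FileStatsSketch stats = new FileStatsSketch();
        stats.addFile(18_555L, 1, 1_771_291_697_000L);
        stats.addFile(20_849L, 1, 1_771_291_701_000L);
        System.out.println(stats.summary());
    }
}
```

Guarding the empty case avoids printing the Long.MAX_VALUE/Long.MIN_VALUE
sentinels when a partition contains no files.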
Testing:
- Added JUnit tests to verify statistics collection accuracy
- Added new Python end-to-end tests covering various cases
Change-Id: I6f4592f173c047e5064058402f83be6d1f5c9a79
Reviewed-on: http://gerrit.cloudera.org:8080/23906
Reviewed-by: Impala Public Jenkins
Tested-by: Impala Public Jenkins
> Show file stats in table loading logs
> -------------------------------------
>
> Key: IMPALA-13122
> URL: https://issues.apache.org/jira/browse/IMPALA-13122
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Reporter: Quanlong Huang
> Assignee: Arnab Karmakar
> Priority: Major
> Labels: ramp-up
>
> Here is an example for table loading logs on a table:
> {noformat}
> I0603 08:46:05.555567 24417 HdfsTable.java:1255] Loading metadata for table
> definition and all partition(s) of tpcds.store_sales (needed by coordinator)
> I0603 08:46:05.642702 24417 HdfsTable.java:1896] Loaded 23 columns from HMS.
> Actual columns: 23
> I0603 08:46:05.767457 24417 HdfsTable.java:3114] Load Valid Write Id List
> Done. Time taken: 26.699us
> I0603 08:46:05.767549 24417 HdfsTable.java:1297] Fetching partition metadata
> from the Metastore: tpcds.store_sales
> I0603 08:46:05.806337 24417 MetaStoreUtil.java:190] Fetching 1824 partitions
> for: tpcds.store_sales using partition batch size: 1000
> I0603 08:46:07.336064 24417 MetaStoreUtil.java:208] Fetched 1000/1824
> partitions for table tpcds.store_sales
> I0603 08:46:07.915474 24417 MetaStoreUtil.java:208] Fetched 1824/1824
> partitions for table tpcds.store_sales
> I0603 08:46:07.915519 24417 HdfsTable.java:1304] Fetched partition metadata
> from the Metastore: tpcds.store_sales
> I0603 08:46:08.840034 24417 ParallelFileMetadataLoader.java:224] Loading file
> and block metadata for 1824 paths for table tpcds.store_sales using a thread
> pool of size 5
> I0603 08:46:09.383904 24417 HdfsTable.java:836] Loaded file and block
> metadata for tpcds.store_sales partitions: ss_sold_date_sk=2450816,
> ss_sold_date_sk=2450817, ss_sold_date_sk=2450818, and 1821 others. Time
> taken: 569.107ms
> I0603 08:46:09.420702 24417 Table.java:1117] last refreshed event id for
> table: tpcds.store_sales set to: -1
> I0603 08:46:09.420794 24417 TableLoader.java:177] Loaded metadata for:
> tpcds.store_sales (4026ms){noformat}
> From the logs, we know the table has 23 columns and 1824 partitions. The time
> spent loading the table schema and file metadata is also shown.
> However, the logs don't show whether the partitions have a small-files
> problem. The underlying storage could also be slow (e.g. S3), which results
> in a long file metadata loading time.
> It'd be helpful to add these in the logs:
> * number of files loaded
> * min/avg/max of file sizes
> * total file size
> * number of blocks (HDFS only)
> * number of hosts, disks (HDFS/Ozone only)
> * Stats of accessTime and lastModifiedTime
> These can be aggregated in FileMetadataLoader#loadInternal() and logged in
> ParallelFileMetadataLoader#load() or
> HdfsTable#loadFileMetadataForPartitions().
> [https://github.com/apache/impala/blob/9011b81afa33ef7e4b0ec8a367b2713be8917213/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java#L177]
> [https://github.com/apache/impala/blob/9011b81afa33ef7e4b0ec8a367b2713be8917213/fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java#L172]
> [https://github.com/apache/impala/blob/ee21427d26620b40d38c706b4944d2831f84f6f5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L836]
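The issue suggests aggregating statistics in the per-path loader and logging
them from the parallel loader. One way that split could look is sketched
below, assuming each worker fills its own statistics object and the caller
merges the results afterwards, so no locking is needed. The names
MergeableStatsSketch, Stats, and loadAll are hypothetical and do not come from
the Impala code base; file sizes are passed in directly instead of being read
from storage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: per-path statistics merged after parallel loading. */
final class MergeableStatsSketch {
    /** Plain counters; each instance is touched by only one thread at a time. */
    static final class Stats {
        long numFiles;
        long totalSize;
        long minSize = Long.MAX_VALUE;
        long maxSize = Long.MIN_VALUE;

        void add(long sizeBytes) {
            numFiles++;
            totalSize += sizeBytes;
            minSize = Math.min(minSize, sizeBytes);
            maxSize = Math.max(maxSize, sizeBytes);
        }

        /** Folds another partition's counters into this one. */
        void merge(Stats other) {
            numFiles += other.numFiles;
            totalSize += other.totalSize;
            minSize = Math.min(minSize, other.minSize);
            maxSize = Math.max(maxSize, other.maxSize);
        }
    }

    /** Computes per-path stats on a fixed-size pool, then merges them serially. */
    static Stats loadAll(List<long[]> fileSizesPerPath, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Stats>> futures = new ArrayList<>();
            for (long[] sizes : fileSizesPerPath) {
                futures.add(pool.submit(() -> {
                    Stats local = new Stats();
                    for (long size : sizes) local.add(size);
                    return local;
                }));
            }
            Stats total = new Stats();
            for (Future<Stats> future : futures) {
                total.merge(future.get());  // blocks until that path is done
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while loading stats", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A per-path load failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<long[]> paths = new ArrayList<>();
        paths.add(new long[] {18_555L, 20_849L});
        paths.add(new long[] {19_900L});
        Stats total = loadAll(paths, 5);
        System.out.println("Files: " + total.numFiles + ", Total size: " + total.totalSize);
    }
}
```

Because merging only uses sums, minima, and maxima, the order in which paths
finish does not affect the result, and an empty path merges as a no-op.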