[
https://issues.apache.org/jira/browse/IMPALA-13122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898270#comment-17898270
]
Noemi Pap-Takacs commented on IMPALA-13122:
-------------------------------------------
[~stigahuang] I see that the HdfsTable.FileMetadataStats class already
aggregates some of these metrics, such as the number of files, the total file
size and the number of blocks. Would it make sense to add lastAccessTime and
some file size metrics (the average is probably the most informative) to that
class and log them all together?
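
A minimal sketch of the idea, for illustration only. FileStatsSketch and all of
its fields and methods are hypothetical names; this is not the actual
HdfsTable.FileMetadataStats code. It assumes that the access time is available
as epoch milliseconds and that "lastAccessTime" means the most recent access
time over all files.

```java
import java.util.Locale;

/** Hypothetical stand-in for illustration; not HdfsTable.FileMetadataStats. */
final class FileStatsSketch {
  long numFiles;
  long totalFileBytes;
  // Most recent access time seen so far, in epoch millis (assumed unit).
  // Stays 0 until the first file is added.
  long lastAccessTimeMs;

  /** Folds one file's size and access time into the running totals. */
  void addFile(long sizeBytes, long accessTimeMs) {
    numFiles++;
    totalFileBytes += sizeBytes;
    lastAccessTimeMs = Math.max(lastAccessTimeMs, accessTimeMs);
  }

  /** The average is derived on demand, so it needs no extra state. */
  double avgFileBytes() {
    return numFiles == 0 ? 0.0 : (double) totalFileBytes / numFiles;
  }

  /** Renders all metrics as one string, ready to append to a log line. */
  @Override
  public String toString() {
    return String.format(Locale.ROOT,
        "files=%d, totalBytes=%d, avgBytes=%.1f, lastAccessTimeMs=%d",
        numFiles, totalFileBytes, avgFileBytes(), lastAccessTimeMs);
  }
}
```

Since the average only needs the file count and the total size, both of which
are already tracked, lastAccessTime would be the only new field in this sketch.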
> Show file stats in table loading logs
> -------------------------------------
>
> Key: IMPALA-13122
> URL: https://issues.apache.org/jira/browse/IMPALA-13122
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Reporter: Quanlong Huang
> Assignee: Noemi Pap-Takacs
> Priority: Major
> Labels: ramp-up
>
> Here is an example for table loading logs on a table:
> {noformat}
> I0603 08:46:05.555567 24417 HdfsTable.java:1255] Loading metadata for table
> definition and all partition(s) of tpcds.store_sales (needed by coordinator)
> I0603 08:46:05.642702 24417 HdfsTable.java:1896] Loaded 23 columns from HMS.
> Actual columns: 23
> I0603 08:46:05.767457 24417 HdfsTable.java:3114] Load Valid Write Id List
> Done. Time taken: 26.699us
> I0603 08:46:05.767549 24417 HdfsTable.java:1297] Fetching partition metadata
> from the Metastore: tpcds.store_sales
> I0603 08:46:05.806337 24417 MetaStoreUtil.java:190] Fetching 1824 partitions
> for: tpcds.store_sales using partition batch size: 1000
> I0603 08:46:07.336064 24417 MetaStoreUtil.java:208] Fetched 1000/1824
> partitions for table tpcds.store_sales
> I0603 08:46:07.915474 24417 MetaStoreUtil.java:208] Fetched 1824/1824
> partitions for table tpcds.store_sales
> I0603 08:46:07.915519 24417 HdfsTable.java:1304] Fetched partition metadata
> from the Metastore: tpcds.store_sales
> I0603 08:46:08.840034 24417 ParallelFileMetadataLoader.java:224] Loading file
> and block metadata for 1824 paths for table tpcds.store_sales using a thread
> pool of size 5
> I0603 08:46:09.383904 24417 HdfsTable.java:836] Loaded file and block
> metadata for tpcds.store_sales partitions: ss_sold_date_sk=2450816,
> ss_sold_date_sk=2450817, ss_sold_date_sk=2450818, and 1821 others. Time
> taken: 569.107ms
> I0603 08:46:09.420702 24417 Table.java:1117] last refreshed event id for
> table: tpcds.store_sales set to: -1
> I0603 08:46:09.420794 24417 TableLoader.java:177] Loaded metadata for:
> tpcds.store_sales (4026ms)
> {noformat}
> From the logs, we know the table has 23 columns and 1824 partitions. The time
> spent loading the table schema and the file metadata is also shown.
> However, the logs don't show whether the partitions have a small-files
> issue. The underlying storage could also be slow (e.g. S3), which results
> in a long file metadata loading time.
> It'd be helpful to add these to the logs:
> * number of files loaded
> * min/avg/max of file sizes
> * total file size
> * number of files
> * number of blocks (HDFS only)
> * number of hosts, disks (HDFS/Ozone only)
> * Stats of accessTime and lastModifiedTime
> These can be aggregated in FileMetadataLoader#loadInternal() and logged in
> ParallelFileMetadataLoader#load() or
> HdfsTable#loadFileMetadataForPartitions().
> [https://github.com/apache/impala/blob/9011b81afa33ef7e4b0ec8a367b2713be8917213/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java#L177]
> [https://github.com/apache/impala/blob/9011b81afa33ef7e4b0ec8a367b2713be8917213/fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java#L172]
> [https://github.com/apache/impala/blob/ee21427d26620b40d38c706b4944d2831f84f6f5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L836]
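
The min/avg/max part of the list above could be sketched as follows. This is a
rough illustration under stated assumptions, not Impala code: FileSizeLogSketch
and its methods are hypothetical names, the log wording is invented, and the
sketch assumes that each loader thread produces one summary per path which is
then merged at the table level. java.util.LongSummaryStatistics is the standard
JDK class used for the aggregation.

```java
import java.util.List;
import java.util.Locale;
import java.util.LongSummaryStatistics;

/** Hypothetical helper for illustration; not part of the Impala code base. */
final class FileSizeLogSketch {
  /** Summarizes the file sizes found under one path (e.g. one partition). */
  static LongSummaryStatistics summarize(long[] fileSizes) {
    LongSummaryStatistics stats = new LongSummaryStatistics();
    for (long size : fileSizes) stats.accept(size);
    return stats;
  }

  /** Merges per-path summaries into a single table-level summary. */
  static LongSummaryStatistics merge(List<LongSummaryStatistics> perPath) {
    LongSummaryStatistics total = new LongSummaryStatistics();
    for (LongSummaryStatistics s : perPath) total.combine(s);
    return total;
  }

  /** Builds the log message; the wording is an assumption, not a real log. */
  static String format(String tableName, LongSummaryStatistics s) {
    // An empty summary reports Long.MAX_VALUE/Long.MIN_VALUE as min/max,
    // so the no-files case is handled separately.
    if (s.getCount() == 0) {
      return String.format(Locale.ROOT, "Loaded 0 files for %s", tableName);
    }
    return String.format(Locale.ROOT,
        "Loaded %d files for %s. Total size: %d bytes, "
            + "min/avg/max file size: %d/%.1f/%d bytes",
        s.getCount(), tableName, s.getSum(),
        s.getMin(), s.getAverage(), s.getMax());
  }
}
```

Because combine() merges two summaries without revisiting the individual files,
each path can be summarized independently and the results merged once all
loaders have finished.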