[
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470802#comment-17470802
]
Harsha Teja Kanna commented on HUDI-3066:
-----------------------------------------
Hi
I am working on re-testing this. Seeing other un-related issues with partition
discovery schema inference without wildcard in the base path.
Will post my result soon.
Thanks.
> Very slow file listing after enabling metadata for existing tables in 0.10.0
> release
> ------------------------------------------------------------------------------------
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
> Reporter: Harsha Teja Kanna
> Assignee: sivabalan narayanan
> Priority: Critical
> Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png,
> Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12
> PM.png, metadata_files.txt, metadata_files_compacted.txt,
> metadata_timeline.txt, metadata_timeline_archived.txt,
> metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt,
> timeline.txt, writer_log.txt
>
>
> After 'metadata table' is enabled, File listing takes long time.
> If metadata is enabled on Reader side(as shown below), it is taking even more
> time per file listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(),
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(),
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView
> resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata
> Also many log messages in AbstractHoodieLogRecordReader
>
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms
> to read 136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a
> data block from file
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.76_0-20-515
> at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next
> reader for logfile
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.121_0-57-663',
> fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log
> file
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.20_0-35-613',
> fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a
> data block from file
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.62_0-34-377
> at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next
> reader for logfile
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.86_0-20-362',
> fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log
> file
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.121_0-57-663',
> fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek
> policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next
> reader for logfile
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.77_0-35-590',
> fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a
> data block from file
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.20_0-35-613
> at instant 20211216183448389
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Number of
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,118 INFO log.HoodieLogFormatReader: Moving to the next
> reader for logfile
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.63_0-56-519',
> fileLen=0}
> 2021-12-18 23:37:46,122 INFO log.AbstractHoodieLogRecordReader: Scanning log
> file
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.86_0-20-362',
> fileLen=0}
> 2021-12-18 23:37:46,122 INFO s3a.S3AInputStream: Switching to Random IO seek
> policy
> 2021-12-18 23:37:46,123 INFO log.AbstractHoodieLogRecordReader: Reading a
> data block from file
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.121_0-57-663
> at instant 20211217090337935
> 2021-12-18 23:37:46,123 INFO log.AbstractHoodieLogRecordReader: Number of
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,127 INFO log.AbstractHoodieLogRecordReader: Scanning log
> file
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.77_0-35-590',
> fileLen=0}
> 2021-12-18 23:37:46,127 INFO s3a.S3AInputStream: Switching to Random IO seek
> policy
> 2021-12-18 23:37:46,141 INFO log.HoodieLogFormatReader: Moving to the next
> reader for logfile
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.21_0-56-795',
> fileLen=0}
> 2021-12-18 23:37:46,143 INFO log.AbstractHoodieLogRecordReader: Scanning log
> file
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.63_0-56-519',
> fileLen=0}
> 2021-12-18 23:37:46,143 INFO s3a.S3AInputStream: Switching to Random IO seek
> policy
> 2021-12-18 23:37:46,144 INFO log.AbstractHoodieLogRecordReader: Reading a
> data block from file
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.86_0-20-362
> at instant 20211217045206861
> 2021-12-18 23:37:46,144 INFO log.AbstractHoodieLogRecordReader: Number of
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,145 INFO log.AbstractHoodieLogRecordReader: Reading a
> data block from file
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.77_0-35-590
> at instant 20211217035058289
> 2021-12-18 23:37:46,145 INFO log.AbstractHoodieLogRecordReader: Number of
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,162 INFO log.AbstractHoodieLogRecordReader: Scanning log
> file
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.21_0-56-795',
> fileLen=0}
> 2021-12-18 23:37:46,162 INFO log.AbstractHoodieLogRecordReader: Reading a
> data block from file
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.63_0-56-519
> at instant 20211217022306173
> 2021-12-18 23:37:46,162 INFO s3a.S3AInputStream: Switching to Random IO seek
> policy
> 2021-12-18 23:37:46,162 INFO log.AbstractHoodieLogRecordReader: Number of
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,170 INFO log.HoodieLogFormatReader: Moving to the next
> reader for logfile
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.87_0-35-433',
> fileLen=0}
> 2021-12-18 23:37:46,171 INFO log.HoodieLogFormatReader: Moving to the next
> reader for logfile
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.78_0-56-705',
> fileLen=0}
> 2021-12-18 23:37:46,178 INFO log.AbstractHoodieLogRecordReader: Reading a
> data block from file
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.21_0-56-795
> at instant 20211216184106961
> 2021-12-18 23:37:46,178 INFO log.AbstractHoodieLogRecordReader: Number of
> remaining logblocks to merge
> 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Number of log
> files scanned => 437
> 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner:
> MaxMemoryInBytes allowed for compaction => 1073741824
> 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Number of
> entries in MemoryBasedMap in ExternalSpillableMap => 165
> 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Total size in
> bytes of MemoryBasedMap in ExternalSpillableMap => 259380
> 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Number of
> entries in BitCaskDiskMap in ExternalSpillableMap => 0
> 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Size of file
> spilled to disk => 0
> 2021-12-18 23:38:35,380 INFO metadata.HoodieBackedTableMetadata: Opened 437
> metadata log files (dataset instant=20211218233649435, metadata
> instant=20211218233649435) in 22935 ms
> 2021-12-18 23:38:37,193 INFO metadata.HoodieBackedTableMetadata: Opened 437
> metadata log files (dataset instant=20211218233649435, metadata
> instant=20211218233649435) in 22802 ms
> Sample file listing tasks
> !Screen Shot 2021-12-18 at 6.16.29 PM.png!
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)