[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463531#comment-17463531 ]
sivabalan narayanan commented on HUDI-3066: ------------------------------------------- key length > 0 : is your dataset non-partitioned by any chance ? > Very slow file listing after enabling metadata for existing tables in 0.10.0 > release > ------------------------------------------------------------------------------------ > > Key: HUDI-3066 > URL: https://issues.apache.org/jira/browse/HUDI-3066 > Project: Apache Hudi > Issue Type: Bug > Affects Versions: 0.10.0 > Environment: EMR 6.4.0 > Hudi version : 0.10.0 > Reporter: Harsha Teja Kanna > Assignee: Manoj Govindassamy > Priority: Blocker > Labels: performance, pull-request-available > Fix For: 0.11.0 > > Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot > 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, > metadata_files.txt, metadata_files_compacted.txt, metadata_timeline.txt, > metadata_timeline_archived.txt, metadata_timeline_compacted.txt, > stderr_part1.txt, stderr_part2.txt, timeline.txt, writer_log.txt > > > After 'metadata table' is enabled, File listing takes long time. > If metadata is enabled on Reader side(as shown below), it is taking even more > time per file listing task > {code:java} > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.common.config.HoodieMetadataConfig > val hadoopConf = spark.conf > hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true") > val basePath = "s3a://datalake-hudi" > val sessions = spark > .read > .format("org.apache.hudi") > .option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL) > .option(DataSourceReadOptions.READ_PATHS.key(), > s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*") > .load() > sessions.createOrReplaceTempView("sessions") {code} > Existing tables (COW) have inline clustering on and have many replace commits. > Logs seem to suggest the delay is in view.AbstractTableFileSystemView > resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata > Also many log messages in AbstractHoodieLogRecordReader > > 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms > to read 136 instants, 9731 replaced file groups > 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.76_0-20-515 > at instant 20211217035105329 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.20_0-35-613', > fileLen=0} > 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.62_0-34-377 > at instant 20211217022049877 > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.86_0-20-362', > fileLen=0} > 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.77_0-35-590', > fileLen=0} > 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.20_0-35-613 > at instant 20211216183448389 > 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,118 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.63_0-56-519', > fileLen=0} > 2021-12-18 23:37:46,122 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.86_0-20-362', > fileLen=0} > 2021-12-18 23:37:46,122 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,123 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.121_0-57-663 > at instant 20211217090337935 > 2021-12-18 23:37:46,123 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,127 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.77_0-35-590', > fileLen=0} > 2021-12-18 23:37:46,127 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,141 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.21_0-56-795', > fileLen=0} > 2021-12-18 23:37:46,143 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.63_0-56-519', > fileLen=0} > 2021-12-18 23:37:46,143 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,144 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.86_0-20-362 > at instant 20211217045206861 > 2021-12-18 23:37:46,144 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,145 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.77_0-35-590 > at instant 20211217035058289 > 2021-12-18 23:37:46,145 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,162 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.21_0-56-795', > fileLen=0} > 2021-12-18 23:37:46,162 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.63_0-56-519 > at instant 20211217022306173 > 2021-12-18 23:37:46,162 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,162 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,170 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.87_0-35-433', > fileLen=0} > 2021-12-18 23:37:46,171 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.78_0-56-705', > fileLen=0} > 2021-12-18 23:37:46,178 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.21_0-56-795 > at instant 20211216184106961 > 2021-12-18 23:37:46,178 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge > 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Number of log > files scanned => 437 > 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: > MaxMemoryInBytes allowed for compaction => 1073741824 > 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Number of > entries in MemoryBasedMap in ExternalSpillableMap => 165 > 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Total size in > bytes of MemoryBasedMap in ExternalSpillableMap => 259380 > 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Number of > entries in BitCaskDiskMap in ExternalSpillableMap => 0 > 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Size of file > spilled to disk => 0 > 2021-12-18 23:38:35,380 INFO metadata.HoodieBackedTableMetadata: Opened 437 > metadata log files (dataset instant=20211218233649435, metadata > instant=20211218233649435) in 22935 ms > 2021-12-18 23:38:37,193 INFO metadata.HoodieBackedTableMetadata: Opened 437 > metadata log files (dataset instant=20211218233649435, metadata > instant=20211218233649435) in 22802 ms > Sample file listing tasks > !Screen Shot 2021-12-18 at 6.16.29 PM.png! > -- This message was sent by Atlassian Jira (v8.20.1#820001)