[
https://issues.apache.org/jira/browse/HUDI-9787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18018187#comment-18018187
]
Timothy Brown commented on HUDI-9787:
-------------------------------------
This looks like it is already fixed in the master branch:
https://github.com/apache/hudi/pull/12125/files
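For context on the cost: the overlap check quoted below scans the whole candidate-file-name collection once per file slice, which is quadratic when both collections are large. A minimal, self-contained sketch (hypothetical file names, not Hudi code) of that cost, and of the set-based lookup that avoids it, assuming the fix works roughly along those lines:

```scala
object OverlapCheckSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical inputs: files belonging to one file slice, and the
    // candidate file names produced by the index lookup.
    val fileSliceFiles: Set[String] = Set("base-1.parquet", "log-1", "log-2")
    val candidateFileNames: Seq[String] =
      (0 until 50000).map(i => s"base-$i.parquet")

    // As in filterFileSlices: for EACH file slice, scan the entire candidate
    // Seq -> O(|slices| * |candidates|) overall.
    val slowOverlap = candidateFileNames.exists(fileSliceFiles.contains)

    // Building a Set of candidate names ONCE makes each per-slice check
    // O(|slice files|) -> O(|slices| + |candidates|) overall.
    val candidateSet: Set[String] = candidateFileNames.toSet
    val fastOverlap = fileSliceFiles.exists(candidateSet.contains)

    // Both strategies answer the same question; only the cost differs.
    assert(slowOverlap == fastOverlap)
    println(s"overlap found: $fastOverlap")
  }
}
```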
> HoodieFileIndex file listing perf issue
> ---------------------------------------
>
> Key: HUDI-9787
> URL: https://issues.apache.org/jira/browse/HUDI-9787
> Project: Apache Hudi
> Issue Type: Bug
> Components: index
> Reporter: Davis Zhang
> Priority: Major
> Fix For: 1.2.0
>
> Attachments: image-2025-09-04-10-23-53-374.png
>
>
> org.apache.hudi.HoodieFileIndex#filterFileSlices
>
> {code:java}
> val prunedPartitionsAndFilteredFileSlices = prunedPartitionsAndFileSlices.map {
>   case (partitionOpt, fileSlices) =>
>     // Filter in candidate files based on the col-stats or record level index lookup
>     val candidateFileSlices: Seq[FileSlice] = {
>       fileSlices.filter(fs => {
>         val fileSliceFiles = fs.getLogFiles
>           .map[String](JFunction.toJavaFunction[HoodieLogFile, String](lf => lf.getPath.getName))
>           .collect(Collectors.toSet[String])
>         val baseFileStatusOpt = getBaseFileStatus(Option.apply(fs.getBaseFile.orElse(null)))
>         baseFileStatusOpt.exists(f => fileSliceFiles.add(f.getPath.getName))
>         // NOTE: This predicate is true when {@code Option} is empty
>         candidateFilesNamesOpt.forall(files => files.exists(elem => fileSliceFiles.contains(elem)))
>       })
>     }
>     totalFileSliceSize += fileSlices.size
>     candidateFileSliceSize += candidateFileSlices.size
>     (partitionOpt, candidateFileSlices)
> }
> val skippingRatio =
>   if (!areAllFileSlicesCached) -1
>   else if (getAllFiles().nonEmpty && totalFileSliceSize > 0)
>     (totalFileSliceSize - candidateFileSliceSize) / totalFileSliceSize.toDouble
>   else 0
> logInfo(s"Total file slices: $totalFileSliceSize; " +
>   s"candidate file slices after data skipping: $candidateFileSliceSize; " +
>   s"skipping percentage $skippingRatio")
> {code}
> This does nested-for-loop style processing between two Scala collections,
> candidateFilesNamesOpt and prunedPartitionsAndFileSlices, to figure out
> their overlap.
>
> In a production use case (on a 0.14.1-based branch), we saw:
> !image-2025-09-04-10-23-53-374.png!
> Both collections are of size 55.8k, so the nested loop above performs
> roughly 55.8k * 55.8k iterations, which is not trivial.
> We observed this on a cluster with 32 GB of memory and no CPU throttling
> (this logic is single-threaded anyway). In this setup the Hoodie file index
> chose COLUMN_STATS_INDEX_PROCESSING_MODE_IN_MEMORY, which means everything
> runs on the driver.
>
> That nested for loop took ~10 min to complete. During this time no Spark
> tasks run, because this is driver-only logic and no Spark work is available
> for execution. The performance impact can be even more profound if the
> auto-scaling logic of the user's Spark application chooses to release
> executors and later scale them back up; this churn makes end-to-end Spark
> application performance worse.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)