Davis Zhang created HUDI-9787:
---------------------------------

             Summary: HoodieFileIndex file listing perf issue
                 Key: HUDI-9787
                 URL: https://issues.apache.org/jira/browse/HUDI-9787
             Project: Apache Hudi
          Issue Type: Bug
          Components: index
            Reporter: Davis Zhang
             Fix For: 1.2.0
         Attachments: image-2025-09-04-10-23-53-374.png

org.apache.hudi.HoodieFileIndex#filterFileSlices

 
{code:java}
val prunedPartitionsAndFilteredFileSlices = prunedPartitionsAndFileSlices.map {
  case (partitionOpt, fileSlices) =>
    // Filter in candidate files based on the col-stats or record level index lookup
    val candidateFileSlices: Seq[FileSlice] = {
      fileSlices.filter(fs => {
        val fileSliceFiles = fs.getLogFiles
          .map[String](JFunction.toJavaFunction[HoodieLogFile, String](lf => lf.getPath.getName))
          .collect(Collectors.toSet[String])
        val baseFileStatusOpt = getBaseFileStatus(Option.apply(fs.getBaseFile.orElse(null)))
        baseFileStatusOpt.exists(f => fileSliceFiles.add(f.getPath.getName))
        // NOTE: This predicate is true when {@code Option} is empty
        candidateFilesNamesOpt.forall(files => files.exists(elem => fileSliceFiles.contains(elem)))
      })
    }

    totalFileSliceSize += fileSlices.size
    candidateFileSliceSize += candidateFileSlices.size
    (partitionOpt, candidateFileSlices)
}

val skippingRatio =
  if (!areAllFileSlicesCached) -1
  else if (getAllFiles().nonEmpty && totalFileSliceSize > 0)
    (totalFileSliceSize - candidateFileSliceSize) / totalFileSliceSize.toDouble
  else 0

logInfo(s"Total file slices: $totalFileSliceSize; " +
  s"candidate file slices after data skipping: $candidateFileSliceSize; " +
  s"skipping percentage $skippingRatio")
 {code}
The predicate candidateFilesNamesOpt.forall(files => files.exists(elem => fileSliceFiles.contains(elem))) effectively performs nested-loop processing between two collections, the candidate file names in candidateFilesNamesOpt and the file slices from prunedPartitionsAndFileSlices, to figure out their overlap: for every file slice, the entire candidate-name collection may be scanned.
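A minimal sketch of the access pattern and a cheaper alternative. All names here (OverlapCheck, overlapsLinear, overlapsHashed) are hypothetical illustrations, not Hudi APIs; the assumption is that candidate names arrive as a linear collection that gets rescanned per file slice:

{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OverlapCheck {
  // Current pattern: for each file slice, scan the candidate-name collection
  // until a name contained in the slice's file set is found. Worst case this
  // is O(|candidates|) per slice, i.e. O(|candidates| * |slices|) overall.
  static boolean overlapsLinear(List<String> candidateNames, Set<String> sliceFiles) {
    return candidateNames.stream().anyMatch(sliceFiles::contains);
  }

  // Alternative: hash the candidate names once (outside the per-slice loop),
  // then probe with each slice's small file set -- O(|sliceFiles|) per slice.
  static boolean overlapsHashed(Set<String> candidateNameSet, Set<String> sliceFiles) {
    for (String f : sliceFiles) {
      if (candidateNameSet.contains(f)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> candidates = List.of("a.parquet", "b.parquet");
    Set<String> sliceFiles = Set.of("b.parquet", ".b.log.1");
    Set<String> candidateSet = new HashSet<>(candidates);
    System.out.println(overlapsLinear(candidates, sliceFiles));   // true
    System.out.println(overlapsHashed(candidateSet, sliceFiles)); // true
  }
}
{code}

Both return the same answer; the difference is which side of the comparison is iterated and which is probed via a hash lookup.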

 

In a production use case (on a 0.14.1-based branch), we observed:

!image-2025-09-04-10-23-53-374.png!

That is, both collections are of size 55.8k, so the nested loop above performs on the order of 55.8k * 55.8k (roughly 3.1 billion) iterations, which is not trivial.

We observed this on a cluster with 32 GB of driver memory and no CPU throttling (the logic is single-threaded anyway). In this setup the Hudi file index chose COLUMN_STATS_INDEX_PROCESSING_MODE_IN_MEMORY, which means everything runs on the driver.
 
That nested loop took ~10 minutes to complete. During this time no Spark tasks run, because this is driver-only logic and there are no tasks available for execution. The performance impact can be even more pronounced if the user's Spark application has auto-scaling logic that decides to release executors during the idle period and later scale them back up; this churn makes end-to-end application performance worse.
 
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
