[ 
https://issues.apache.org/jira/browse/HUDI-9787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18018187#comment-18018187
 ] 

Timothy Brown commented on HUDI-9787:
-------------------------------------

This looks like it is already fixed in the master branch: 
https://github.com/apache/hudi/pull/12125/files

> HoodieFileIndex file listing perf issue
> ---------------------------------------
>
>                 Key: HUDI-9787
>                 URL: https://issues.apache.org/jira/browse/HUDI-9787
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>            Reporter: Davis Zhang
>            Priority: Major
>             Fix For: 1.2.0
>
>         Attachments: image-2025-09-04-10-23-53-374.png
>
>
> org.apache.hudi.HoodieFileIndex#filterFileSlices
>  
> {code:java}
> val prunedPartitionsAndFilteredFileSlices = prunedPartitionsAndFileSlices.map {
>   case (partitionOpt, fileSlices) =>
>     // Filter in candidate files based on the col-stats or record level index lookup
>     val candidateFileSlices: Seq[FileSlice] = {
>       fileSlices.filter(fs => {
>         val fileSliceFiles = fs.getLogFiles
>           .map[String](JFunction.toJavaFunction[HoodieLogFile, String](lf => lf.getPath.getName))
>           .collect(Collectors.toSet[String])
>         val baseFileStatusOpt = getBaseFileStatus(Option.apply(fs.getBaseFile.orElse(null)))
>         baseFileStatusOpt.exists(f => fileSliceFiles.add(f.getPath.getName))
>         // NOTE: This predicate is true when {@code Option} is empty
>         candidateFilesNamesOpt.forall(files => files.exists(elem => fileSliceFiles.contains(elem)))
>       })
>     }
>     totalFileSliceSize += fileSlices.size
>     candidateFileSliceSize += candidateFileSlices.size
>     (partitionOpt, candidateFileSlices)
> }
> val skippingRatio =
>   if (!areAllFileSlicesCached) -1
>   else if (getAllFiles().nonEmpty && totalFileSliceSize > 0)
>     (totalFileSliceSize - candidateFileSliceSize) / totalFileSliceSize.toDouble
>   else 0
> logInfo(s"Total file slices: $totalFileSliceSize; " +
>   s"candidate file slices after data skipping: $candidateFileSliceSize; " +
>   s"skipping percentage $skippingRatio")
>  {code}
> This does nested-loop-style processing over two Scala collections, 
> candidateFilesNamesOpt and prunedPartitionsAndFileSlices, to find their 
> overlap.
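>  
> One possible way to avoid the quadratic scan, as a minimal sketch rather than 
> the actual fix: hash the candidate file names once, then probe that set with 
> the few file names of each slice. SliceFiles, OverlapSketch, filterQuadratic 
> and filterHashed are hypothetical names used only for illustration, and plain 
> strings stand in for the Hudi FileSlice and HoodieLogFile types.
> {code:scala}
> // Hypothetical stand-in for a file slice: only its file names matter here.
> final case class SliceFiles(fileId: String, fileNames: Seq[String])
>
> object OverlapSketch {
>   // Shape of the code above: for each slice, scan the candidate names
>   // until one matches, so the cost grows with slices * candidates.
>   def filterQuadratic(slices: Seq[SliceFiles],
>                       candidatesOpt: Option[Seq[String]]): Seq[SliceFiles] =
>     slices.filter { s =>
>       val names = s.fileNames.toSet
>       candidatesOpt.forall(cands => cands.exists(names.contains))
>     }
>
>   // Sketch: build a hash set of candidates once, then do one lookup per
>   // file name in the slice instead of one scan over all candidates.
>   def filterHashed(slices: Seq[SliceFiles],
>                    candidatesOpt: Option[Seq[String]]): Seq[SliceFiles] = {
>     val candidateSetOpt: Option[Set[String]] = candidatesOpt.map(_.toSet)
>     slices.filter { s =>
>       // As in the original predicate, an empty Option keeps every slice.
>       candidateSetOpt.forall(cands => s.fileNames.exists(cands.contains))
>     }
>   }
> }
> {code}
> Both functions keep the same slices; only the lookup direction differs, so 
> the per-slice work depends on the number of files in the slice rather than 
> on the number of candidates.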
>  
> In a production use case (on a 0.14.1-based branch), we saw this:
> !image-2025-09-04-10-23-53-374.png!
> This means both lists have about 55.8k entries. The loop above can therefore 
> run up to 55.8k * 55.8k (roughly 3.1 billion) iterations, which is not 
> trivial.
> We observed this on a cluster with 32 GB of memory and no CPU throttling 
> (this logic is single-threaded anyway). In this setup, HoodieFileIndex 
> chooses COLUMN_STATS_INDEX_PROCESSING_MODE_IN_MEMORY, which means everything 
> runs on the driver.
>  
> The nested loop took ~10 min to complete. During this time no Spark tasks 
> run, because this is driver-only logic and there are no tasks available for 
> the executors. The impact could be worse if the autoscaling logic of the 
> user's Spark application releases the idle executors and later scales them 
> back up; that churn would further degrade the end-to-end performance of the 
> Spark application.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
