Re: [PR] [HUDI-7219] Add caching support for HFileBlocks [hudi]

via GitHub Mon, 15 Sep 2025 09:46:54 -0700


yihua commented on code in PR #13724:
URL: https://github.com/apache/hudi/pull/13724#discussion_r2349559857



##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java:
##########
@@ -89,4 +89,32 @@ public class HoodieReaderConfig extends HoodieConfig {
       "hoodie.write.record.merge.custom.implementation.classes";
   public static final String RECORD_MERGE_IMPL_CLASSES_DEPRECATED_WRITE_CONFIG_KEY =
       "hoodie.datasource.write.record.merger.impls";
+
+  public static final ConfigProperty<Boolean> HFILE_BLOCK_CACHE_ENABLED = ConfigProperty
+      .key("hoodie.hfile.block.cache.enabled")
+      .defaultValue(false)
+      .markAdvanced()
+      .sinceVersion("1.1.0")
+      .withDocumentation("Enable HFile block-level caching for metadata files. This caches frequently "
+          + "accessed HFile blocks in memory to reduce I/O operations during metadata queries. "
+          + "Improves performance for workloads with repeated metadata access patterns.");
+
+  public static final ConfigProperty<Integer> HFILE_BLOCK_CACHE_SIZE = ConfigProperty
+      .key("hoodie.hfile.block.cache.size")
+      .defaultValue(100)

Review Comment:
   nit (non-blocking): is it possible to control the overall size of the cached
blocks in memory instead of the number of blocks? The size of HFile blocks can
differ depending on the MDT partition and the write config with which the
HFile is written.
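
To illustrate the suggestion, here is a minimal sketch of an LRU cache bounded by total bytes rather than by block count. It is not the PR's implementation: the class name `ByteBoundedBlockCache`, the string keys, and the raw `byte[]` values are all assumptions made for illustration, and only the JDK's `LinkedHashMap` is used.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: an LRU cache capped by total cached bytes, not entry count. */
final class ByteBoundedBlockCache {
  private final long maxBytes;
  private long currentBytes = 0;
  // accessOrder=true keeps iteration order from least to most recently used.
  private final LinkedHashMap<String, byte[]> blocks =
      new LinkedHashMap<>(16, 0.75f, true);

  ByteBoundedBlockCache(long maxBytes) {
    if (maxBytes <= 0) {
      throw new IllegalArgumentException("maxBytes must be positive");
    }
    this.maxBytes = maxBytes;
  }

  /** Returns the cached block, or null; a hit marks the entry as recently used. */
  synchronized byte[] get(String key) {
    return blocks.get(key);
  }

  synchronized void put(String key, byte[] block) {
    // Drop any previous value first so the byte accounting stays exact.
    byte[] previous = blocks.remove(key);
    if (previous != null) {
      currentBytes -= previous.length;
    }
    if (block.length > maxBytes) {
      return; // A block larger than the whole budget is not cached at all.
    }
    blocks.put(key, block);
    currentBytes += block.length;
    // Evict least recently used entries until the cache fits the byte budget.
    Iterator<Map.Entry<String, byte[]>> it = blocks.entrySet().iterator();
    while (currentBytes > maxBytes && it.hasNext()) {
      Map.Entry<String, byte[]> eldest = it.next();
      currentBytes -= eldest.getValue().length;
      it.remove();
    }
  }

  synchronized long sizeInBytes() {
    return currentBytes;
  }

  synchronized int blockCount() {
    return blocks.size();
  }
}
```

With a byte budget, a few large blocks and many small blocks consume the same amount of memory, so the footprint stays predictable regardless of the block size configured at write time.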



##########
hudi-common/src/main/java/org/apache/hudi/io/storage/HFileReaderFactory.java:
##########
@@ -55,9 +58,34 @@ public HFileReaderFactory(HoodieStorage storage,
   public HFileReader createHFileReader() throws IOException {
     final long fileSize = determineFileSize();
     final SeekableDataInputStream inputStream = createInputStream(fileSize);
+    
+    if (shouldEnableBlockCaching()) {
+      HFileReaderConfig config = createHFileReaderConfig();
+      String filePath = getFilePath();
+      return new CachingHFileReaderImpl(inputStream, fileSize, filePath, config);
+    }
+    
     return new HFileReaderImpl(inputStream, fileSize);
   }
 
+  private boolean shouldEnableBlockCaching() {
+    return metadataConfig.getHFileBlockCacheEnabled();
+  }
+
+  private HFileReaderConfig createHFileReaderConfig() {
+    int blockCacheSize = metadataConfig.getHFileBlockCacheSize();
+    int cacheTtlMinutes = metadataConfig.getHFileBlockCacheTTLMinutes();
+    return new HFileReaderConfig(blockCacheSize, cacheTtlMinutes);
+  }
+
+  private String getFilePath() {
+    if (fileSource.isLeft()) {
+      return fileSource.asLeft().toString();
+    }
+    // For byte array content, use a hash-based identifier
+    return "bytes:" + Arrays.hashCode(fileSource.asRight());

Review Comment:
   I think we should probably use the log file name and block sequence as the
identifier, as a follow-up, to avoid the overhead of hashing the byte array;
but if the overhead is low, it's OK for now.
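
For context, `Arrays.hashCode(byte[])` reads every byte of the array, and two different contents can produce the same 32-bit hash. A minimal sketch of the alternative follows, assuming the caller already knows the log file name and the block's sequence number. The class `LogBlockCacheKey` and its fields are hypothetical and do not appear in the PR.

```java
import java.util.Objects;

/** Hypothetical cache key built from metadata rather than from the block's content. */
final class LogBlockCacheKey {
  private final String logFileName;
  private final long blockSequence;

  LogBlockCacheKey(String logFileName, long blockSequence) {
    this.logFileName = Objects.requireNonNull(logFileName, "logFileName");
    this.blockSequence = blockSequence;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof LogBlockCacheKey)) {
      return false;
    }
    LogBlockCacheKey that = (LogBlockCacheKey) other;
    return blockSequence == that.blockSequence && logFileName.equals(that.logFileName);
  }

  @Override
  public int hashCode() {
    // Cost depends only on the file name and one long, not on the block size.
    return Objects.hash(logFileName, blockSequence);
  }

  @Override
  public String toString() {
    return logFileName + "#" + blockSequence;
  }
}
```

Such a key costs the same to compute whatever the block size, and distinct blocks cannot share a key as long as the (file name, sequence) pair is unique.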



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
