Re: [PR] [HUDI-9685] Merge row groups for file stitching with Parquet API and group by schema [hudi]

via GitHub Sun, 10 Aug 2025 23:49:35 -0700


zhangyue19921010 commented on code in PR #13674:
URL: https://github.com/apache/hudi/pull/13674#discussion_r2265813553



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -689,6 +689,15 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When enabled, records in older schema are rewritten into newer schema during upsert,delete and background"
           + " compaction,clustering operations.");
 
+  public static final ConfigProperty<Boolean> FILE_STITCHING_BINARY_COPY_SCHEMA_EVOLUTION_ENABLE = ConfigProperty
+      .key("hoodie.file.stitching.binary.copy.schema.evolution.enable")

Review Comment:
   1. Maybe we should move this config into the `HoodieClusteringConfig` class.
   2. It looks like this new strategy, which groups files by schema, performs better. Would it be possible to make it the default behavior?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/PartitionAwareClusteringPlanStrategy.java:
##########
@@ -76,17 +77,31 @@ protected Pair<Stream<HoodieClusteringGroup>, Boolean> buildClusteringGroupsForP
 
     long totalSizeSoFar = 0;
     boolean partialScheduled = false;
+    
+    // Only group by schema if schema evolution is disabled

Review Comment:
   Maybe we could add a new clustering plan strategy that:
   1. Uses `SparkBinaryCopyClusteringExecutionStrategy` as its execution strategy.
   2. Lets users opt in by setting `hoodie.clustering.plan.strategy.class`, instead of having to learn the new config `hoodie.file.stitching.binary.copy.schema.evolution.enable`.
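
   A minimal sketch of what opting in through the plan strategy key might look like. The strategy class name below is a hypothetical placeholder (no such class exists in the PR as quoted here); only the key `hoodie.clustering.plan.strategy.class` comes from the comment above.

```java
import java.util.Properties;

// Illustration only; the class name and helper are assumptions, not part of the PR.
class ClusteringPlanConfigSketch {

  // Hypothetical fully qualified name of a schema-grouping plan strategy.
  static final String HYPOTHETICAL_PLAN_STRATEGY =
      "org.example.SchemaGroupingBinaryCopyPlanStrategy";

  // Builds writer properties that select the plan strategy by class name,
  // so no additional feature flag has to be known by the user.
  static Properties clusteringProps() {
    Properties props = new Properties();
    props.setProperty("hoodie.clustering.plan.strategy.class", HYPOTHETICAL_PLAN_STRATEGY);
    return props;
  }
}
```

   With this shape, the plan strategy itself would decide how files are grouped, and the separate boolean config would no longer need to be exposed.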



##########
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java:
##########
@@ -239,6 +239,21 @@ public ClosableIterator<Pair<HoodieKey, Long>> fetchRecordKeysWithPositions(Hood
   public MessageType readSchema(HoodieStorage storage, StoragePath parquetFilePath) {
     return readMetadata(storage, parquetFilePath).getFileMetaData().getSchema();
   }
+  
+  /**
+   * Get the hash code of the schema from a parquet file.
+   * This is useful for quickly comparing schemas without full comparison.
+   */
+  public static Integer readSchemaHash(HoodieStorage storage, StoragePath parquetFilePath) {

Review Comment:
   Is this safe if two different schemas produce the same hash (a hash collision)?
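
   A minimal sketch of the concern, under stated assumptions: plain strings stand in for Parquet schemas, and the class and method names are hypothetical, not code from the PR. The strings `"Aa"` and `"BB"` have the same `String.hashCode()`, so grouping on the hash alone places files with different schemas in one group, while keying on the schema itself keeps them apart.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only; strings are stand-ins for real schema objects.
class SchemaGroupingSketch {

  // Groups files by schema hash only: distinct schemas with equal hashes share a group.
  static Map<Integer, List<String>> groupByHashOnly(Map<String, String> fileToSchema) {
    Map<Integer, List<String>> groups = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : fileToSchema.entrySet()) {
      groups.computeIfAbsent(e.getValue().hashCode(), k -> new ArrayList<>()).add(e.getKey());
    }
    return groups;
  }

  // Groups files by the schema value itself: the map still buckets by hashCode,
  // but uses equals() to tell colliding schemas apart.
  static Map<String, List<String>> groupBySchema(Map<String, String> fileToSchema) {
    Map<String, List<String>> groups = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : fileToSchema.entrySet()) {
      groups.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
    }
    return groups;
  }
}
```

   If the hash is kept for speed, one option would be to use it only as a first-pass filter and confirm with a full schema comparison before stitching files together.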



