zhangyue19921010 commented on code in PR #13674:
URL: https://github.com/apache/hudi/pull/13674#discussion_r2265813553
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -689,6 +689,15 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("When enabled, records in older schema are rewritten into newer schema during upsert,delete and background"
+ " compaction,clustering operations.");
+ public static final ConfigProperty<Boolean> FILE_STITCHING_BINARY_COPY_SCHEMA_EVOLUTION_ENABLE = ConfigProperty
+ .key("hoodie.file.stitching.binary.copy.schema.evolution.enable")
Review Comment:
1. Maybe we should move this config into the HoodieClusteringConfig class.
2. It looks like this new strategy, which groups files by schema, has better
performance. Would it be possible to make it the default behavior?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/PartitionAwareClusteringPlanStrategy.java:
##########
@@ -76,17 +77,31 @@ protected Pair<Stream<HoodieClusteringGroup>, Boolean> buildClusteringGroupsForP
long totalSizeSoFar = 0;
boolean partialScheduled = false;
+
+ // Only group by schema if schema evolution is disabled
Review Comment:
Maybe we could add a new clustering plan strategy that:
1. Uses `SparkBinaryCopyClusteringExecutionStrategy` as its execution
strategy.
2. Lets users opt in by setting `hoodie.clustering.plan.strategy.class`,
instead of having to learn a new config,
`hoodie.file.stitching.binary.copy.schema.evolution.enable`.
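To illustrate the suggestion, here is a rough sketch of what the user-facing setup might look like if the behavior were selected through the plan strategy class alone. The class name `org.apache.hudi.example.BinaryCopyClusteringPlanStrategy` is purely hypothetical (the proposed strategy does not exist yet), and `ClusteringPlanConfigSketch` is only a helper for this example; the config key is the one named above.

```java
import java.util.Properties;

class ClusteringPlanConfigSketch {
  // Hypothetical class name: the plan strategy is only proposed in this review
  // and does not exist in Hudi under this or any other name.
  static final String HYPOTHETICAL_PLAN_STRATEGY =
      "org.apache.hudi.example.BinaryCopyClusteringPlanStrategy";

  // Builds write properties that select the behavior by plan strategy class only,
  // without any dedicated schema-evolution flag.
  static Properties clusteringProps() {
    Properties props = new Properties();
    props.setProperty("hoodie.clustering.plan.strategy.class", HYPOTHETICAL_PLAN_STRATEGY);
    return props;
  }

  public static void main(String[] args) {
    System.out.println(clusteringProps());
  }
}
```

With this shape, the plan strategy itself would decide to group files by schema and to pair with the binary copy execution strategy, so users would have one fewer config to discover.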
##########
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java:
##########
@@ -239,6 +239,21 @@ public ClosableIterator<Pair<HoodieKey, Long>> fetchRecordKeysWithPositions(Hood
public MessageType readSchema(HoodieStorage storage, StoragePath parquetFilePath) {
return readMetadata(storage, parquetFilePath).getFileMetaData().getSchema();
}
+
+ /**
+ * Get the hash code of the schema from a parquet file.
+ * This is useful for quickly comparing schemas without full comparison.
+ */
+ public static Integer readSchemaHash(HoodieStorage storage, StoragePath parquetFilePath) {
Review Comment:
Is this safe if two different schemas produce the same hash (a hash collision)?
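To make the concern concrete, here is a minimal sketch, not the PR's code. Plain strings stand in for Parquet schemas, and `SchemaGroupingSketch`, `groupByHash`, and `groupBySchema` are hypothetical names. It relies on the fact that the Java strings `"Aa"` and `"BB"` have the same `hashCode()`, so grouping by hash alone would put two different "schemas" in the same group, while grouping by the schema value itself keeps them apart.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SchemaGroupingSketch {
  // Groups files by the hash code of their schema only (collision-prone).
  static Map<Integer, List<String>> groupByHash(Map<String, String> fileToSchema) {
    Map<Integer, List<String>> groups = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : fileToSchema.entrySet()) {
      groups.computeIfAbsent(e.getValue().hashCode(), k -> new ArrayList<>()).add(e.getKey());
    }
    return groups;
  }

  // Groups files by the schema itself; the map uses hashCode() to find a bucket
  // and equals() to confirm the match, so colliding schemas stay separate.
  static Map<String, List<String>> groupBySchema(Map<String, String> fileToSchema) {
    Map<String, List<String>> groups = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : fileToSchema.entrySet()) {
      groups.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
    }
    return groups;
  }

  public static void main(String[] args) {
    // "Aa" and "BB" are stand-in schemas that differ but share a hash code.
    Map<String, String> files = new LinkedHashMap<>();
    files.put("f1.parquet", "Aa");
    files.put("f2.parquet", "BB");
    System.out.println("groups by hash:   " + groupByHash(files).size());
    System.out.println("groups by schema: " + groupBySchema(files).size());
  }
}
```

If the real schema type has a reliable `equals()`, using the schema as the grouping key (or confirming with `equals()` after a hash match) would avoid binary-copying files with different schemas together.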