voonhous commented on code in PR #18062:
URL: https://github.com/apache/hudi/pull/18062#discussion_r2945702524
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -139,21 +162,118 @@ public HoodieRowParquetWriteSupport(Configuration conf, StructType structType, O
HoodieSchema parsedSchema = HoodieSchema.parse(schemaString);
return HoodieSchemaUtils.addMetadataFields(parsedSchema,
config.getBooleanOrDefault(ALLOW_OPERATION_METADATA_FIELD));
});
+ // Generate shredded schema if there are shredded Variant columns
+ this.shreddedSchema = generateShreddedSchema(structType, schema);
ParquetWriteSupport.setSchema(structType, hadoopConf);
- this.rootFieldWriters = getFieldWriters(structType, schema);
+ // Use shreddedSchema for creating writers when shredded Variants are present
+ this.rootFieldWriters = getFieldWriters(shreddedSchema, schema);
this.hadoopConf = hadoopConf;
this.bloomFilterWriteSupportOpt =
bloomFilterOpt.map(HoodieBloomFilterRowWriteSupport::new);
}
+ /**
+ * Generates a shredded schema from the given structType and hoodieSchema.
+ * <p>
+ * For Variant fields that are configured for shredding (based on {@code HoodieSchema.Variant.isShredded()}), the VariantType is replaced with a shredded struct schema.
+ * <p>
+ * Shredding behavior is controlled by:
+ * <ul>
+ * <li>{@code hoodie.parquet.variant.write.shredding.enabled} - Master switch for shredding (default: true).
+ * When false, no shredding happens regardless of schema configuration.</li>
+ * <li>{@code hoodie.parquet.variant.force.shredding.schema.for.test} - When set, forces this DDL schema
+ * as the typed_value schema for ALL variant columns, overriding schema-driven shredding.</li>
+ * </ul>
+ *
+ * @param structType The original Spark StructType
+ * @param hoodieSchema The HoodieSchema containing shredding information
+ * @return A StructType with shredded Variant fields replaced by their shredded schemas
+ */
+ private StructType generateShreddedSchema(StructType structType, HoodieSchema hoodieSchema) {
Review Comment:
Without the shredding config, the original behaviour remains intact.
We recursively walk the nested structure to find Variant fields, then shred them where possible.
However, with forceShredding (used for testing in Hudi), we do not handle nesting recursively. From what I understand, this is in line with Spark's behaviour here:
https://github.com/apache/spark/blob/c6b4a8637ee3f3c2cf569522f285541cb9b71fa6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala#L425-L439
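To make the recursive-walk point concrete, here is a minimal, self-contained sketch of the idea. The `Field` model and `shred` method are hypothetical illustrations (not Hudi or Spark APIs): a recursive walk reaches Variant columns nested inside structs and replaces each with the three-field shredded group (`metadata`, `value`, `typed_value`), whereas a top-level-only pass like the forceShredding test path would miss nested Variants. The `long` typed_value is an arbitrary placeholder; in practice it is schema-driven.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ShreddingSketch {

  // Hypothetical minimal schema model, standing in for Spark's StructType/StructField.
  record Field(String name, String type, List<Field> children) {
    static Field leaf(String name, String type) {
      return new Field(name, type, List.of());
    }
  }

  // Recursively walk the schema; any "variant" field, at any nesting depth,
  // is replaced with a shredded struct of metadata/value/typed_value.
  static Field shred(Field f) {
    if ("variant".equals(f.type())) {
      return new Field(f.name(), "struct", List.of(
          Field.leaf("metadata", "binary"),
          Field.leaf("value", "binary"),
          Field.leaf("typed_value", "long"))); // placeholder; real type comes from the schema
    }
    if ("struct".equals(f.type())) {
      // Recurse into struct children so nested Variants are also shredded.
      return new Field(f.name(), "struct",
          f.children().stream().map(ShreddingSketch::shred).collect(Collectors.toList()));
    }
    return f; // non-variant leaves pass through unchanged
  }

  public static void main(String[] args) {
    // A Variant column nested one level deep inside a struct.
    Field nested = new Field("outer", "struct",
        List.of(Field.leaf("v", "variant"), Field.leaf("id", "long")));
    Field shredded = shred(nested);
    System.out.println(shredded);
  }
}
```

A top-level-only variant of this (as in the forceShredding path) would check `f.type()` on the root's immediate children without the recursive call, leaving `outer.v` untouched.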
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]