Re: [PR] feat: Adding support to inject custom configs to parquet writer [hudi]

via GitHub Sat, 04 Apr 2026 19:04:40 -0700


yihua commented on code in PR #18379:
URL: https://github.com/apache/hudi/pull/18379#discussion_r3036288887



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -217,10 +217,14 @@ private void consumeField(String field, int index, 
Runnable writer) {
   }
 
   private void writeFields(InternalRow row, StructType schema, ValueWriter[] 
fieldWriters) {
-    for (int i = 0; i < row.numFields(); i++) {
+    for (int i = 0; i < schema.fields().length; i++) {
       int index = i;
       if (!row.isNullAt(i)) {
-        consumeField(schema.fields()[i].name(), index, () -> 
fieldWriters[index].write(row, index));
+        try {
+          consumeField(schema.fields()[i].name(), index, () -> 
fieldWriters[index].write(row, index));

Review Comment:
   🤖 This catch block catches `ClassCastException` and immediately rethrows it 
— it's a no-op. Was there intended to be additional handling (e.g., logging or 
wrapping with more context)?



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkFileWriterFactory.java:
##########
@@ -57,16 +62,32 @@ protected HoodieFileWriter newParquetFileWriter(
     if (compressionCodecName.isEmpty()) {
       compressionCodecName = null;
     }
-    HoodieRowParquetWriteSupport writeSupport = 
getHoodieRowParquetWriteSupport(storage.getConf(), schema,
-        config, enableBloomFilter(populateMetaFields, config));
+

Review Comment:
   🤖 The injector code runs after `compressionCodecName` is already read from 
the original `config` (line ~55). This means an injector that overrides 
`PARQUET_COMPRESSION_CODEC_NAME` will have no effect in the Spark path. In the 
Avro factory, `compressionCodecName` is read from `hoodieConfig` 
(post-injection), so the behavior is inconsistent across engines. Could you 
move the injector block above the `compressionCodecName` read, or re-read the 
codec from `hoodieConfig` after injection?



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -217,10 +217,14 @@ private void consumeField(String field, int index, 
Runnable writer) {
   }
 
   private void writeFields(InternalRow row, StructType schema, ValueWriter[] 
fieldWriters) {
-    for (int i = 0; i < row.numFields(); i++) {
+    for (int i = 0; i < schema.fields().length; i++) {

Review Comment:
   🤖 Changing the loop bound from `row.numFields()` to `schema.fields().length` 
seems unrelated to the config injector feature. If an InternalRow has fewer 
fields than the schema (e.g., schema evolution edge case), this would cause an 
IndexOutOfBoundsException on `row.isNullAt(i)` instead of silently skipping 
extra schema fields. Was this an intentional fix for a separate issue? If so, 
it might be worth a separate commit with its own rationale.



##########
hudi-common/src/main/java/org/apache/hudi/io/HoodieParquetConfigInjector.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.common.config.HoodieConfig;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.storage.StorageConfiguration;
+import org.apache.hudi.storage.StoragePath;
+
+/**
+ * A pluggable interface that all parquet-based writers (Spark/Flink) will 
invoke before creating write support
+ * or parquet file writer objects.
+ * <p>
+ * This allows users to inject custom configurations into the Parquet writer 
pipeline at runtime, enabling
+ * fine-grained control over Parquet file properties such as bloom filters, 
compression settings, encoding
+ * options, and other advanced Parquet configurations.
+ * <p>
+ * Implementations of this interface can modify both the storage configuration 
(e.g., Hadoop Configuration)
+ * and the Hudi-specific configuration before the Parquet writer is created.
+ * <p>
+ * Example use cases:
+ * <ul>
+ *   <li>Enabling column-specific Parquet bloom filters</li>
+ *   <li>Setting custom compression codecs per file or partition</li>
+ *   <li>Adjusting page sizes or row group sizes based on data 
characteristics</li>
+ *   <li>Injecting custom metadata into Parquet files</li>
+ * </ul>
+ *
+ * @since 1.2.0
+ */
+public interface HoodieParquetConfigInjector {
+
+  /**
+   * Injects custom configurations into the Parquet writer pipeline.
+   * <p>
+   * This method is invoked before creating the Parquet write support and 
writer objects, allowing
+   * implementations to modify both the storage-level and Hudi-level 
configurations.
+   *
+   * @param path the file path where the Parquet file will be written
+   * @param storageConf the storage configuration (e.g., Hadoop Configuration) 
that will be used by the writer
+   * @param hoodieConfig the Hudi configuration containing write settings and 
table properties
+   * @return a pair containing the potentially modified storage configuration 
and Hudi configuration.
+   *         Both configurations will be used to create the Parquet writer.
+   */
+  Pair<StorageConfiguration, HoodieConfig> withProps(StoragePath path, 
StorageConfiguration storageConf, HoodieConfig hoodieConfig);
+}

Review Comment:
   🤖 The interface contract doesn't specify whether implementations should 
create defensive copies of `storageConf` and `hoodieConfig`, or may mutate them 
in place. The test implementations (e.g., `DisableDictionaryInjector`) mutate 
the input `hoodieConfig` directly, which means the caller's original config 
object is modified as a side effect. It might be worth documenting in the 
Javadoc whether mutation is acceptable, or having the factory pass a copy to 
the injector.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat: Adding support to inject custom configs to parquet writer [hudi]

Reply via email to