yihua commented on code in PR #18379:
URL: https://github.com/apache/hudi/pull/18379#discussion_r3036288887
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -217,10 +217,14 @@ private void consumeField(String field, int index,
Runnable writer) {
}
private void writeFields(InternalRow row, StructType schema, ValueWriter[]
fieldWriters) {
- for (int i = 0; i < row.numFields(); i++) {
+ for (int i = 0; i < schema.fields().length; i++) {
int index = i;
if (!row.isNullAt(i)) {
- consumeField(schema.fields()[i].name(), index, () ->
fieldWriters[index].write(row, index));
+ try {
+ consumeField(schema.fields()[i].name(), index, () ->
fieldWriters[index].write(row, index));
Review Comment:
🤖 This catch block catches `ClassCastException` and immediately rethrows it
— it's a no-op. Was there intended to be additional handling (e.g., logging or
wrapping with more context)?
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkFileWriterFactory.java:
##########
@@ -57,16 +62,32 @@ protected HoodieFileWriter newParquetFileWriter(
if (compressionCodecName.isEmpty()) {
compressionCodecName = null;
}
- HoodieRowParquetWriteSupport writeSupport =
getHoodieRowParquetWriteSupport(storage.getConf(), schema,
- config, enableBloomFilter(populateMetaFields, config));
+
Review Comment:
🤖 The injector code runs after `compressionCodecName` is already read from
the original `config` (line ~55). This means an injector that overrides
`PARQUET_COMPRESSION_CODEC_NAME` will have no effect in the Spark path. In the
Avro factory, `compressionCodecName` is read from `hoodieConfig`
(post-injection), so the behavior is inconsistent across engines. Could you
move the injector block above the `compressionCodecName` read, or re-read the
codec from `hoodieConfig` after injection?
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -217,10 +217,14 @@ private void consumeField(String field, int index,
Runnable writer) {
}
private void writeFields(InternalRow row, StructType schema, ValueWriter[]
fieldWriters) {
- for (int i = 0; i < row.numFields(); i++) {
+ for (int i = 0; i < schema.fields().length; i++) {
Review Comment:
🤖 Changing the loop bound from `row.numFields()` to `schema.fields().length`
seems unrelated to the config injector feature. If an InternalRow has fewer
fields than the schema (e.g., schema evolution edge case), this would cause an
IndexOutOfBoundsException on `row.isNullAt(i)` instead of silently skipping
extra schema fields. Was this an intentional fix for a separate issue? If so,
it might be worth a separate commit with its own rationale.
##########
hudi-common/src/main/java/org/apache/hudi/io/HoodieParquetConfigInjector.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.common.config.HoodieConfig;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.storage.StorageConfiguration;
+import org.apache.hudi.storage.StoragePath;
+
+/**
+ * A pluggable interface that all parquet-based writers (Spark/Flink) will
invoke before creating write support
+ * or parquet file writer objects.
+ * <p>
+ * This allows users to inject custom configurations into the Parquet writer
pipeline at runtime, enabling
+ * fine-grained control over Parquet file properties such as bloom filters,
compression settings, encoding
+ * options, and other advanced Parquet configurations.
+ * <p>
+ * Implementations of this interface can modify both the storage configuration
(e.g., Hadoop Configuration)
+ * and the Hudi-specific configuration before the Parquet writer is created.
+ * <p>
+ * Example use cases:
+ * <ul>
+ * <li>Enabling column-specific Parquet bloom filters</li>
+ * <li>Setting custom compression codecs per file or partition</li>
+ * <li>Adjusting page sizes or row group sizes based on data
characteristics</li>
+ * <li>Injecting custom metadata into Parquet files</li>
+ * </ul>
+ *
+ * @since 1.2.0
+ */
+public interface HoodieParquetConfigInjector {
+
+ /**
+ * Injects custom configurations into the Parquet writer pipeline.
+ * <p>
+ * This method is invoked before creating the Parquet write support and
writer objects, allowing
+ * implementations to modify both the storage-level and Hudi-level
configurations.
+ *
+ * @param path the file path where the Parquet file will be written
+ * @param storageConf the storage configuration (e.g., Hadoop Configuration)
that will be used by the writer
+ * @param hoodieConfig the Hudi configuration containing write settings and
table properties
+ * @return a pair containing the potentially modified storage configuration
and Hudi configuration.
+ * Both configurations will be used to create the Parquet writer.
+ */
+ Pair<StorageConfiguration, HoodieConfig> withProps(StoragePath path,
StorageConfiguration storageConf, HoodieConfig hoodieConfig);
+}
Review Comment:
🤖 The interface contract doesn't specify whether implementations should
create defensive copies of `storageConf` and `hoodieConfig`, or may mutate them
in place. The test implementations (e.g., `DisableDictionaryInjector`) mutate
the input `hoodieConfig` directly, which means the caller's original config
object is modified as a side effect. It might be worth documenting in the
Javadoc whether mutation is acceptable, or having the factory pass a copy to
the injector.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]