xushiyan commented on code in PR #5943:
URL: https://github.com/apache/hudi/pull/5943#discussion_r930070043
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java:
##########
@@ -130,6 +130,16 @@ public class HoodieStorageConfig extends HoodieConfig {
.defaultValue("TIMESTAMP_MICROS")
.withDocumentation("Sets spark.sql.parquet.outputTimestampType. Parquet timestamp type to use when Spark writes data to Parquet files.");
+  // SPARK-38094 Spark 3.3 checks if this field is enabled. Hudi has to provide this or there would be NPE thrown
+  // Would ONLY be effective with Spark 3.3+
+  // default value is true which is in accordance with Spark 3.3
+  public static final ConfigProperty<String> PARQUET_FIELD_ID_WRITE_ENABLED = ConfigProperty
+      .key("hoodie.parquet.fieldId.write.enabled")
Review Comment:
```suggestion
.key("hoodie.parquet.field_id.write.enabled")
```
##########
pom.xml:
##########
@@ -1307,6 +1308,7 @@
<version>${maven-surefire-plugin.version}</version>
<configuration combine.self="append">
<skip>${skipUTs}</skip>
+ <trimStackTrace>false</trimStackTrace>
Review Comment:
   This is also set in:
       <pluginManagement>
         <plugins>
           <plugin>
   Is it not effective there?
##########
hudi-examples/hudi-examples-spark/pom.xml:
##########
@@ -190,6 +190,12 @@
<artifactId>spark-sql_${scala.binary.version}</artifactId>
</dependency>
+ <!-- Hadoop -->
+ <dependency>
+ <groupId>org.apache.hadoop</groupId>
+ <artifactId>hadoop-auth</artifactId>
+ </dependency>
+
Review Comment:
Good find. So can we now re-enable the Spark 3.2 quickstart test in the GitHub Actions workflow? Check out bot.yml.
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/bootstrap/HoodieSparkBootstrapSchemaProvider.java:
##########
@@ -71,11 +72,20 @@ protected Schema getBootstrapSourceSchema(HoodieEngineContext context, List<Pair
   }
   private static Schema getBootstrapSourceSchemaParquet(HoodieWriteConfig writeConfig, HoodieEngineContext context, Path filePath) {
-    MessageType parquetSchema = new ParquetUtils().readSchema(context.getHadoopConf().get(), filePath);
+    Configuration hadoopConf = context.getHadoopConf().get();
+    MessageType parquetSchema = new ParquetUtils().readSchema(hadoopConf, filePath);
+
+ hadoopConf.set(
+ SQLConf.PARQUET_BINARY_AS_STRING().key(),
+ SQLConf.PARQUET_BINARY_AS_STRING().defaultValueString());
+ hadoopConf.set(
+ SQLConf.PARQUET_INT96_AS_TIMESTAMP().key(),
+ SQLConf.PARQUET_INT96_AS_TIMESTAMP().defaultValueString());
+ hadoopConf.set(
+ SQLConf.CASE_SENSITIVE().key(),
+ SQLConf.CASE_SENSITIVE().defaultValueString());
Review Comment:
Don't you want to set those only when they're not already set?
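Hadoop's `Configuration` exposes `setIfUnset(String, String)` for exactly this. A minimal sketch of the set-if-absent pattern — using a plain `Map` as a stand-in for the Hadoop `Configuration` so it runs standalone; the class name and keys here are just illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class SetIfUnsetSketch {

    // Stand-in for Hadoop's Configuration.setIfUnset(key, value):
    // apply the default only when the user has not already set the key.
    static void setIfUnset(Map<String, String> conf, String key, String value) {
        conf.putIfAbsent(key, value);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Simulate a user-supplied override present before defaults are applied.
        conf.put("spark.sql.caseSensitive", "true");

        // The user's value is preserved; missing keys fall back to the default.
        setIfUnset(conf, "spark.sql.caseSensitive", "false");
        setIfUnset(conf, "spark.sql.parquet.binaryAsString", "false");

        System.out.println(conf.get("spark.sql.caseSensitive"));          // true
        System.out.println(conf.get("spark.sql.parquet.binaryAsString")); // false
    }
}
```

In the PR itself this would presumably amount to replacing each `hadoopConf.set(...)` with `hadoopConf.setIfUnset(SQLConf.CASE_SENSITIVE().key(), ...)` and so on, so explicit user settings are not clobbered.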
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]