voonhous commented on code in PR #18065:
URL: https://github.com/apache/hudi/pull/18065#discussion_r3354476787
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala:
##########
@@ -62,6 +62,18 @@ abstract class HoodieBaseHadoopFsRelationFactory(val sqlContext: SQLContext,
                                                  val schemaSpec: Option[StructType],
                                                  val isBootstrap: Boolean
                                                 ) extends SparkAdapterSupport with HoodieHadoopFsRelationFactory with Logging {
+  // Propagate Hudi's variant allow-reading-shredded config to Spark's SQLConf.
+  // ParquetToSparkSchemaConverter reads this from SQLConf.get(), so it must be set
+  // before query execution starts here during table resolution
+  if (HoodieSparkUtils.gteqSpark4_0) {
+    val sqlConf = sqlContext.sparkSession.sessionState.conf
+    val hoodieParquetAllowReadingShreddedConfKey = "hoodie.parquet.variant.allow.reading.shredded"
+    val allowReadingShredded = options.getOrElse(
+      hoodieParquetAllowReadingShreddedConfKey,
Review Comment:
The actual per-file read is already scoped on the Hadoop conf:
`Spark40ParquetReader.build` sets `VARIANT_ALLOW_READING_SHREDDED` on the
per-read `hadoopConf`, which is what `ParquetReadSupport`'s converter uses. The
session-level setting here is still needed because
`ParquetToSparkSchemaConverter` reads Spark's
`spark.sql.variant.allowReadingShredded` from `SQLConf.get()`, which is a
session-scoped flag (there is no per-relation equivalent), and Hudi defaults
shredded reading to `true`.
I have reworked the precedence so it no longer clobbers an explicitly set
Spark conf: table option > Hudi session key > existing Spark conf > Hudi
default.
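
As a rough sketch of that precedence (not the actual code in this PR): the
`ShreddedReadPrecedence` object, its `resolve` helper, and the plain `Map`
inputs are hypothetical stand-ins for the relation options and the session
conf. It also assumes "existing Spark conf" means the key was explicitly set,
which is modeled here as the key being present in the map.

```scala
object ShreddedReadPrecedence {
  // Key names as they appear in the diff and the comment above.
  val HoodieKey = "hoodie.parquet.variant.allow.reading.shredded"
  val SparkKey = "spark.sql.variant.allowReadingShredded"
  // Hudi defaults shredded reading to true.
  val HudiDefault = true

  /**
   * Hypothetical helper: picks the first explicitly-set value in the order
   * table option > Hudi session key > existing Spark conf > Hudi default.
   * Both maps contain only keys the user has explicitly set (an assumption).
   */
  def resolve(tableOptions: Map[String, String],
              sessionConf: Map[String, String]): Boolean =
    tableOptions.get(HoodieKey)
      .orElse(sessionConf.get(HoodieKey))
      .orElse(sessionConf.get(SparkKey))
      .map(_.trim.toBoolean)
      .getOrElse(HudiDefault)
}
```

With this shape, a session that explicitly sets the Spark key to `false` keeps
that value unless a table option or the Hudi session key says otherwise.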