Yann Byron created HUDI-3214:
--------------------------------

             Summary: [UMBRELLA] optimize auto partition in spark
                 Key: HUDI-3214
                 URL: https://issues.apache.org/jira/browse/HUDI-3214
             Project: Apache Hudi
          Issue Type: Improvement
          Components: Spark Integration, Writer Core
            Reporter: Yann Byron

Currently, if a partition value has a slash-separated format like "pt1=xxxx/pt2=yyyy/pt3=zzzz", Hudi partitions the table automatically, and the table directory ends up with a multi-level partition structure. I think this behavior is unpredictable, so I am creating this umbrella task to optimize auto partitioning and make the behavior more reasonable.

Also, in Hudi 0.8 the schema holds `pt1`, `pt2`, `pt3`, but in 0.9+ it does not.

There are a few sub-tasks:
* Add a flag to control whether auto partitioning is enabled, so that the default behavior is reasonable.
* Implement a new key generator designed specifically for this scenario.
* Fix the bug where the schema differs depending on whether *hoodie.file.index.enable* is enabled in this case.

Test code:
{code:java}
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.functions.regexp_replace
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator

// Generate 10 sample trip records and load them into a DataFrame.
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

// Rewrite the three-level partition path "a/b/c" as
// "continent=a/country=b/city=c" to trigger auto partitioning.
val newDf = df.withColumn("partitionpath",
  regexp_replace($"partitionpath", "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city="))

newDf.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)
{code}
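
To make the shape of the rewritten partition values concrete, here is a minimal sketch in plain Scala (no Spark or Hudi required). It applies the same pattern and replacement as the `regexp_replace` call in the test code to one sample value, then splits the result into one (column, value) pair per directory level. The object and method names are hypothetical, the sample input is assumed to have three slash-separated levels, and the splitting step only illustrates the layout described above; it is not how Hudi itself parses partition paths.

```scala
object AutoPartitionSketch {
  // Same pattern and replacement strings as the regexp_replace call in the test code.
  private val Pattern = "(.*)(\\/){1}(.*)(\\/){1}"
  private val Replacement = "continent=$1$2country=$3$4city="

  // Rewrites a three-level path "a/b/c" into "continent=a/country=b/city=c".
  def toHiveStylePath(partitionPath: String): String =
    partitionPath.replaceAll(Pattern, Replacement)

  // Illustration only: splits "k1=v1/k2=v2/..." into (column, value) pairs,
  // one per directory level. Assumes every level contains an "=".
  def partitionLevels(hiveStylePath: String): Seq[(String, String)] =
    hiveStylePath.split("/").toSeq.map { level =>
      val Array(name, value) = level.split("=", 2)
      name -> value
    }

  def main(args: Array[String]): Unit = {
    val rewritten = toHiveStylePath("americas/united_states/san_francisco")
    println(rewritten)
    partitionLevels(rewritten).foreach { case (name, value) =>
      println(s"$name -> $value")
    }
  }
}
```

Running `AutoPartitionSketch.main(Array.empty)` prints the rewritten value followed by its three levels. The column names `continent`, `country`, and `city` play the role of `pt1`, `pt2`, `pt3` in the description.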