[ https://issues.apache.org/jira/browse/HUDI-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sivabalan narayanan updated HUDI-3214:
--------------------------------------
    Sprint: Cont' improve - 2022/02/07

> [UMBRELLA] optimize auto partition in spark
> -------------------------------------------
>
>                 Key: HUDI-3214
>                 URL: https://issues.apache.org/jira/browse/HUDI-3214
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: spark, writer-core
>            Reporter: Yann Byron
>            Assignee: Yann Byron
>            Priority: Critical
>             Fix For: 0.11.0
>
>
> Currently, if a partition value has a format like
> "pt1=xxxx/pt2=yyyy/pt3=zzzz", i.e. segments separated by slashes, Hudi
> partitions on it automatically, and the table directory ends up with a
> multi-level partition structure.
> I think this behavior is unpredictable, so I am creating this umbrella task
> to optimize auto partitioning and make the behavior more reasonable.
> Also, in Hudi 0.8 the schema holds `pt1`, `pt2`, and `pt3`, but in 0.9+ it
> does not.
> There are a few subtasks:
>  * add a flag that controls whether auto partitioning is enabled, so that
> the default behavior is reasonable.
>  * implement a new key generator designed specifically for this scenario.
>  * fix the bug where the schema differs depending on whether
> *hoodie.file.index.enable* is enabled in this case.
>
> Test code:
> {code:java}
> // Run in spark-shell, which provides `spark` and `spark.implicits._`.
> import org.apache.hudi.QuickstartUtils._
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.spark.sql.functions.regexp_replace
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
>
> val tableName = "hudi_trips_cow"
> val basePath = "file:///tmp/hudi_trips_cow"
> val dataGen = new DataGenerator
> val inserts = convertToStringList(dataGen.generateInserts(10))
> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
>
> // Rewrite "a/b/c" into "continent=a/country=b/city=c".
> val newDf = df.withColumn("partitionpath", regexp_replace($"partitionpath",
>   "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city="))
>
> // Write the table, using the rewritten column as the partition path.
> newDf.write.format("hudi").
>   options(getQuickstartWriteConfigs).
>   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>   option(TABLE_NAME, tableName).
>   mode(Overwrite).
>   save(basePath)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
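
For readers without a Spark session at hand, the partition path rewrite in the test code can be reproduced in plain Scala. This is only a sketch. It reuses the pattern and replacement string from the issue, but the object and helper names are hypothetical, and the sample input "americas/united_states/san_francisco" is assumed to be the kind of three-segment path the quickstart data generator produces. It shows the rewritten value and the slash-separated levels that would become nested partition directories. It does not call Hudi and does not reproduce Hudi's own partitioning logic.

```scala
// Plain-Scala sketch (no Spark or Hudi required) of the partition path rewrite.
object PartitionPathSketch {
  // Same pattern and replacement as the regexp_replace call in the test code.
  val pattern = "(.*)(\\/){1}(.*)(\\/){1}"
  val replacement = "continent=$1$2country=$3$4city="

  // Hypothetical helper: rewrites "a/b/c" into "continent=a/country=b/city=c".
  // String.replaceAll uses java.util.regex and replaces every match.
  def toKeyValuePath(path: String): String =
    path.replaceAll(pattern, replacement)

  // Hypothetical helper: splits a path on slashes into (column, value) pairs.
  // Assumes every segment has the form "key=value".
  def partitionLevels(path: String): Seq[(String, String)] =
    path.split("/").toSeq.map { segment =>
      val Array(key, value) = segment.split("=", 2)
      key -> value
    }

  def main(args: Array[String]): Unit = {
    // Assumed sample value for the "partitionpath" column.
    val rewritten = toKeyValuePath("americas/united_states/san_francisco")
    println(rewritten)
    // prints: continent=americas/country=united_states/city=san_francisco

    // Each pair corresponds to one directory level under the table base path.
    partitionLevels(rewritten).foreach { case (k, v) => println(s"$k -> $v") }
  }
}
```

The three pairs printed at the end correspond to the `pt1`, `pt2`, `pt3` style columns the issue mentions: one string column in the input ends up as a three-level directory layout.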