[jira] [Created] (HUDI-3214) [UMBRELLA] optimize auto partition in spark

Yann Byron (Jira) Tue, 11 Jan 2022 06:33:13 -0800

Yann Byron created HUDI-3214:
--------------------------------

             Summary: [UMBRELLA] optimize auto partition in spark
                 Key: HUDI-3214
                 URL: https://issues.apache.org/jira/browse/HUDI-3214
             Project: Apache Hudi
          Issue Type: Improvement
          Components: Spark Integration, Writer Core
            Reporter: Yann Byron



Currently, if a partition value has a format like "pt1=xxxx/pt2=yyyy/pt3=zzzz", 
i.e. segments separated by slashes, Hudi partitions on it automatically. The 
table directory then has a multi-level partition structure.
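
To illustrate the behavior described above, here is a minimal sketch in plain Scala of how a single partition value containing slashes maps to nested directory levels, and how each `key=value` level can be read back as a column. It is not Hudi's actual code: `PartitionPathSketch`, `directoryLevels` and `hiveStyleColumns` are hypothetical names used only for illustration.

```scala
// Hypothetical illustration only; this is not Hudi's implementation.
object PartitionPathSketch {
  // Each non-empty slash-separated segment becomes one directory level.
  def directoryLevels(partitionValue: String): Seq[String] =
    partitionValue.split("/").filter(_.nonEmpty).toSeq

  // Read Hive-style `key=value` segments back as (column, value) pairs.
  // Segments that contain no '=' are skipped.
  def hiveStyleColumns(partitionValue: String): Seq[(String, String)] =
    directoryLevels(partitionValue).flatMap { segment =>
      segment.split("=", 2) match {
        case Array(key, value) => Some(key -> value)
        case _                 => None
      }
    }
}
```

For the value "pt1=xxxx/pt2=yyyy/pt3=zzzz", `directoryLevels` yields three levels, and `hiveStyleColumns` yields the pairs for `pt1`, `pt2` and `pt3`.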

I think this is unpredictable, so I am creating this umbrella task to optimize 
auto partitioning and make the behavior more reasonable.

Also, in Hudi 0.8 the schema holds `pt1`, `pt2`, `pt3`, but in 0.9+ it does not.

There are a few sub-tasks:
 * add a flag to control whether auto-partition is enabled, so that the default 
behavior is reasonable.
 * implement a new key generator designed specifically for this scenario.
 * fix the bug where the schema differs depending on whether 
*hoodie.file.index.enable* is enabled in this case.
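
As a rough sketch of what a key generator for this scenario might compute, the partition path could be built explicitly from separate columns instead of relying on slashes embedded in a single value. This is plain Scala under assumed names: `KeyGeneratorSketch`, `hiveStylePartitionPath` and its parameters are hypothetical and do not implement Hudi's key generator interface.

```scala
// Hypothetical sketch; does not extend or call any Hudi class.
object KeyGeneratorSketch {
  // Build a Hive-style partition path from the named fields of one record.
  // Throws NoSuchElementException if a partition field is missing from the record.
  def hiveStylePartitionPath(record: Map[String, String],
                             partitionFields: Seq[String]): String =
    partitionFields.map(field => s"$field=${record(field)}").mkString("/")
}
```

With this shape the partition columns and their order are stated explicitly by the caller, not inferred from the content of a string value.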

 

Test code:
{code:java}
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.functions.regexp_replace // used below to rewrite partitionpath

val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator
val inserts = convertToStringList(dataGen.generateInserts(10))

val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
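// Rewrite e.g. "americas/brazil/sao_paulo" to
// "continent=americas/country=brazil/city=sao_paulo"
val newDf = df.withColumn("partitionpath", regexp_replace($"partitionpath",
  "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city="))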

newDf.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath) {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
