floriandaniel opened a new issue, #6188: URL: https://github.com/apache/hudi/issues/6188
**Problem**

I'm testing Apache Hudi's ability to make upserts faster than our current Spark-based workflow. Each record contains 40 fields. The partitioning key is `country_iso` (a string field) with 200 distinct values. The partitions are quite unbalanced (US and China have many more records than the others). The problem is that I'm getting very slow performance even with small datasets (~1 GB). I'm updating a string field that is neither the partitioning key nor the record key, and the update ratio in my upsert dataset is 100%. The cause could be the way my Parquet file is partitioned, the unbalanced partitions, the choice of partitioning key, etc.

**Environment Description**

* Hudi version : 0.11.1
* Spark version : 3.1.2-amzn-1
* Hive version :
* Hadoop version : 3.2.1 (Amazon)
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
* AWS EMR : emr-6.5.0, 1 master (r5.xlarge), 2 core nodes (r5d.2xlarge)

**Hudi Config**
```
hoodie.index.type = BLOOM/SIMPLE
hoodie.bloom.index.prune.by.ranges = false
hoodie.metadata.enable = true
hoodie.enable.data.skipping = true
hoodie.metadata.index.column.stats.enable = true
hoodie.bloom.index.use.metadata = true
```

| sample | src parquet <br> (records / size) | updates <br> (records / size) | Upsert S3, SIMPLE index <br> (time in mins) | Upsert S3, BLOOM index <br> (time in mins) |
|:----------:|:-------------:|:------:|:-------------:|:------:|
| 1 | 8.7 M records <br> (0.9 GB) | 0.35 M records <br> (0.05 GB) | 1.80 | 1.88 |
| 10 | 87 M records <br> (7.9 GB) | 3.5 M records <br> (0.55 GB) | 10.5 | 21.5 |
| 25 | 217 M records <br> (18.7 GB) | 8.7 M records <br> (1.1 GB) | 27.05 | 110.5 |

For example, for sample_10 I've got the following results:

| index_type | 2 most costly tasks |
|:----------:|:-------------|
| SIMPLE | <ul><li>Building workload profile: SIMPLE_hudi_sample_10 (countByKey at HoodieJavaPairRDD.java:104) -- 1.5 min</li><li>Doing partition and writing data: SIMPLE_hudi_sample_10 (count at HoodieSparkSqlWriter.scala:643) -- 8.1 min</li></ul> |
| BLOOM | <ul><li>Building workload profile: BLOOM_hudi_sample_10 (countByKey at HoodieJavaPairRDD.java:104) -- 13 min -- **IMAGE 1**</li><li>Doing partition and writing data: BLOOM_hudi_sample_10 (count at HoodieSparkSqlWriter.scala:643) -- 8.0 min -- **IMAGE 2**</li></ul> |

The image below shows the partition /BN, with very small Parquet files. Here is the Spark trace of an upsert with Bloom index (sample_10):

**IMAGE 1**. Building workload profile: BLOOM_hudi_sample_10 (duration: 13 min)

**IMAGE 2**. Doing partition and writing data: BLOOM_hudi_sample_10 (duration: ~8 min)
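For reference, the configuration above would be passed as write options on the Spark DataFrame. This is a hedged sketch, not my exact job: the table name, record key field, and S3 path are hypothetical placeholders, and the actual write call is shown commented out since it needs a live Spark session with the Hudi bundle on the classpath.

```python
# Sketch of an upsert using the options from this issue.
# "id" as record key, the table name, and the S3 path are assumptions
# for illustration; only the country_iso partition key and the
# hoodie.* tuning values below come from the report.
hudi_options = {
    "hoodie.table.name": "hudi_sample_10",                    # hypothetical
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",          # assumed key
    "hoodie.datasource.write.partitionpath.field": "country_iso",
    "hoodie.index.type": "BLOOM",                             # or "SIMPLE"
    "hoodie.bloom.index.prune.by.ranges": "false",
    "hoodie.metadata.enable": "true",
    "hoodie.enable.data.skipping": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
    "hoodie.bloom.index.use.metadata": "true",
}

# With a SparkSession available, the upsert itself would look like:
# (df.write.format("hudi")
#    .options(**hudi_options)
#    .mode("append")
#    .save("s3://my-bucket/hudi_sample_10"))  # hypothetical path
```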
