floriandaniel opened a new issue, #6188:
URL: https://github.com/apache/hudi/issues/6188

   **Problem**
   I'm evaluating whether Apache Hudi can make upserts faster than our current Spark-based approach.
   Each record contains 40 fields. The partition key is country_iso (a string field) with 200 distinct values, and the partitions are quite unbalanced (US and China hold many more records than the others).
   The problem is that I'm getting very slow performance even with small datasets (~1 GB).
   I'm updating a single string field which is neither the partition key nor the record key.
   The upsert dataset consists of 100% updates (no inserts).
   
   This could come from the way my Parquet files are partitioned, from the unbalanced partitioning, or maybe I should choose another partition key ...
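   
   For reference, the upsert is issued roughly as in the sketch below (the record key, precombine field, table name, and S3 path are placeholders, not the actual ones used):
   
   ```scala
   import org.apache.spark.sql.SaveMode
   
   // Minimal sketch of the upsert write; updatesDf is the ~100%-updates DataFrame.
   updatesDf.write
     .format("hudi")
     .option("hoodie.table.name", "hudi_sample_10")                          // placeholder table name
     .option("hoodie.datasource.write.operation", "upsert")
     .option("hoodie.datasource.write.recordkey.field", "record_id")         // placeholder record key
     .option("hoodie.datasource.write.partitionpath.field", "country_iso")   // partition key described above
     .option("hoodie.datasource.write.precombine.field", "last_update_ts")   // placeholder precombine field
     .mode(SaveMode.Append)
     .save("s3://my-bucket/hudi/hudi_sample_10")                             // placeholder S3 path
   ```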
   
   **Environment Description**
   
   * Hudi version : 0.11.1
   
   * Spark version : 3.1.2-amzn-1
   
   * Hive version :
   
   * Hadoop version : 3.2.1 (Amazon)
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   * AWS EMR : emr-6.5.0, 1 master node (r5.xlarge), 2 core nodes (r5d.2xlarge)
   
   
   **Additional context**
   
   
   **Hudi Config**
   
   ```
   hoodie.index.type = BLOOM/SIMPLE
   hoodie.bloom.index.prune.by.ranges = false
   hoodie.metadata.enable = true
   hoodie.enable.data.skipping = true
   hoodie.metadata.index.column.stats.enable = true
   hoodie.bloom.index.use.metadata = true
   ```
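   
   These index settings were layered on top of the base write options; a rough sketch of how they were passed (same placeholder names as above, values shown for the BLOOM runs):
   
   ```scala
   // Index-related options from the config block above, merged into the same write.
   val indexOpts = Map(
     "hoodie.index.type"                         -> "BLOOM",   // "SIMPLE" for the SIMPLE runs
     "hoodie.bloom.index.prune.by.ranges"        -> "false",
     "hoodie.metadata.enable"                    -> "true",
     "hoodie.enable.data.skipping"               -> "true",
     "hoodie.metadata.index.column.stats.enable" -> "true",
     "hoodie.bloom.index.use.metadata"           -> "true"
   )
   
   updatesDf.write
     .format("hudi")
     .options(indexOpts)                          // on top of the base options shown earlier
     .mode(SaveMode.Append)
     .save("s3://my-bucket/hudi/hudi_sample_10")  // placeholder S3 path
   ```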
   
   | sample | src parquet <br> (nb in millions) <br> (size in GB) | Updates <br> (nb in millions) <br> (size in GB) | Upsert S3 - SIMPLE index <br> (time in mins) | Upsert S3 - BLOOM index <br> (time in mins) |
   |:----------:|:-------------:|:------:|:-------------:|:------:|
   | 1  | 8.7 M records <br> (0.9 GB)  | 0.35 M records <br> (0.05 GB) | 1.80  | 1.88  |
   | 10 | 87 M records <br> (7.9 GB)   | 3.5 M records <br> (0.55 GB)  | 10.5  | 21.5  |
   | 25 | 217 M records <br> (18.7 GB) | 8.7 M records <br> (1.1 GB)   | 27.05 | 110.5 |
   
   For example, for sample_10 I got the following results:
   
   | index_type | 2 most costly tasks |
   |:----------:|:-------------:|
   | SIMPLE | <ul><li>Building workload profile: SIMPLE_hudi_sample_10 (countByKey at HoodieJavaPairRDD.java:104) -- 1.5 min</li><li>Doing partition and writing data: SIMPLE_hudi_sample_10 (count at HoodieSparkSqlWriter.scala:643) -- 8.1 min</li></ul> |
   | BLOOM | <ul><li>Building workload profile: BLOOM_hudi_sample_10 (countByKey at HoodieJavaPairRDD.java:104) -- 13 min -- **IMAGE 1**</li><li>Doing partition and writing data: BLOOM_hudi_sample_10 (count at HoodieSparkSqlWriter.scala:643) -- 8.0 min -- **IMAGE 2**</li></ul> |
   
   The image below shows the partition /BN, which contains very small Parquet files.
   
![partition_bn](https://user-images.githubusercontent.com/32508360/180441763-2b16f072-f15f-46ca-b81b-9495ce99f9e6.JPG)
   
   Here is the Spark trace of an upsert with the BLOOM index (sample_10):
   ![trace bloom sample 10](https://user-images.githubusercontent.com/32508360/180442010-de80e309-4dd5-4a16-b73e-1c9b6e619bca.JPG)
   
   **IMAGE 1**. Building workload profile: BLOOM_hudi_sample_10 (duration: 13 min):
   ![spark 2](https://user-images.githubusercontent.com/32508360/180442419-a9b6caf7-34ea-4be0-86ca-812a1689bc19.JPG)
   
   **IMAGE 2**. Doing partition and writing data: BLOOM_hudi_sample_10 (duration: ~8 min):
   ![spark executor](https://user-images.githubusercontent.com/32508360/180442252-7b664e65-9a27-495d-a878-0e28f63c6591.JPG)
   
   

