Rap70r opened a new issue #4242:
URL: https://github.com/apache/hudi/issues/4242


   Hello,
   
   We are using Spark and Hudi on EMR to upsert records, extracted from Kafka, 
into Parquet files on S3. The events can be either inserts or updates.
   The Hudi table is partitioned, and under each partition Hudi writes Parquet 
files. The problem is that these Parquet files can grow to around 100MB, and the 
Spark job takes noticeably longer to update them.
   Is it possible to split the data into multiple smaller Parquet files under 
each partition, instead of one large file, so that Spark can load and update 
them more quickly? Currently the job takes a long time to update a large Parquet 
file, whereas updating a smaller file is fast. We have also noticed that under 
some partitions Hudi creates multiple files of roughly even size, but at other 
times it merges them into one large file.
   
   **Environment Description**
   
   * Hudi version : 0.9.0
   
   * EMR version : 6.4.0
       > Master Instance: 1 r5.xlarge
       > Core Instance: 1 c5.xlarge
       > Task Instance: 25 c5.xlarge
   
   * Spark version : 3.1.2
   
   * Hive version : n/a
   
   * Hadoop version : 3.2.1
   
   * Source : Kafka
   
   * Storage : S3 (as parquet)
   
   * Partitions: 230
   
   * Parallelism: 200
   
   * Operation: Upsert
   
   * Key: Concatenation of a few fields
   
   * Partition : Concatenation of year, month and week of a date field
   
   * Storage Type: COPY_ON_WRITE
   
   * Running on Docker? : no
   
   Thank you
   
   

