[GitHub] [hudi] YuweiXiao opened a new issue #5107: [SUPPORT] High performance costs of AvroSerializer in Datasource writing

GitBox Wed, 23 Mar 2022 01:42:38 -0700


YuweiXiao opened a new issue #5107:
URL: https://github.com/apache/hudi/issues/5107



   **Describe the problem you faced**
   
   While benchmarking Hudi datasource writes, we observed that a large amount of 
CPU time is spent converting the dataframe into an RDD 
(`HoodieSparkUtils::createRdd`). According to the profiling flame graph, around 
80% of the reading time (source -> dataframe -> RDD) is spent constructing 
internal variables of `AvroSerializer`.
   
   ```
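   // Pseudocode: constructor arguments omitted.
   df.mapPartitions { rows =>
     val convert = new AvroSerializer() // intended to be built once per partition
     rows.map(r => convert(r))          // called once per row
   }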
   ```
   
   The code above is a pseudocode version of the current `createRdd` 
implementation. At first glance, `convert` appears to be initialized once per 
data partition, which should not cost much. However, the `AvroSerializer` source 
shows that it holds a lambda function with some variables initialized inside it. 
As a result, each input row triggers an almost full initialization of 
`AvroSerializer`.
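   
   To make the suspected pattern concrete, here is a minimal plain-Scala sketch 
with no Spark dependency. `expensiveSetup`, `slowConverter`, and `fastConverter` 
are hypothetical names invented for illustration; they are not identifiers from 
spark-avro, and the counter only stands in for the real initialization cost.
   
   ```scala
   // Hypothetical model of the pattern; this is not Spark's actual code.
   object ConverterSketch {
     var setupCount = 0

     // Stand-in for the expensive initialization of internal variables.
     def expensiveSetup(): Int => String = {
       setupCount += 1
       (i: Int) => s"record-$i"
     }

     // Slow shape: setup happens inside the returned lambda, so once per row.
     def slowConverter(): Int => String =
       (row: Int) => { val f = expensiveSetup(); f(row) }

     // Fast shape: setup happens when the converter is built, so once per partition.
     def fastConverter(): Int => String = {
       val f = expensiveSetup()
       (row: Int) => f(row)
     }

     // Converts one "partition" and reports how many times setup ran.
     def run(rows: Seq[Int], mk: () => Int => String): (Seq[String], Int) = {
       setupCount = 0
       val convert = mk()
       val out = rows.map(convert)
       (out, setupCount)
     }
   }
   ```
   
   Both shapes produce the same output; only the number of setup calls differs, 
which matches what the flame graph suggests for the per-row path.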
   
   Because `AvroSerializer` resides in the spark-avro library, it is not easy to 
optimize it directly in the Hudi codebase. Are there any workarounds for this, 
e.g., another way to convert df -> RDD, or re-implementing a better version of 
`AvroSerializer` in the Hudi codebase?
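   
   As one possible direction for a re-implementation, the sketch below resolves 
a converter per field once from the schema, so the per-row function only applies 
prebuilt converters. It is a simplified model under our own assumptions: 
`FieldType`, `Row`, and `compile` are hypothetical and are not Spark, Avro, or 
Hudi types.
   
   ```scala
   // Hypothetical simplified model; the types below are not Spark or Avro classes.
   object PrecompiledConverterSketch {
     sealed trait FieldType
     case object IntField extends FieldType
     case object StringField extends FieldType

     type Row = Vector[Any]

     // Resolve one converter per field up front, based on the schema.
     def compile(schema: Vector[FieldType]): Row => Vector[String] = {
       val fieldConverters: Vector[Any => String] = schema.map {
         case IntField    => (v: Any) => v.asInstanceOf[Int].toString
         case StringField => (v: Any) => "\"" + v.asInstanceOf[String] + "\""
       }
       // The returned function does no schema work; it only applies the converters.
       (row: Row) => row.zip(fieldConverters).map { case (v, f) => f(v) }
     }
   }
   ```
   
   With this shape, `compile` would be called once per partition (or once 
overall), and the resulting function reused for every row.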
   
   
   **To Reproduce**
   
   NA
   
   **Expected behavior**
   
   `AvroSerializer` is initialized once for each data partition, or even once 
on the driver and then serialized to the executors.
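   
   For the "create once on the driver" option, a common shape is a serializable 
wrapper whose expensive state is `@transient lazy`, so it is rebuilt at most once 
per deserialized copy. The sketch below uses plain Java serialization as a 
stand-in for shipping to executors; `LazyConverter` and `roundTrip` are 
hypothetical names, and we have not checked whether `AvroSerializer` itself can 
be wrapped this way.
   
   ```scala
   import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
   import java.io.{ObjectInputStream, ObjectOutputStream}

   // Hypothetical converter: only `prefix` is serialized with the object.
   class LazyConverter(val prefix: String) extends java.io.Serializable {
     // Not serialized; rebuilt on first use after deserialization.
     @transient private lazy val render: Int => String = {
       val p = prefix + "-"
       (i: Int) => p + i
     }
     def apply(i: Int): String = render(i)
   }

   object LazyConverterSketch {
     // Round trip through Java serialization, standing in for driver -> executor.
     def roundTrip[T <: java.io.Serializable](value: T): T = {
       val bytes = new ByteArrayOutputStream()
       val out = new ObjectOutputStream(bytes)
       out.writeObject(value)
       out.close()
       val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
       try in.readObject().asInstanceOf[T] finally in.close()
     }
   }
   ```
   
   A copy obtained through `roundTrip` behaves like the original, and its 
`render` function is initialized on the first call rather than on every row.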
   
   **Environment Description**
   
   * Hudi version : master
   
   * Spark version : 3.2.0
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : Aliyun OSS
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   
![image](https://user-images.githubusercontent.com/9959868/159657799-7910c588-ad59-41d5-b171-a2cc89333aef.png)
   
   
   


