YuweiXiao opened a new issue #5107: URL: https://github.com/apache/hudi/issues/5107
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
- Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org.
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

While running a Hudi datasource write benchmark, we observed that a large amount of CPU time is spent converting the dataframe into an RDD (`HoodieSparkUtils::createRdd`). The profiling flame graph shows that around 80% of the reading time (source -> dataframe -> RDD) is spent constructing internal variables of `AvroSerializer`.

```
// Pseudocode: constructor arguments omitted
df.mapPartitions { rows =>
  val convert = new AvroSerializer()
  rows.map(r => convert(r))
}
```

The code above is a pseudocode version of the current `createRdd` implementation. At first glance, the variable `convert` appears to be initialized once per data partition, which should not cost much. However, the `AvroSerializer` source shows that it actually holds a lambda function with some variables initialized inside it. As a result, every input row triggers an almost full initialization of `AvroSerializer`.

Because `AvroSerializer` lives in the spark-avro library, it is not easy to optimize it directly in the Hudi codebase. I am wondering whether there are any workarounds, e.g., another way to convert df -> RDD, or re-implementing a better version of `AvroSerializer` in the Hudi codebase.

**To Reproduce**

NA

**Expected behavior**

`AvroSerializer` is initialized once per data partition, or even once on the driver and then serialized to the executors.

**Environment Description**

* Hudi version : master
* Spark version : spark3.2.0
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : Aliyun OSS
* Running on Docker? (yes/no) : no

**Additional context**
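To illustrate the difference in initialization cost described above, here is a minimal, Spark-free sketch. `buildConverter`, `PerRowSerializer`, `CachedSerializer`, and `convertPartition` are hypothetical names made up for this example; they are not part of spark-avro or Hudi, and the sketch only mimics the reported behavior rather than reproducing the internals of `AvroSerializer`.

```scala
// Hypothetical stand-ins only; this is NOT the spark-avro AvroSerializer API.
object SerializerInitDemo {
  // Counts how many times the "expensive" setup runs.
  var initCount = 0

  // Simulates costly setup work and returns the per-row conversion function.
  def buildConverter(): String => String = {
    initCount += 1
    (row: String) => row.toUpperCase
  }

  trait RowSerializer { def serialize(row: String): String }

  // Mimics the reported behavior: setup is repeated on every row.
  class PerRowSerializer extends RowSerializer {
    def serialize(row: String): String = buildConverter()(row)
  }

  // Expected behavior: setup runs once, when the instance is constructed.
  class CachedSerializer extends RowSerializer {
    private val converter: String => String = buildConverter()
    def serialize(row: String): String = converter(row)
  }

  // Mimics the body of mapPartitions: one serializer instance per partition.
  // The by-name parameter delays construction until after the counter reset.
  // Returns the converted rows and the number of setups that ran.
  def convertPartition(rows: Iterator[String],
                       serializer: => RowSerializer): (List[String], Int) = {
    initCount = 0
    val s = serializer
    val out = rows.map(s.serialize).toList
    (out, initCount)
  }

  def main(args: Array[String]): Unit = {
    val rows = List("a", "b", "c")
    val (_, perRow) = convertPartition(rows.iterator, new PerRowSerializer)
    val (_, cached) = convertPartition(rows.iterator, new CachedSerializer)
    println(s"per-row setups: $perRow, cached setups: $cached")
  }
}
```

With three input rows, the per-row variant runs the setup three times while the cached variant runs it once. A Hudi-side replacement serializer would presumably aim for the cached shape, assuming the setup state can be safely reused across rows.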