Please ignore this question. https://kontext.tech/column/spark/370/improve-pyspark-performance-using-pandas-udf-with-apache-arrow shows that a pandas UDF should avoid the JVM<->Python SerDe cost by keeping a single copy of the data in memory via Apache Arrow. However, spark.sql.execution.arrow.enabled is false by default, and I think I simply missed enabling it. Thanks. Regards.
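For anyone landing on this thread later, a minimal sketch of what the fix looks like. The group-level logic is just a plain pandas function, which Spark invokes once per group; the Spark wiring is shown in comments because it needs a live session. The column names, schema, and session here are hypothetical, and note the config key was renamed in Spark 3.x (spark.sql.execution.arrow.pyspark.enabled), while the applyInPandas call shown is the Spark 3.x API (Spark 2.4 uses @pandas_udf(..., PandasUDFType.GROUPED_MAP) instead):

```python
import pandas as pd

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Plain pandas transformation applied once per group; no Spark objects
    # cross this boundary, only Arrow record batches.
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

# Hypothetical Spark wiring (requires a running SparkSession):
#
#   spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
#   df = spark.createDataFrame([(1, 1.0), (1, 3.0), (2, 5.0)], ("id", "v"))
#   df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double")
```

With Arrow enabled, the grouped data is transferred to the Python worker as Arrow batches and converted to a pandas DataFrame without per-row pickling, which is where the SerDe savings come from.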
On Sun, Oct 4, 2020 at 10:22 AM Lian Jiang <jiangok2...@gmail.com> wrote:

> Hi,
>
> I am using the pyspark Grouped Map pandas UDF
> (https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html).
> Functionality-wise it works great. However, SerDe causes a lot of
> performance hits. To optimize this UDF, can I do either of the following:
>
> 1. Use a Java UDF to completely replace the Python Grouped Map pandas UDF.
> 2. Have the Python Grouped Map pandas UDF call a Java function internally.
>
> Which way is more promising, and how? Thanks for any pointers.
>
> Thanks
> Lian