Re: pandas_udf is very slow

2020-04-06 Thread Gourav Sengupta
Hi Leon, please refer to this link: https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html. I have found grouped map to be a bit tricky; please refer to this statement: "All data for a group is loaded into memory before the function is applied. This can lead to out of memory
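
Since the quoted note says a whole group is held in memory at once, one rough way to gauge the risk is to look at how rows are spread across groups before applying a grouped map function. The following is only a minimal sketch in plain pandas on a made-up sample; the column names "key" and "value" and the data are assumptions, not taken from this thread. On a real Spark data frame the analogous check would be a groupBy on the key followed by count().

```python
import pandas as pd

# Hypothetical sample; "key" and "value" are illustrative column names.
sample = pd.DataFrame({
    "key": ["a", "a", "a", "b", "c", "c"],
    "value": range(6),
})

# Rows per group, largest first. A single dominant group is the one
# that would have to fit in memory for a grouped map function.
group_sizes = sample.groupby("key").size().sort_values(ascending=False)

largest_key = group_sizes.index[0]
largest_rows = int(group_sizes.iloc[0])
share = largest_rows / len(sample)

print(f"largest group: {largest_key!r} with {largest_rows} rows ({share:.0%} of data)")
```

If one key holds a large share of the rows, a finer grouping key may be worth considering.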

Re: pandas_udf is very slow

2020-04-05 Thread Lian Jiang
Thanks Silvio. I need a grouped map pandas UDF, which takes a Spark data frame as input and outputs a Spark data frame with a different shape from the input. Grouped map is kind of unique to pandas UDFs, and I have trouble finding a similar non-pandas UDF for an apples-to-apples comparison. Let me kn

Re: pandas_udf is very slow

2020-04-05 Thread Silvio Fiorito
Your 2 examples are doing different things. The Pandas UDF is doing a grouped map, whereas your Python UDF is doing an aggregate. I think you want your Pandas UDF to be PandasUDFType.GROUPED_AGG? Is your result the same? From: Lian Jiang Date: Sunday, April 5, 2020 at 3:28 AM To: user Subjec
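
To make the distinction concrete, here is a minimal sketch in plain pandas (not Spark) of the two function shapes being compared. The data, the column names "id" and "v", and the helper names subtract_mean and mean_v are all hypothetical; the original poster's code is not shown in this thread. A grouped map function receives each group as a DataFrame and returns a DataFrame, while a grouped aggregate function receives a Series and returns one scalar per group.

```python
import pandas as pd

# Hypothetical data; not the original poster's.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "v": [1.0, 2.0, 3.0, 5.0, 10.0],
})


def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Grouped map shape: DataFrame in, DataFrame out (here, same row count).
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


def mean_v(v: pd.Series) -> float:
    # Grouped aggregate shape: Series in, a single scalar out.
    return float(v.mean())


# Emulate "grouped map": apply the function to each group, then concatenate.
mapped = pd.concat(
    [subtract_mean(group) for _, group in df.groupby("id")],
    ignore_index=True,
)

# Emulate "grouped aggregate": one output row per group.
aggregated = df.groupby("id")["v"].agg(mean_v).reset_index()

print(mapped)      # 5 rows, one per input row
print(aggregated)  # 2 rows, one per group
```

Because the two shapes produce different outputs and move different amounts of data, timing one against the other is not a like-for-like benchmark.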