subject:"Re\: PyArrow Exception in Pandas UDF GROUPEDAGG\(\)"

RE: PyArrow Exception in Pandas UDF GROUPEDAGG()

2020-05-07 Thread Gautham Acharya

() function takes a numPartitions column. What other options can I explore? --gautham -Original Message- From: ZHANG Wei Sent: Thursday, May 7, 2020 1:34 AM To: Gautham Acharya Cc: user@spark.apache.org Subject: Re: PyArrow Exception in Pandas UDF GROUPEDAGG() CAUTION: This email origina

Re: PyArrow Exception in Pandas UDF GROUPEDAGG()

2020-05-07 Thread ZHANG Wei

AFAICT, there might be data skews, some partitions got too much rows, which caused out of memory limitation. Trying .groupBy().count() or .aggregateByKey().count() may help check each partition data size. If no data skew, to increase .groupBy() parameter `numPartitions` is worth a try. -- Cheers,