Thanks Amit, I was referring to dynamic partition pruning (
https://issues.apache.org/jira/browse/SPARK-11150) & adaptive query
execution (https://issues.apache.org/jira/browse/SPARK-31412) in Spark 3,
where it would figure out the right partitions & push the filters to the
input before applying the join.
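For example, roughly this kind of setup (a sketch only; the table and column names are placeholders, not my actual job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dpp-aqe-sketch")
  // Dynamic partition pruning (SPARK-11150); enabled by default in 3.x
  .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
  // Adaptive query execution (SPARK-31412); enabled by default from 3.2
  .config("spark.sql.adaptive.enabled", "true")
  .getOrCreate()

// Placeholder tables: a large fact table partitioned by date_key joined
// to a small, filtered dimension. DPP derives a partition filter for the
// fact side from the dimension filter before the join runs.
val facts = spark.table("sales")
val dims  = spark.table("dates").filter("year = 2020")
facts.join(dims, "date_key").explain()   // plan should show dynamicpruning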
Hi Rishi,
Maybe you have already done these steps.
Can you check the size of the dataframe you are trying to broadcast using
logInfo(SizeEstimator.estimate(df))
and adjust the driver memory accordingly?
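For example, something along these lines (a sketch; df stands in for whatever frame you plan to broadcast):

import org.apache.spark.util.SizeEstimator

// Collecting the frame first mirrors what a broadcast join makes the
// driver hold, so the estimate reflects the real driver-side footprint
// rather than just the size of the plan object.
val rows = df.collect()
val estimatedBytes = SizeEstimator.estimate(rows)
println(s"Estimated broadcast payload: $estimatedBytes bytes")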
There is one more issue which I found in Spark 2:
broadcast does not work on cached data. It is poss
Thanks Amit. I have tried increasing driver memory, and also tried increasing
the max result size returned to the driver. Nothing works; I believe Spark is
not able to determine that the result to be broadcast is small
enough because the input data is huge. When I tried this in 2 stages, write out
Hi,
I think the problem lies with driver memory. Broadcast in Spark works by
collecting all the data to the driver and then the driver broadcasting it to
all the executors. A different strategy could be employed for the transfer,
like BitTorrent, though.
Please try increasing the driver memory. See if it works.
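For example (values are only illustrative, not a recommendation):

import org.apache.spark.sql.SparkSession

// Illustrative values only. spark.driver.memory must be set before the
// driver JVM starts (spark-submit --driver-memory, or spark-defaults.conf);
// setting it on an already-running session has no effect.
val spark = SparkSession.builder()
  .config("spark.driver.memory", "8g")
  .config("spark.driver.maxResultSize", "4g")
  .getOrCreate()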
Regards