Hi Spark Community,
I raised this question on Stack Overflow
(https://stackoverflow.com/staging-ground/79182163) and am cross-posting it
here. Can someone help?

***********

This is more of a clarification of my understanding. I am getting the error
"Cannot broadcast the table that is larger than 8GB".

Here is my pseudocode:

val riskDF = broadcast(someDF)          // I want riskDF broadcast as part of each join; it is small (< 1 GB)

val processDF1 = ProcessAndJoin(riskDF) // read from another source and join with riskDF
val processDF2 = ProcessAndJoin(riskDF) // read from another source and join with riskDF
val processDF3 = ProcessAndJoin(riskDF) // read from another source and join with riskDF

// union processDF1, processDF2, and processDF3, then write the output to a bucket
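
To sanity-check the "< 1 GB" assumption, here is roughly how I look at the size
estimate Spark uses when planning the broadcast (a sketch; sizeInBytes is only
Catalyst's estimate, not the exact size of the built broadcast relation):

// sketch: inspect the size Catalyst estimates for riskDF
val estimatedBytes = riskDF.queryExecution.optimizedPlan.stats.sizeInBytes
println(s"estimated size of riskDF: $estimatedBytes bytes")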

I have spark.sql.adaptive.enabled=true.
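
For question 1 below, the DAG observation comes from looking for
BroadcastExchange vs ReusedExchange nodes in the physical plan, roughly like
this (a sketch; unionDF stands for the union of the three DataFrames above):

// sketch: if the same broadcast were reused, later joins should show
// ReusedExchange instead of a fresh BroadcastExchange
val unionDF = processDF1.union(processDF2).union(processDF3)
unionDF.explain("formatted")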

Understanding & Questions

1. I thought that riskDF would be broadcast only once, but based on the DAG
   (see the explain sketch above) it is being read and broadcast multiple
   times.

2. I think there is a fixed 8GB limit that a broadcast can hit, per this code
   reference:
   https://github.com/apache/spark/blob/d39f5ab99f67ce959b4379ecc3d6e262c10146cf/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L229
   I am not sure whether the 8GB limit applies to a single broadcast join or
   to the cumulative size of all broadcast joins across the entire pipeline.

3. I do not think this specific error is related to
   spark.sql.autoBroadcastJoinThreshold (in my case I am not setting it; it is
   left at the default). As per the documentation, the "broadcast" hint is
   independent of spark.sql.autoBroadcastJoinThreshold (see the sketch after
   this list).

Is my understanding right?
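
To illustrate point 3, this is the experiment I have in mind (a sketch; bigDF,
smallDF, and the join key "id" are placeholders): even with the automatic
threshold disabled, the explicit hint should still produce a broadcast join.

import org.apache.spark.sql.functions.broadcast

// sketch: disable size-based automatic broadcasting entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// the explicit hint should still force a BroadcastHashJoin, which is why
// I don't think the threshold setting is behind the 8GB error
val joined = bigDF.join(broadcast(smallDF), "id")
joined.explain() // expect BroadcastHashJoin in the physical plan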

***********

Thanks.
