Hi Spark Community, I raised this question on Stack Overflow (https://stackoverflow.com/staging-ground/79182163) and am cross-posting here. Can someone help?
***********
This is more of a clarification of my understanding. I get the error "Cannot broadcast the table that is larger than 8GB". Here is my pseudocode:

    val riskDF = broadcast(someDF)           // I want riskDF broadcast as part of the joins, as it is small (< 1 GB)
    val processDF1 = ProcessAndJoin(riskDF)  // read from another source and join with riskDF
    val processDF2 = ProcessAndJoin(riskDF)  // read from another source and join with riskDF
    val processDF3 = ProcessAndJoin(riskDF)  // read from another source and join with riskDF
    // union processDF1, processDF2 and processDF3, then write the output to a bucket

I have spark.sql.adaptive.enabled=true.

Understanding & Questions

1. I thought riskDF would be broadcast only once, but based on the DAG it is being read and broadcast multiple times.

2. I believe there is a fixed 8 GB limit for a "broadcast" hint, per this code reference:
https://github.com/apache/spark/blob/d39f5ab99f67ce959b4379ecc3d6e262c10146cf/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L229
I am not sure whether the 8 GB limit applies to a single broadcast join or to the cumulative size of all broadcast joins across the entire pipeline.

3. I do not think this specific error is related to spark.sql.autoBroadcastJoinThreshold (in my case I am not setting it; it is left at the default). As per the documentation, the "broadcast" hint is independent of spark.sql.autoBroadcastJoinThreshold. Is my understanding right?
***********

Thanks.
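P.S. In case the pseudocode above is too terse, here is a minimal, self-contained sketch of the pipeline shape. The input paths, the Parquet format, the join key (risk_id), and the processAndJoin helper are simplified placeholders, not my actual code:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.broadcast

    object BroadcastRepro {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("broadcast-repro")
          .config("spark.sql.adaptive.enabled", "true")
          .getOrCreate()

        // Small (< 1 GB) lookup table, hinted for broadcast in every join below.
        val riskDF = broadcast(spark.read.parquet("s3://bucket/risk/"))

        // Stand-in for my real ProcessAndJoin: read another source and
        // join it against the broadcast-hinted riskDF.
        def processAndJoin(path: String): DataFrame =
          spark.read.parquet(path).join(riskDF, Seq("risk_id"))

        val processDF1 = processAndJoin("s3://bucket/source1/")
        val processDF2 = processAndJoin("s3://bucket/source2/")
        val processDF3 = processAndJoin("s3://bucket/source3/")

        // Union the three results and write them out; the DAG for this write
        // is where I see riskDF read and broadcast multiple times.
        processDF1.union(processDF2).union(processDF3)
          .write.mode("overwrite").parquet("s3://bucket/output/")

        spark.stop()
      }
    }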