Hi Spark Community,
I raised this question on Stack Overflow
(https://stackoverflow.com/staging-ground/79182163) and am cross-posting it
here. Can someone help?

***********

This is more of a clarification of my understanding. I am getting the error
"Cannot broadcast the table that is larger than 8GB".

Here is my pseudocode:

val riskDF = broadcast(someDF)          // I want riskDF broadcast as part of each join; it is small (< 1 GB)

val processDF1 = ProcessAndJoin(riskDF) // read from another source and join with riskDF
val processDF2 = ProcessAndJoin(riskDF) // read from another source and join with riskDF
val processDF3 = ProcessAndJoin(riskDF) // read from another source and join with riskDF

// union processDF1, processDF2, and processDF3, then write the output to a bucket
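
To sanity-check the "< 1 GB" assumption, here is roughly how I look at the size
estimate Spark uses when planning the broadcast (a sketch; sizeInBytes is only
Catalyst's estimate, not the exact size of the built broadcast relation):

// sketch: inspect the size Catalyst estimates for riskDF
val estimatedBytes = riskDF.queryExecution.optimizedPlan.stats.sizeInBytes
println(s"estimated size of riskDF: $estimatedBytes bytes")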

I have spark.sql.adaptive.enabled=true.
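
For question 1 below, the DAG observation comes from looking for
BroadcastExchange vs ReusedExchange nodes in the physical plan, roughly like
this (a sketch; unionDF stands for the union of the three DataFrames above):

// sketch: if the same broadcast were reused, later joins should show
// ReusedExchange instead of a fresh BroadcastExchange
val unionDF = processDF1.union(processDF2).union(processDF3)
unionDF.explain("formatted")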

Understanding & Questions

1. I thought that riskDF would be broadcast only once, but based on the DAG
   (see the explain sketch above) it is being read and broadcast multiple
   times.

2. I think there is a fixed 8GB limit that a broadcast can hit, per this code
   reference:
   https://github.com/apache/spark/blob/d39f5ab99f67ce959b4379ecc3d6e262c10146cf/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L229
   I am not sure whether the 8GB limit applies to a single broadcast join or
   to the cumulative size of all broadcast joins across the entire pipeline.

3. I do not think this specific error is related to
   spark.sql.autoBroadcastJoinThreshold (in my case I am not setting it; it is
   left at the default). As per the documentation, the "broadcast" hint is
   independent of spark.sql.autoBroadcastJoinThreshold (see the sketch after
   this list).

Is my understanding right?
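
To illustrate point 3, this is the experiment I have in mind (a sketch; bigDF,
smallDF, and the join key "id" are placeholders): even with the automatic
threshold disabled, the explicit hint should still produce a broadcast join.

import org.apache.spark.sql.functions.broadcast

// sketch: disable size-based automatic broadcasting entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// the explicit hint should still force a BroadcastHashJoin, which is why
// I don't think the threshold setting is behind the 8GB error
val joined = bigDF.join(broadcast(smallDF), "id")
joined.explain() // expect BroadcastHashJoin in the physical plan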

***********

Thanks.
