[Spark SQL]: Why does the OptimizeSkewedJoin rule not optimize FullOuterJoin?

2024-07-15 Thread 王仲轩(万章)
Hi,
I am a beginner in Spark and am currently studying the Spark source code. I have a
question about the AQE rule OptimizeSkewedJoin.
I have a SQL query that uses a sort-merge full outer join (SMJ FullOuterJoin), where
there is shuffle read skew on the left side (the stats are shown below).
case (remote bytes read: total (min, med, max)):
  total: 90.5 GiB  [bytes: 97189264140]
  min:   208.5 MiB [bytes: 218673776]
  med:   210.0 MiB [bytes: 220191607]
  max:   18.1 GiB  [bytes: 19467332173]
However, the OptimizeSkewedJoin rule does not optimize FullOuterJoin. I would 
like to know the reason behind this.
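From my reading of OptimizeSkewedJoin.scala (Spark 3.x; paraphrased here, so the
exact method names may differ across versions), the rule only splits a skewed side
for a limited set of join types, and FullOuter appears in neither list:

    import org.apache.spark.sql.catalyst.plans._

    // Paraphrased sketch of the checks in OptimizeSkewedJoin: a skewed left/right
    // side is only split when the join type is in the corresponding list below.
    // FullOuter is in neither list, so a skewed full outer SMJ is left untouched.
    def canSplitLeftSide(joinType: JoinType): Boolean =
      joinType == Inner || joinType == Cross || joinType == LeftSemi ||
        joinType == LeftAnti || joinType == LeftOuter

    def canSplitRightSide(joinType: JoinType): Boolean =
      joinType == Inner || joinType == Cross || joinType == RightOuter

Is the reason that splitting one side requires replicating the other side's matching
partition, which for a full outer join would emit the other side's unmatched rows
once per split?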
Thanks.


Re: [Issue] Spark SQL - broadcast failure

2024-07-15 Thread Sudharshan V
On Mon, 8 Jul 2024, 7:53 pm Sudharshan V wrote:

> Hi all,
>
> We have been facing a weird issue lately.
> In our production codebase, we have an explicit broadcast for a small
> table. It is just a lookup table, around 1 GB in size in S3, with a few
> million records and 5 columns.
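> For reference, a minimal sketch of what the job does (the table, path, and
> column names here are placeholders, not our real ones; spark is the
> SparkSession):
>
>     import org.apache.spark.sql.functions.broadcast
>
>     // small lookup side: ~1 GB on S3, a few million rows, 5 columns
>     val lookup = spark.read.parquet("s3://bucket/path/lookup/")
>     val facts  = spark.read.parquet("s3://bucket/path/facts/")
>
>     // explicit broadcast hint on the small side
>     val joined = facts.join(broadcast(lookup), Seq("key"), "left")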
>
> The ETL was running fine, but with no change to the codebase or the
> infrastructure, we are now getting broadcast failures. Even weirder, the
> data size of the older run was 1.4 GB, while the new run is just 900 MB.
>
> Below is the error message:
> Cannot broadcast table that is larger than 8 GB : 8GB.
>
> I find it extremely weird, considering that the data size is well under
> the threshold.
>
> Are there any other ways to find out what the issue could be, and how we
> can rectify it?
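>
> For example, would checking Catalyst's size estimate for the broadcast side
> be a reasonable way to dig into this? A rough sketch of what I have in mind
> (using the placeholder names from the snippet above):
>
>     // Catalyst's size estimate for the lookup side of the join
>     println(lookup.queryExecution.optimizedPlan.stats.sizeInBytes)
>
>     // full physical plan, to confirm which side actually gets broadcast
>     joined.explain(true)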
>
> Could the data characteristics be an issue?
>
> Any help would be immensely appreciated.
>
> Thanks
>