We removed the explicit broadcast for that particular table, and the job took
longer since the join type changed from BHJ (broadcast hash join) to SMJ
(sort-merge join).
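
For reference, the physical join strategy can be confirmed directly from the
plan. A minimal sketch in Scala, where "joined" is a hypothetical DataFrame
standing in for the join in question:

  joined.explain()
  // prints the physical plan; look for BroadcastHashJoin vs. SortMergeJoin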

I wanted to understand how I can find out what went wrong with the broadcast.
How do I know the size of the table inside Spark memory?

I have tried caching the table, hoping I could see the table size in the
Storage tab of the Spark UI on EMR.

But I see no data there.
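
One thing worth checking: caching is lazy, so the Storage tab stays empty
until an action actually materializes the cache. A minimal sketch in Scala,
with hypothetical names (lookupDf and the S3 path stand in for the real
table):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("size-check").getOrCreate()
  val lookupDf = spark.read.parquet("s3://my-bucket/path/to/lookup_table")

  lookupDf.cache()
  lookupDf.count() // forces materialization; size then appears in the Storage tab

  // Spark's own size estimate for the optimized plan, which the planner
  // consults when deciding whether a relation is small enough to broadcast:
  val bytes = lookupDf.queryExecution.optimizedPlan.stats.sizeInBytes
  println(s"Estimated size: ${bytes / (1024L * 1024L)} MB")

As far as I understand, the deserialized in-memory size can be several times
the compressed Parquet size on S3, and it is the in-memory size that the 8 GB
broadcast cap is compared against.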

Thanks

On Tue, 23 Jul, 2024, 12:48 pm Sudharshan V, <sudharshanv2...@gmail.com>
wrote:

> Hi all, apologies for the delayed response.
>
> We are using Spark version 3.4.1 in our jar and the EMR 6.11 runtime.
>
> We have always kept auto broadcast disabled and instead broadcast the smaller
> tables using explicit broadcast hints.
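>
> To make that setup concrete, a minimal sketch of the pattern in Scala
> (names are illustrative, not our actual code):
>
>   // disable automatic broadcast selection
>   spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
>
>   import org.apache.spark.sql.functions.broadcast
>   // explicitly broadcast the small lookup table into the join
>   val joined = factDf.join(broadcast(lookupDf), Seq("join_key"))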
>
> It had been working fine historically, and only now is it failing.
>
> The data sizes I mentioned were taken from S3.
>
> Thanks,
> Sudharshan
>
> On Wed, 17 Jul, 2024, 1:53 am Meena Rajani, <meenakraj...@gmail.com>
> wrote:
>
>> Can you try disabling the broadcast join and see what happens?
>>
>> On Mon, Jul 8, 2024 at 12:03 PM Sudharshan V <sudharshanv2...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> We've been facing a weird issue lately.
>>> In our production codebase, we have an explicit broadcast for a small
>>> table.
>>> It is just a lookup table that is around 1 GB in size in S3, with just a
>>> few million records and 5 columns.
>>>
>>> The ETL was running fine, but with no change to the codebase or the
>>> infrastructure, we are now getting broadcast failures. Even weirder, the
>>> older size of the data was 1.4 GB, while for the new run it is just 900 MB.
>>>
>>> Below is the error message:
>>> Cannot broadcast table that is larger than 8 GB : 8GB.
>>>
>>> I find it extremely weird considering that the data size is well under
>>> the threshold.
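>>>
>>> One way I know of to see the size estimates Spark itself is working with
>>> is EXPLAIN COST, which prints sizeInBytes and rowCount for each node of
>>> the optimized plan. A sketch, with a hypothetical table name:
>>>
>>>   spark.sql("EXPLAIN COST SELECT * FROM lookup_table").show(false)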
>>>
>>> Are there any other ways to find out what the issue could be, and how we
>>> can rectify it?
>>>
>>> Could the data characteristics be an issue?
>>>
>>> Any help would be immensely appreciated.
>>>
>>> Thanks
>>>
>>
