We removed the explicit broadcast for that particular table, and the job took longer since the join type changed from broadcast hash join (BHJ) to sort-merge join (SMJ).
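As a minimal sketch of the setup being described (the DataFrame names and join key are hypothetical stand-ins, and auto-broadcast is disabled as noted further down the thread), the join-strategy change can be confirmed from the physical plan:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the real fact and lookup tables.
    val factDf   = Seq((1, "a"), (2, "b")).toDF("lookup_key", "payload")
    val lookupDf = Seq((1, "x"), (2, "y")).toDF("lookup_key", "value")

    // Auto-broadcast disabled, so only explicit hints trigger a broadcast join.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // With the hint, the physical plan shows BroadcastHashJoin (BHJ)...
    factDf.join(broadcast(lookupDf), Seq("lookup_key")).explain()

    // ...and without it, the planner falls back to SortMergeJoin (SMJ).
    factDf.join(lookupDf, Seq("lookup_key")).explain()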
I wanted to understand how I can find out what went wrong with the broadcast now. How do I know the size of the table inside Spark memory? I have tried caching the table, hoping I could see its size in the Storage tab of the Spark UI on EMR, but I see no data there.

Thanks

On Tue, 23 Jul, 2024, 12:48 pm Sudharshan V, <sudharshanv2...@gmail.com> wrote:

> Hi all, apologies for the delayed response.
>
> We are using Spark version 3.4.1 in the jar and the EMR 6.11 runtime.
>
> We have always disabled auto broadcast and would broadcast the smaller
> tables using explicit broadcast hints.
>
> It was working fine historically; only now is it failing.
>
> The data sizes I mentioned were taken from S3.
>
> Thanks,
> Sudharshan
>
> On Wed, 17 Jul, 2024, 1:53 am Meena Rajani, <meenakraj...@gmail.com> wrote:
>
>> Can you try disabling broadcast join and see what happens?
>>
>> On Mon, Jul 8, 2024 at 12:03 PM Sudharshan V <sudharshanv2...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> We have been facing a weird issue lately.
>>> In our production code base, we have an explicit broadcast for a small
>>> table. It is just a lookup table, around 1 GB in size on S3, with just a
>>> few million records and 5 columns.
>>>
>>> The ETL was running fine, but with no change to the codebase or the
>>> infrastructure, we are getting broadcast failures. Even weirder, the
>>> older size of the data was 1.4 GB while for the new run it is just 900 MB.
>>>
>>> Below is the error message:
>>> Cannot broadcast table that is larger than 8 GB: 8 GB.
>>>
>>> I find it extremely weird, considering that the data size is well under
>>> the threshold.
>>>
>>> Are there any other ways to find what the issue could be, and how can we
>>> rectify it?
>>>
>>> Could the data characteristics be an issue?
>>>
>>> Any help would be immensely appreciated.
>>>
>>> Thanks
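For the size question at the top of the thread, a minimal Scala sketch of two ways to see the size Spark itself works with (the S3 path and DataFrame name are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Hypothetical path standing in for the real lookup table location.
    val lookupDf = spark.read.parquet("s3://bucket/path/to/lookup/")

    // The optimizer's in-memory size estimate for the table. Compressed,
    // columnar Parquet can expand several-fold once deserialized, so this
    // can be much larger than the size of the objects on S3.
    println(lookupDf.queryExecution.optimizedPlan.stats.sizeInBytes)

    // cache() is lazy: with no action afterwards, nothing is materialized
    // and the Storage tab of the Spark UI stays empty. Triggering an action
    // populates the tab with the cached size.
    lookupDf.cache().count()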