Hello All,

Hope this email finds you well. I have a dataframe of about 8 TB (Parquet, snappy compressed). When I group it by a column, the aggregated result is much smaller: roughly 700 rows with just two columns (key and count). However, when I try to broadcast this aggregated result as below, it throws a "dataframe cannot be broadcasted" error.

from pyspark.sql.functions import broadcast

df_agg = df.groupBy('column1').count().cache()
# df_agg.count()  # would materialize the cache if uncommented
df_join = df.join(broadcast(df_agg), 'column1', 'left_outer')
df_join.write.parquet('PATH')

The same code works without any modification when the input dataframe is around 3 TB.

Any suggestions?

-- 
Regards,

Rishi Shah
