After adding the sequential ids you might need a repartition? I've found,
when using monotonically_increasing_id before, that the df ends up in a
single partition. It usually becomes clear in the Spark UI though.
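A quick way to check is something like this (a rough sketch; df and the
target partition count are placeholders):

    from pyspark.sql.functions import monotonically_increasing_id

    df = df.withColumn("seq_id", monotonically_increasing_id())
    print(df.rdd.getNumPartitions())  # if this prints 1, everything sits on one task
    df = df.repartition(200)          # placeholder count, size it to your cluster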
On Tue, 6 Oct 2020, 20:38 Sachit Murarka wrote:
Yes, even I tried the same first. Then I moved to the join method, because
shuffle spill was happening: row_number without a partition runs on a
single task. Instead of processing the entire dataframe on a single task, I
have broken it down into df1 and df2 and am joining them, because df2 has
very little data.
Try to avoid broadcast. I thought this:
https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6
could be helpful.
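If I remember right, one of the approaches covered there goes through the
RDD API with zipWithIndex, roughly like this (a sketch; the row_id column
name is made up):

    from pyspark.sql.types import LongType, StructField, StructType

    # zipWithIndex assigns consecutive indexes without collapsing to one task
    new_schema = StructType(df.schema.fields +
                            [StructField("row_id", LongType(), False)])
    df_with_id = (df.rdd
                    .zipWithIndex()
                    .map(lambda pair: tuple(pair[0]) + (pair[1],))
                    .toDF(new_schema))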
On Tue, Oct 6, 2020 at 12:18 PM Sachit Murarka wrote:
Thanks Eve for the response.

Yes, I know we can use broadcast for smaller datasets. I increased the
threshold (4 GB) for the same, but even then it did not work, and df3 is
somewhat greater than 2 GB.

Trying by removing broadcast as well. The job has been running for 1 hour.
Will let you know.

Thanks
Sachit
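For reference, this is roughly what I had changed (the 4 GB value is the
threshold I mentioned above):

    # threshold is in bytes; this is the ~4 GB I raised it to
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 4 * 1024 * 1024 * 1024)

    # and to remove broadcast entirely, -1 disables automatic broadcast joins
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)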
How many rows does df3 have? Broadcast joins are a great way to append data
stored in relatively *small* single-source-of-truth data files to large
DataFrames. DataFrames up to 2 GB can be broadcast, so a data file with
tens or even hundreds of thousands of rows is a broadcast candidate. Your
broad
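For context, an explicit broadcast join is just a hint on the smaller side
(a sketch; big_df and the join key "id" are assumptions):

    from pyspark.sql.functions import broadcast

    # the broadcast() hint ships df3 whole to every executor
    result = big_df.join(broadcast(df3), on="id", how="left")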
Hello Users,

I am facing an issue in a Spark job where I am doing row_number() without a
partitionBy clause, because I need to add sequentially increasing IDs. But
to avoid the large spill, I am not doing row_number() over the complete
dataframe.
Instead I am applying monotonically_increasing_id on the actual dataframe.
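Roughly, the two patterns in question (a sketch; some_col is a made-up
ordering column):

    from pyspark.sql import Window
    from pyspark.sql.functions import row_number, monotonically_increasing_id

    # row_number() without partitionBy pulls every row into a single task
    w = Window.orderBy("some_col")
    df_seq = df.withColumn("row_id", row_number().over(w))

    # monotonically_increasing_id() stays parallel; ids are unique and
    # increasing, but not consecutive
    df_mono = df.withColumn("row_id", monotonically_increasing_id())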