How many rows are you joining? How many rows in the output?
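A quick way to get those numbers, and the per-key row counts that usually explain an exploding join (a minimal sketch; the DataFrame and column names are taken from the snippet quoted further down the thread):

import org.apache.spark.sql.functions.{col, desc}

// Row counts on each side of the join.
val leftRows  = aggImpsDf.count()
val rightRows = aggRevenueDf.count()

// Rows per join-key combination on each side; a handful of very hot (or NULL)
// keys multiply together during the join and can blow up the shuffle.
val keyCols = Seq("id_1", "id_2", "day_hour", "day_hour_2").map(col)
aggImpsDf.groupBy(keyCols: _*).count().orderBy(desc("count")).show(20)
aggRevenueDf.groupBy(keyCols: _*).count().orderBy(desc("count")).show(20)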
Regards
Sab
On 24-Oct-2015 2:32 am, "pratik khadloya" wrote:
Actually the groupBy is not taking a lot of time.
The join that I do later takes most (95%) of the time.
Also, the grouping I am doing is through the DataFrame API, which does not
have a reduceByKey function... I guess the DataFrame automatically uses a
reduce-by-key style (partial) aggregation when we do a groupBy.
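For comparison, a minimal sketch of the two styles (impsDf and impsRdd are made-up inputs for illustration, not the real ones from this job):

import org.apache.spark.sql.functions.sum

// DataFrame API: there is no reduceByKey, but groupBy(...).agg(...) is planned
// with partial (map-side) aggregation, similar in spirit to reduceByKey.
val dfAgg = impsDf.groupBy("id_1", "id_2").agg(sum("imps").as("total_imps"))

// RDD API: reduceByKey combines values per key before the shuffle.
val rddAgg = impsRdd.map { case (id1, id2, imps) => ((id1, id2), imps) }
                    .reduceByKey(_ + _)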
~Prat
Don't use groupBy; use reduceByKey instead. groupBy should be avoided wherever
possible, as it leads to a lot of shuffle reads/writes.
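For the plain RDD API the difference looks roughly like this (a sketch over a hypothetical pairs: RDD[(String, Long)]):

// groupByKey ships every value for a key across the network, then aggregates:
val slow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values per key on the map side first, so far less
// data is shuffled:
val fast = pairs.reduceByKey(_ + _)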
On Fri, Oct 23, 2015 at 11:39 AM, pratik khadloya wrote:
Sorry, I sent the wrong join code snippet; the actual snippet is:
aggImpsDf.join(
  aggRevenueDf,
  aggImpsDf("id_1") <=> aggRevenueDf("id_1")
    && aggImpsDf("id_2") <=> aggRevenueDf("id_2")
    && aggImpsDf("day_hour") <=> aggRevenueDf("day_hour")
    && aggImpsDf("day_hour_2") <=> aggRevenueDf("day_hour_2"))
Hello,
Data about my Spark job is below. My source data is only 916 MB (stage 0)
and 231 MB (stage 1), but when I join the two data sets (stage 2) it takes a
very long time, and the shuffled data is 614 GB. Is this expected? Both
data sets produce 200 partitions.
Stage Id | Description | ...
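A couple of things that may be worth checking here (a sketch; joinedDf stands for the result of the join snippet earlier in the thread, 200 is just the default value of spark.sql.shuffle.partitions, and the 800 below is an arbitrary example):

// Show the physical plan of the join, including the shuffle (Exchange) steps:
joinedDf.explain()

// Raise the number of shuffle partitions from the default of 200 if each
// partition ends up too large (Spark 1.x SQLContext setting):
sqlContext.setConf("spark.sql.shuffle.partitions", "800")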