Re: Huge shuffle data size

2015-10-23 Thread Sabarish Sasidharan
How many rows are you joining? How many rows in the output?

Regards,
Sab

On 24-Oct-2015 2:32 am, "pratik khadloya" wrote:
> Actually the groupBy is not taking a lot of time.
> The join that i do later takes the most (95 %) amount of time.
> Also, the grouping i am doing is based on the DataFrame
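
A quick way to answer these questions is to count the rows on each side and the rows per join key before running the join; when a handful of keys dominate, the join output (and the shuffle) grows roughly as the product of the per-key counts. A minimal sketch, assuming the aggImpsDf / aggRevenueDf DataFrames and the id_1 / id_2 key columns mentioned later in the thread:

    // Row counts on each side of the join.
    val leftRows  = aggImpsDf.count()
    val rightRows = aggRevenueDf.count()

    // Rows per join key on one side; heavily repeated keys are the usual
    // cause of a join output far larger than the inputs.
    val leftKeyCounts = aggImpsDf.groupBy("id_1", "id_2").count()
    leftKeyCounts.orderBy(leftKeyCounts("count").desc).show(20)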

Re: Huge shuffle data size

2015-10-23 Thread pratik khadloya
Actually the groupBy is not taking a lot of time. The join that I do later takes most (95%) of the time. Also, the grouping I am doing is based on the DataFrame API, which does not contain any function for reduceBy... I guess the DF automatically uses a reduce-by when we do a group by.

~Pratik
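
For reference, a DataFrame groupBy followed by an aggregate is planned with a partial (map-side) aggregation before the shuffle, much like reduceByKey on an RDD, so only partially aggregated rows cross the network. A minimal sketch, with a hypothetical input DataFrame impressionsDf and an assumed impressions column:

    import org.apache.spark.sql.functions.sum

    // groupBy + agg is planned as partial aggregation -> shuffle -> final
    // aggregation, so the shuffle carries partial sums rather than raw rows.
    val aggImpsDf = impressionsDf
      .groupBy("id_1", "id_2", "day_hour")
      .agg(sum("impressions").as("impressions"))

    // The physical plan shows the two aggregation steps around the exchange.
    aggImpsDf.explain()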

Re: Huge shuffle data size

2015-10-23 Thread Kartik Mathur
Don't use groupBy, use reduceByKey instead. groupBy should always be avoided as it leads to a lot of shuffle reads/writes.

On Fri, Oct 23, 2015 at 11:39 AM, pratik khadloya wrote:
> Sorry i sent the wrong join code snippet, the actual snippet is
>
> aggImpsDf.join(
>   aggRevenueDf,
>   aggImps
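
The groupBy-versus-reduceByKey distinction applies at the RDD level; a minimal sketch with a hypothetical pair RDD (sc is the SparkContext) showing why reduceByKey shuffles far less:

    import org.apache.spark.rdd.RDD

    val pairs: RDD[(String, Long)] =
      sc.parallelize(Seq(("a", 1L), ("a", 2L), ("b", 3L)))

    // groupByKey ships every individual value across the network and only
    // then aggregates on the reduce side.
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey sums values map-side first, so only one partial sum per
    // key and partition is shuffled.
    val viaReduce = pairs.reduceByKey(_ + _)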

Re: Huge shuffle data size

2015-10-23 Thread pratik khadloya
Sorry, I sent the wrong join code snippet; the actual snippet is

aggImpsDf.join(
  aggRevenueDf,
  aggImpsDf("id_1") <=> aggRevenueDf("id_1") &&
  aggImpsDf("id_2") <=> aggRevenueDf("id_2") &&
  aggImpsDf("day_hour") <=> aggRevenueDf("day_hour") &&
  aggImpsDf("day_hour_2") <=> aggRevenue
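
The snippet is cut off by the archive; a complete version of the same null-safe (<=>), multi-column join would look roughly like the sketch below, where the trailing aggRevenueDf("day_hour_2") is only a guess at the truncated part. Checking the physical plan shows which join strategy Spark picks, which has a large effect on shuffle size:

    val joined = aggImpsDf.join(
      aggRevenueDf,
      aggImpsDf("id_1") <=> aggRevenueDf("id_1") &&
        aggImpsDf("id_2") <=> aggRevenueDf("id_2") &&
        aggImpsDf("day_hour") <=> aggRevenueDf("day_hour") &&
        aggImpsDf("day_hour_2") <=> aggRevenueDf("day_hour_2"))  // assumed continuation

    // Inspect the chosen join strategy (sort-merge / hash vs. nested loop).
    joined.explain()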

Huge shuffle data size

2015-10-23 Thread pratik khadloya
Hello,

Data about my Spark job is below. My source data is only 916MB (stage 0) and 231MB (stage 1), but when I join the two data sets (stage 2) it takes a very long time, and as I can see the shuffled data is 614GB. Is this expected? Both data sets produce 200 partitions.

Stage Id  Descri
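
With one side at only ~231MB, one option worth trying is to broadcast the smaller DataFrame so the join does not have to shuffle both inputs; the 200 partitions come from the default shuffle setting and can also be tuned. A rough sketch, assuming Spark 1.5+, enough executor memory to hold the broadcast side, and sqlContext as the active SQLContext (key columns abbreviated):

    import org.apache.spark.sql.functions.broadcast

    // Hint Spark to replicate the smaller DataFrame to every executor
    // instead of shuffling both sides of the join.
    val joined = aggImpsDf.join(
      broadcast(aggRevenueDf),
      aggImpsDf("id_1") <=> aggRevenueDf("id_1") &&
        aggImpsDf("id_2") <=> aggRevenueDf("id_2"))

    // If a shuffle is still needed, the number of post-shuffle partitions
    // (default 200) can be adjusted:
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")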