That will depend on what your transformation is; your code snippet might help.
On Tue, Oct 20, 2015 at 1:53 AM, shahid ashraf <sha...@trialx.com> wrote:

> Hi
>
> Any idea why there is 50 GB of shuffle read and write for 3.3 GB of data?
>
> On Mon, Oct 19, 2015 at 11:58 PM, Kartik Mathur <kar...@bluedata.com>
> wrote:
>
>> That sounds like correct shuffle output. In Spark, the map and reduce
>> phases are separated by a shuffle: in the map phase each executor writes
>> to local disk, and in the reduce phase the reducers read that data from
>> each executor over the network, so the shuffle definitely hurts
>> performance. For more details on the Spark shuffle phase, please read
>> this:
>>
>> http://0x0fff.com/spark-architecture-shuffle/
>>
>> Thanks
>> Kartik
>>
>> On Mon, Oct 19, 2015 at 6:54 AM, shahid <sha...@trialx.com> wrote:
>>
>>> @all I did partitionBy using the default hash partitioner on data of
>>> the form [(1, data), (2, data), ..., (n, data)]. The total data was
>>> approx 3.5 GB, yet it showed a shuffle write of 50 GB, and on the next
>>> action (e.g. count) it shows a shuffle read of 50 GB. I don't
>>> understand this behaviour, and I think performance is getting slow
>>> with so much shuffle read on subsequent transformation operations.
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-tp584p25119.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>
> --
> with Regards
> Shahid Ashraf
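[Editor's note: since the original snippet was never posted, the following is a minimal PySpark sketch of the scenario the thread describes, not the poster's actual code. The record generator, data sizes, and `num_parts` are assumptions for illustration only.]

    # Hypothetical reconstruction of the scenario in this thread.
    from pyspark import SparkContext

    sc = SparkContext(appName="shuffle-example")

    num_parts = 200  # assumed partition count; the thread does not say

    # A pair RDD of the form [(1, data), (2, data), ..., (n, data)];
    # the payload here is a dummy placeholder string.
    pairs = sc.parallelize(range(1, 1000000)).map(lambda k: (k, "x" * 100))

    # partitionBy triggers a full shuffle: each map-side task serializes
    # its output and writes it to local disk (reported as "shuffle write"
    # in the Spark UI).
    partitioned = pairs.partitionBy(num_parts)  # default hash partitioner

    # The next action's stage fetches those shuffle blocks over the
    # network, which is the matching "shuffle read" figure.
    print(partitioned.count())

One plausible reason the shuffle bytes can dwarf the raw input size in a case like this is serialization overhead: in PySpark the shuffled records are pickled Python objects, which can be several times larger than the same data on disk. If later actions reuse the partitioned RDD, calling `partitioned.cache()` before the first action is a common mitigation, though not one this thread confirms.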