It looks like you are using hash-based shuffle rather than sort-based shuffle. Sort-based shuffle creates a single file per map task, instead of one file per reducer per map task as hash-based shuffle does.
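In case it helps, a minimal sketch of forcing the sort-based manager (Spark 1.x configs; the app name below is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sort-based shuffle has been the default since Spark 1.2, so setting
    // it explicitly only matters if something in your job overrides it.
    val conf = new SparkConf()
      .setAppName("skewed-hive-job")          // hypothetical app name
      .set("spark.shuffle.manager", "sort")   // one data file + one index file per map task
      // If you must stay on hash-based shuffle, consolidating reduces the file count:
      // .set("spark.shuffle.consolidateFiles", "true")
    val sc = new SparkContext(conf)

Note that these are the intermediate shuffle files on local disk; the number of output files written to HDFS is still governed by the partition count of the final stage.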
On Thu, Aug 20, 2015 at 12:43 AM, unk1102 <umesh.ka...@gmail.com> wrote:
> Hi, I have a Spark job which deals with a large skewed dataset. I have
> around 1000 Hive partitions to process across four different tables every
> day. If I go with the default of 200 spark.sql.shuffle.partitions created
> by Spark, I end up with 4 * 1000 * 200 = 800,000 small files in HDFS,
> which won't be good for the HDFS NameNode. I have been told that if you
> keep creating such a large number of small files, the NameNode will
> crash. Is that true? Please help me understand.
>
> Anyway, to avoid creating small files I set
> spark.sql.shuffle.partitions=1. It does create one output file, but as I
> understand it, with only one output partition there is a lot of shuffling
> to bring all the data to a single reducer. Please correct me if I am
> wrong. This is causing memory/timeout issues. How do I deal with them?
>
> I also tried setting spark.shuffle.storage=0.7, but the memory still
> seems not to be enough. I have 25 GB executors with 4 cores, and 20 such
> executors, and the Spark job still fails. Please guide.