It looks like you are using hash-based shuffle rather than sort-based shuffle. Sort-based shuffle creates a single file per map task, instead of one file per reducer per map task as hash-based shuffle does.
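In case it helps, a minimal sketch of forcing the sort-based manager (Spark 1.x configs; the app name below is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sort-based shuffle has been the default since Spark 1.2, so setting
    // it explicitly only matters if something in your job overrides it.
    val conf = new SparkConf()
      .setAppName("skewed-hive-job")          // hypothetical app name
      .set("spark.shuffle.manager", "sort")   // one data file + one index file per map task
      // If you must stay on hash-based shuffle, consolidating reduces the file count:
      // .set("spark.shuffle.consolidateFiles", "true")
    val sc = new SparkContext(conf)

Note that these are the intermediate shuffle files on local disk; the number of output files written to HDFS is still governed by the partition count of the final stage.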
On Thu, Aug 20, 2015 at 12:43 AM, unk1102 <umesh.ka...@gmail.com> wrote:
> Hi, I have a Spark job which deals with a large skewed dataset. I have
> around 1000 Hive partitions to process across four different tables every
> day. If I go with the default of 200 spark.sql.shuffle.partitions created
> by Spark, I end up with 4 * 1000 * 200 = 800,000 small files in HDFS,
> which won't be good for the HDFS NameNode. I have been told that if you
> keep creating such a large number of small files, the NameNode will
> crash. Is that true? Please help me understand.
>
> Anyway, to avoid creating small files I set
> spark.sql.shuffle.partitions=1. It does create one output file, but as I
> understand it, with only one output partition there is a lot of shuffling
> to bring all the data to a single reducer. Please correct me if I am
> wrong. This is causing memory/timeout issues. How do I deal with them?
>
> I also tried setting spark.shuffle.storage=0.7, but the memory still
> seems not to be enough. I have 25 GB executors with 4 cores, and 20 such
> executors, and the Spark job still fails. Please guide.