My streaming job writes files to S3.
The problem is that the files end up very small if I write each batch's
partitions to S3 directly.
This is why I use coalesce() to reduce the number of partitions, and hence
output files, and make the files larger.
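
For concreteness, the write path currently looks roughly like this (a
simplified sketch; stream, numFiles and outputPath stand in for my actual
DStream, target file count and S3 prefix):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

// Simplified sketch of the current write path; `stream`, `numFiles` and
// `outputPath` are placeholders for my actual values.
stream.foreachRDD { (rdd: RDD[String], batchTime: Time) =>
  rdd.coalesce(numFiles) // merge partitions so each batch writes fewer, larger files
    .saveAsTextFile(s"$outputPath/batch-${batchTime.milliseconds}")
}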

However, coalesce() moves data between partitions (and reduces the
parallelism of the upstream stages), so my batch processing time ends up
higher than sparkBatchIntervalMilliseconds.

I have observed that if I coalesce to a number of partitions equal to the
number of cores in the cluster, I get less shuffling, but I haven't
substantiated that. A sketch of what I tried is below.
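
What I tried, roughly (another sketch; I'm assuming sc.defaultParallelism
approximates executors * cores per executor, which depends on the cluster
manager):

val sc = stream.context.sparkContext

// Heuristic I observed to help: one output partition per core in the
// cluster. I'm assuming defaultParallelism ~= number of executors * cores
// per executor here, which only holds for some deployment modes.
val totalCores = sc.defaultParallelism
stream.foreachRDD { rdd =>
  rdd.coalesce(totalCores).saveAsTextFile(outputPath)
}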
Is there any dependency or rule between the number of executors, the number
of cores per executor, etc. that I can use to minimize shuffling while still
producing the minimum number of output files per batch?
What is the best practice?



