Why does the order matter? Coalesce runs in parallel, and if it's just writing to the file, I imagine it writes in whatever order the tasks happen to execute. If you want the resulting data sorted, I imagine you'd need to collect it into some sort of data structure instead of writing to the file directly from coalesce, sort that data structure, and then write your file.
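If sorted output really is needed, one option (just a sketch, assuming the RDD holds Strings or another type with an Ordering, and a hypothetical s3a path) is to sort before coalescing and writing:

    // sort before writing so output order is deterministic
    rdd.sortBy(identity)     // full shuffle to order the data
       .coalesce(1)          // merge partitions in order -> one output file
       .saveAsTextFile("s3a://my-bucket/sorted-output")

The shuffle from sortBy is the price of the ordering; coalesce(1) without a shuffle then concatenates the sorted partitions in order into a single file.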
--
Chris Miller

On Sat, Mar 5, 2016 at 5:24 AM, jelez <je...@hotmail.com> wrote:
> My streaming job is creating files on S3.
> The problem is that those files end up very small if I just write them
> to S3 directly.
> This is why I use coalesce() to reduce the number of files and make them
> larger.
>
> However, coalesce shuffles data and my job processing time ends up higher
> than sparkBatchIntervalMilliseconds.
>
> I have observed that if I coalesce the number of partitions to be equal to
> the cores in the cluster I get less shuffling - but that is unsubstantiated.
> Is there any dependency/rule between number of executors, number of cores
> etc. that I can use to minimize shuffling and at the same time achieve a
> minimum number of output files per batch?
> What is the best practice?
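For reference, the pattern described in the question would look roughly like this (a sketch only; `stream`, `numCores`, and the bucket path are placeholders, not anything from the original job):

    // coalesce each micro-batch down to roughly the number of cores
    // before writing to S3
    stream.foreachRDD { rdd =>
      rdd.coalesce(numCores)   // narrows partitions; no full shuffle by default
         .saveAsTextFile(s"s3a://my-bucket/output/batch-${System.currentTimeMillis}")
    }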