Hi,

Our Spark app reduces a few hundred GB of data down to a few hundred KB
of CSV. We found that 1000 partitions is a good number to speed the
process up. However, it does not make sense to end up with 1000 CSV part
files, each less than 1 KB.

We used RDD.coalesce(1) to get a single CSV file, but it is extremely
slow and does not use our resources properly, presumably because without
a shuffle the whole upstream computation collapses into one task. So this
is very slow:

rdd.map(...).coalesce(1).saveAsTextFile()
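
Our understanding is that coalesce without a shuffle narrows the whole
stage to one partition, so the map itself ends up running in a single
task. The partition counts seem to confirm this:

val mapped = rdd.map(...)
println(mapped.getNumPartitions)             // 1000
println(mapped.coalesce(1).getNumPartitions) // 1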

Is there a way to use coalesce(1) only to concatenate the materialized
output text files, without also collapsing the upstream computation into
one task? Would something like this make sense?:

rdd.map(...).coalesce(100).coalesce(1).saveAsTextFile()
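
We suspect it would not, since the two chained coalesce calls would just
collapse into one. A variant we are considering instead is the shuffle
flag on coalesce, which as we understand it inserts a shuffle boundary so
the map stage keeps its 1000 partitions and only the final write runs as
a single task (equivalent to repartition(1)):

rdd.map(...).coalesce(1, shuffle = true).saveAsTextFile()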

Or, would something like this achieve it?:

rdd.map(...).cache().coalesce(1).saveAsTextFile()
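
Our worry with cache() alone is that the cached partitions would still be
computed inside the single coalesced task on the first action. A variant
that might work, assuming the final task can read the already cached
blocks, is to materialize the cache with a separate action first:

val mapped = rdd.map(...).cache()
mapped.count()  // action to materialize the cache at 1000 partitions
mapped.coalesce(1).saveAsTextFile()  // single task then reads cached blocks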
