Hi,

Our Spark app reduces a few hundred GB of input data down to a few hundred KB of CSV. We found that a partition count of 1000 is a good number for speeding the process up. However, it makes no sense to end up with 1000 CSV part files, each smaller than 1 KB.
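For context, here's a minimal sketch of the shape of the job (the input path, parsing, and aggregation are placeholders, not our real code):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("reduce-to-csv"))

    // hundreds of GB in; 1000 partitions keep the heavy phase parallel
    val reduced = sc.textFile("hdfs:///data/input")   // placeholder path
      .map(line => (line.split(",")(0), 1L))          // placeholder key/value parse
      .reduceByKey(_ + _, 1000)                       // 1000 reduce partitions; result is only ~100s of KB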
We used RDD.coalesce(1) to get a single CSV file, but it's extremely slow, and we aren't using our resources properly this way. So this is very slow:

    rdd.map(...).coalesce(1).saveAsTextFile()

Is it possible to use coalesce(1) simply to concatenate the materialized output text files? Would something like this make sense:

    rdd.map(...).coalesce(100).coalesce(1).saveAsTextFile()

Or would something like this achieve it:

    rdd.map(...).cache().coalesce(1).saveAsTextFile()
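For completeness, here's a runnable version of what we're comparing, continuing from the sketch above (output paths and the CSV formatting are placeholders). We also see that coalesce accepts a shuffle flag; whether that's the proper fix here is part of what we're asking:

    // turn the reduced pairs into CSV lines (placeholder formatting)
    val lines = reduced.map { case (k, v) => s"$k,$v" }

    // What we do now -- slow: without a shuffle, coalesce(1) merges into
    // its parent stage, so the reduce and map above run as a single task.
    lines.coalesce(1).saveAsTextFile("hdfs:///out/slow")

    // With shuffle = true (equivalent to repartition(1)) the upstream stage
    // keeps its 1000 tasks and only the final write happens in one task.
    lines.coalesce(1, shuffle = true)
      .saveAsTextFile("hdfs:///out/single-file")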