I was reviewing a Spark Java application running on AWS EMR. The code looked like this:

    RDD.reduceByKey(func).coalesce(number).saveAsTextFile()
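For concreteness, here is a minimal sketch of that pattern next to the one-argument change I describe below. The data, the reduce function, the partition count, and the output paths are illustrative stand-ins, not the real job:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class CoalesceAfterReduce {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("coalesce-after-reduce");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Toy input standing in for the real data set.
            JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)));

            int number = 1;  // stand-in for the real output partition count

            // Slow version: reduce first, then collapse the result to `number` partitions.
            pairs.reduceByKey(Integer::sum)
                 .coalesce(number)
                 .saveAsTextFile("out-coalesce");

            // Fast version: ask reduceByKey itself to produce `number` partitions.
            pairs.reduceByKey(Integer::sum, number)
                 .saveAsTextFile("out-direct");

            sc.stop();
        }
    }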
That stage took hours to complete. I changed it to:

    RDD.reduceByKey(func, number).saveAsTextFile()

Now it takes less than 2 minutes, and the final output is the same. So, is this a bug or a feature? Why doesn't Spark treat a coalesce after a reduce the same as a reduce with the number of output partitions passed as a parameter? I'm just asking to understand.

Thanks,
Pedro