I was reviewing a Spark Java application running on AWS EMR.

The code was like:
RDD.reduceByKey(func).coalesce(number).saveAsTextFile()

That stage took hours to complete.
I changed it to:
RDD.reduceByKey(func, number).saveAsTextFile()
Now it takes less than 2 minutes, and the final output is the same.

So, is this a bug or a feature?
Why doesn't Spark treat a coalesce after a reduce the same as a reduce with
the number of output partitions passed as a parameter?
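For what it's worth, here is a toy model (plain Python, not Spark's actual implementation) of why the final output is identical either way: reduceByKey hash-partitions records by key and reduces within each partition, so coalescing afterwards only merges already-reduced partitions, while reducing directly into `number` partitions hashes into that many buckets up front. The helper names and the sample data below are made up for illustration.

```python
# Toy model of the two pipelines. NOT Spark: keys are hashed into
# partitions the way a HashPartitioner would, then reduced per partition.
from collections import defaultdict

def reduce_by_key(records, func, num_partitions):
    """Hash-shuffle records into num_partitions by key, reduce each partition."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in records:
        buckets[hash(key) % num_partitions][key].append(value)
    partitions = []
    for bucket in buckets:
        reduced = {}
        for key, values in bucket.items():
            acc = values[0]
            for v in values[1:]:
                acc = func(acc, v)
            reduced[key] = acc
        partitions.append(reduced)
    return partitions  # list of per-partition {key: reduced_value} dicts

def coalesce(partitions, n):
    """Merge existing partitions into n, without reshuffling by key."""
    merged = [dict() for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].update(part)
    return merged

data = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
add = lambda x, y: x + y

# Pipeline 1: reduce into many partitions, then coalesce down to 2.
p1 = coalesce(reduce_by_key(data, add, 8), 2)
# Pipeline 2: reduce directly into 2 partitions.
p2 = reduce_by_key(data, add, 2)

# Same reduced values either way; only the partition layout differs.
flatten = lambda parts: dict(kv for d in parts for kv in d.items())
assert flatten(p1) == flatten(p2) == {"a": 4, "b": 7, "c": 4}
```

The toy only shows why correctness is preserved; it says nothing about the runtime difference you observed, which is the part the question is really about.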

Just for understanding,
Thanks,
Pedro.
