I was reviewing a Spark Java application running on AWS EMR. The code looked like this:

    RDD.reduceByKey(func).coalesce(number).saveAsTextFile()
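For concreteness, here is a minimal sketch of that pattern next to the one-argument change I describe below. The data, the reduce function, the partition count, and the output paths are illustrative stand-ins, not the real job:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class CoalesceAfterReduce {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("coalesce-after-reduce");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Toy input standing in for the real data set.
            JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)));

            int number = 1;  // stand-in for the real output partition count

            // Slow version: reduce first, then collapse the result to `number` partitions.
            pairs.reduceByKey(Integer::sum)
                 .coalesce(number)
                 .saveAsTextFile("out-coalesce");

            // Fast version: ask reduceByKey itself to produce `number` partitions.
            pairs.reduceByKey(Integer::sum, number)
                 .saveAsTextFile("out-direct");

            sc.stop();
        }
    }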
That stage took hours to complete. I changed it to:

    RDD.reduceByKey(func, number).saveAsTextFile()

Now it takes less than 2 minutes, and the final output is the same. So, is this a bug or a feature? Why doesn't Spark treat a coalesce after a reduce the same as a reduce with the number of output partitions passed as a parameter? I'm just asking to understand.

Thanks,
Pedro