I don't know the actual implementation, but to me the shuffle is still necessary: each worker reads its data separately and reduces it to a local distinct set, and those local sets then need to be shuffled so that the actual global distinct can be found.
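As a rough illustration of that two-phase mechanism (this is plain Python, not the Spark API, and the helper names are hypothetical): each partition is deduplicated locally, then values are routed by hash so all copies of the same value land on the same reducer, which makes the final deduplication correct.

```python
def local_distinct(partition):
    """Map side: each worker deduplicates its own partition."""
    return set(partition)

def shuffle_by_hash(local_sets, num_reducers):
    """Shuffle: route each value to a reducer by hash so every copy
    of the same value ends up on the same reducer."""
    buckets = [set() for _ in range(num_reducers)]
    for s in local_sets:
        for value in s:
            buckets[hash(value) % num_reducers].add(value)
    return buckets

# Two workers hold overlapping data; local distinct alone is not enough,
# because 1 and 3 appear on both workers.
partitions = [[1, 2, 2, 3], [3, 3, 4, 1]]
local = [local_distinct(p) for p in partitions]   # [{1, 2, 3}, {1, 3, 4}]
buckets = shuffle_by_hash(local, num_reducers=2)
global_distinct = sorted(set().union(*buckets))
print(global_distinct)  # [1, 2, 3, 4]
```

The point is that the local reduce shrinks the data that must cross the network, but it cannot eliminate the shuffle itself, since duplicates spanning workers are only detectable once matching values are co-located.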
On Sun, 23 Jan 2022, 17:39 ashok34...@yahoo.com.INVALID wrote:
Hello,
I know some operators in Spark are expensive because of shuffle.
This document describes shuffle:
https://www.educba.com/spark-shuffle/
and says: "More shufflings in numbers are not always bad. Memory constraints and
other impossibilities can be overcome by shuffling."
In RDD, the below are a