Hello,
I know some operators in Spark are expensive because of shuffle.
This document describes shuffle:
https://www.educba.com/spark-shuffle/

and says: "More shufflings in numbers are not always bad. Memory constraints and
other impossibilities can be overcome by shuffling."

It lists the following RDD operations as examples that cause a shuffle:
– subtractByKey
– groupBy
– foldByKey
– reduceByKey
– aggregateByKey
– transformations of a join of any type
– distinct
– cogroup
I know some operations like reduceByKey are well known for causing a shuffle,
but what I don't understand is why the distinct operation should cause one!
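For context, my current understanding (an assumption from skimming the Spark source, so please correct me if wrong) is that RDD.distinct is built on top of reduceByKey, roughly map(x => (x, null)).reduceByKey(...).map(_._1), so it would inherit reduceByKey's shuffle. Here is a plain-Python sketch of that idea, no Spark needed; NUM_PARTITIONS and partition_of are made-up names for illustration:

```python
# Sketch of why distinct needs a shuffle: equal elements may start out on
# different partitions, so they must be hash-partitioned to the same place
# before duplicates can be collapsed. That regrouping is the shuffle.

NUM_PARTITIONS = 3  # made-up partition count for the sketch

def partition_of(x):
    # Hash partitioner: picks the target partition for an element.
    return hash(x) % NUM_PARTITIONS

def distinct(partitions):
    # "Map side": every input partition routes each element to its target
    # partition. Using a set per target partition plays the role of the
    # reduceByKey step: only one copy per key survives.
    shuffled = [set() for _ in range(NUM_PARTITIONS)]
    for part in partitions:
        for x in part:
            shuffled[partition_of(x)].add(x)
    return shuffled

# The duplicates of 2 live on two different input partitions; only after
# the shuffle do they meet on one partition and collapse to a single copy.
parts = [[1, 2, 2], [2, 3], [3, 4]]
result = sorted(x for part in distinct(parts) for x in part)
print(result)  # [1, 2, 3, 4]
```

If that reading of the implementation is right, distinct shuffles for the same reason reduceByKey does: deduplication is a by-key aggregation.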

Thanks
