Hello, I know some operators in Spark are expensive because they cause a shuffle. This document describes shuffle, https://www.educba.com/spark-shuffle/, and says:

"More shufflings in numbers are not always bad. Memory constraints and other impossibilities can be overcome by shuffling. In RDD, the below are a few operations and examples of shuffle:
– subtractByKey
– groupBy
– foldByKey
– reduceByKey
– aggregateByKey
– transformations of a join of any type
– distinct
– cogroup"

I know that some operations like reduceByKey are well known for creating a shuffle, but what I don't understand is why the distinct operation should cause a shuffle! A minimal example of what I am running is shown below. Thanks!
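This sketch assumes a local SparkContext; the master setting, app name, and sample data are arbitrary placeholders of my own, not taken from the linked article. When I print the lineage with toDebugString, it includes a ShuffledRDD stage for distinct():

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistinctShuffleExample {
  def main(args: Array[String]): Unit = {
    // Local SparkContext just for illustration; master and app name are placeholders.
    val conf = new SparkConf().setMaster("local[2]").setAppName("distinct-shuffle-question")
    val sc = new SparkContext(conf)

    // A small RDD with duplicate values spread across 4 partitions.
    val rdd = sc.parallelize(Seq(1, 2, 2, 3, 3, 3, 4), numSlices = 4)

    // distinct() returns a new RDD; printing its lineage shows a ShuffledRDD stage.
    val deduped = rdd.distinct()
    println(deduped.toDebugString)
    println(deduped.collect().mkString(", "))

    sc.stop()
  }
}
```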