As a generic answer: in a distributed environment like Spark, making sure that data is distributed evenly among all nodes (assuming every node is the same or similar) can help performance. repartition controls that data distribution among the nodes. However, it is not that straightforward.
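To see why repartitioning can even things out: repartition(n) called without a column argument deals rows out round-robin, so partition sizes end up balanced even when the input is skewed. A minimal plain-Python sketch of that scheme (the function name is illustrative, not Spark API):

```python
def round_robin_partition(rows, num_partitions):
    """Deal rows out to partitions one at a time, like a deck of cards.

    Mimics the round-robin scheme Spark uses for repartition(n)
    with no partitioning column; not actual Spark code.
    """
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions

# Even a skewed input ends up balanced: sizes differ by at most one row.
rows = list(range(10))
parts = round_robin_partition(rows, 4)
print([len(p) for p in parts])  # [3, 3, 2, 2]
```

In PySpark itself the equivalent is simply df.repartition(4); each of the resulting 4 partitions then holds a near-equal share of the rows, so the 4 tasks finish at roughly the same time.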
Hi,
Just need some advice.
- When we have multiple Spark nodes running code, under what conditions does a repartition make sense?
- Can we repartition and cache the result --> df = spark.sql("select from ...").repartition(4).cache()
- If we choose a repartition (4), will that repartition ap