As a generic answer in a distributed environment like Spark, making sure that data is distributed evenly among all nodes (assuming every node is the same or similar) can help performance.

repartition controls the data distribution among the nodes. However, it is not that straightforward: your mileage will vary, because changing the distribution means physically moving data between the cluster nodes (a so-called shuffle), so there is a cost associated with repartition. You can see that shuffle by inspecting the execution plan with df.explain() or by looking at the physical plan in the Spark GUI.
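To make the shuffle visible, here is a minimal PySpark sketch (spark.range is just a stand-in for your own data). It shows the Exchange node that repartition adds to the physical plan, and one way to check how evenly the rows end up across the partitions, which is also one answer to your question 3:

    # Minimal sketch; spark.range is a stand-in for your own DataFrame.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-demo").getOrCreate()

    df = spark.range(1_000_000)
    print(df.rdd.getNumPartitions())   # partitions before repartitioning

    df4 = df.repartition(4)            # redistributes rows round-robin into 4 partitions
    print(df4.rdd.getNumPartitions())  # 4

    # The physical plan now contains an Exchange node: that is the shuffle you pay for.
    df4.explain()

    # Rough check of how evenly the rows are spread across the 4 partitions.
    print(df4.rdd.glom().map(len).collect())

Note that glom() materialises each partition as a Python list on the executors, so use it for quick checks on modest data only; for real jobs the per-task view in the Spark GUI shows the same distribution.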
In its simplest form, repartition(n) distributes the data randomly (round-robin), and I think that is the most common form. However, whether it pays off also depends on the volume of data. For smaller volumes I don't think it really matters; for large volumes, though, repartition may be an option, for example when the data in a join is skewed. So you need to know the volume of data before deciding to repartition.

On your question 2: yes, you can repartition and cache the result in one chain, but note that in PySpark cache is a method call, so it needs parentheses. There is a short sketch of that pattern after your quoted message below.

HTH

On Mon, 9 Nov 2020 at 16:57, ashok34...@yahoo.com.INVALID <ashok34...@yahoo.com.invalid> wrote:

> Hi,
>
> Just need some advice.
>
> 1. When we have multiple Spark nodes running code, under what conditions does a repartition make sense?
> 2. Can we repartition and cache the result --> df = spark.sql("select from ...").repartition(4).cache
> 3. If we choose repartition(4), will that repartition apply to all nodes running the code, and how can one see that?
>
> Thanks
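For question 2 above, a minimal sketch of the repartition-and-cache pattern, assuming PySpark (the "people" temp view is made up just to keep the example self-contained):

    # Sketch of question 2: repartition the result of a query, then cache it.
    # The "people" view is a made-up stand-in for your own table.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-cache").getOrCreate()
    spark.range(100_000).createOrReplaceTempView("people")

    df = (
        spark.sql("select * from people")
        .repartition(4)   # shuffle the query result into 4 partitions
        .cache()          # note the parentheses: cache is a method in PySpark
    )

    df.count()    # cache() is lazy; run an action to actually populate the cache
    df.explain()  # the plan now reads from the cache (InMemoryTableScan)

Because cache() is lazy, the repartitioned data is only materialised in memory when the first action (here count()) runs; subsequent actions on df then read the 4 cached partitions instead of re-running the query and the shuffle.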