Increase or Decrease the number of data partitions: Since a data partition
represents the quantum of data to be processed together by a single Spark
Task, there could be situations:
(a) where the existing number of data partitions is not sufficient to
maximize the usage of available resources
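For context, both directions are available directly on a DataFrame. A minimal sketch (spark-shell style; "README.md" is just a placeholder input):
```
val df = spark.read.text("README.md")
df.rdd.getNumPartitions          // the current number of partitions

// Increasing the partition count requires a full shuffle.
val more = df.repartition(8)
more.rdd.getNumPartitions        // 8

// Decreasing can avoid the shuffle by merging existing partitions.
val fewer = more.coalesce(2)
fewer.rdd.getNumPartitions       // 2
```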
A much better one-liner (easier to understand in the UI: the repartition
introduces a single shuffle, so it is one simple job with two stages):
```
spark.read.text("README.md").repartition(2).take(1)
```
Attila Zsolt Piros wrote:
No, it won't be reused.
You should reuse the dataframe for reusing the shuffle blocks (and cached
data).
I know this because the two actions will lead to building two separate
DAGs, but I will show you a way you can check this on your own (with a
small, simple Spark application). For this, see the sketch below.
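A minimal version of that check (a sketch only, assuming a spark-shell session with a local master; "README.md" is a stand-in input):
```
// Two independent lineages: each collect() builds its own DAG,
// so the shuffle introduced by repartition runs twice.
spark.read.text("README.md").repartition(2).collect()
spark.read.text("README.md").repartition(2).collect()

// One reused DataFrame: the second collect() finds the shuffle
// blocks written by the first job, so its map stage should show
// up as "skipped" in the Spark UI.
val df = spark.read.text("README.md").repartition(2)
df.collect()
df.collect()
```
Compare the jobs in the UI (http://localhost:4040 by default): the first pair produces two full two-stage jobs, while the reused df should show the second job with a skipped stage.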
Hi,
An interesting question that I must admit I'm not sure how to answer myself,
actually :)
Off the top of my head, I'd **guess** that unless you cache the first query,
these two queries would share nothing. With caching, there's a phase in
query execution when a canonicalized version of a query is used to look up
already-cached plans.
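One way to check that guess (just a sketch; the queries are arbitrary stand-ins):
```
// Cache the first query and materialize it with an action.
val q1 = spark.range(10).selectExpr("id * 2 AS x")
q1.cache()
q1.count()

// A separately built but semantically identical query: if its
// canonicalized plan matches the cached one, the physical plan
// should show an in-memory scan instead of recomputing the range.
val q2 = spark.range(10).selectExpr("id * 2 AS x")
q2.explain()
```
Without the cache() call, the two queries would indeed share nothing: each action builds and executes its own plan.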