Increase or Decrease the number of data partitions: Since a data partition
represents the quantum of data to be processed together by a single Spark
Task, there could be situations:
(a) where the existing number of data partitions is not sufficient to
maximize the usage of available resources
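For context, both directions are available directly on a DataFrame. A minimal sketch (spark-shell style; "README.md" is just a placeholder input):
```
val df = spark.read.text("README.md")
df.rdd.getNumPartitions          // the current number of partitions

// Increasing the partition count requires a full shuffle.
val more = df.repartition(8)
more.rdd.getNumPartitions        // 8

// Decreasing can avoid the shuffle by merging existing partitions.
val fewer = more.coalesce(2)
fewer.rdd.getNumPartitions       // 2
```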
A much better one-liner (easier to understand in the UI: the repartition
introduces a single shuffle, so it is one simple job with two stages):
```
spark.read.text("README.md").repartition(2).take(1)
```
Attila Zsolt Piros wrote:
No, it won't be reused.
You should reuse the dataframe for reusing the shuffle blocks (and cached
data).
I know this because the two actions will lead to building two separate
DAGs, but I will show you a way you can check this on your own (with a
small, simple Spark application). For this, see the sketch below.
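A minimal version of that check (a sketch only, assuming a spark-shell session with a local master; "README.md" is a stand-in input):
```
// Two independent lineages: each collect() builds its own DAG,
// so the shuffle introduced by repartition runs twice.
spark.read.text("README.md").repartition(2).collect()
spark.read.text("README.md").repartition(2).collect()

// One reused DataFrame: the second collect() finds the shuffle
// blocks written by the first job, so its map stage should show
// up as "skipped" in the Spark UI.
val df = spark.read.text("README.md").repartition(2)
df.collect()
df.collect()
```
Compare the jobs in the UI (http://localhost:4040 by default): the first pair produces two full two-stage jobs, while the reused df should show the second job with a skipped stage.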
Hi,
An interesting question that I must admit I'm not sure how to answer myself,
actually :)
Off the top of my head, I'd **guess** that unless you cache the first query,
these two queries would share nothing. With caching, there's a phase in
query execution when a canonicalized version of a query is used to look up
already-cached plans.
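One way to check that guess (just a sketch; the queries are arbitrary stand-ins):
```
// Cache the first query and materialize it with an action.
val q1 = spark.range(10).selectExpr("id * 2 AS x")
q1.cache()
q1.count()

// A separately built but semantically identical query: if its
// canonicalized plan matches the cached one, the physical plan
// should show an in-memory scan instead of recomputing the range.
val q2 = spark.range(10).selectExpr("id * 2 AS x")
q2.explain()
```
Without the cache() call, the two queries would indeed share nothing: each action builds and executes its own plan.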