Hi, I am trying to thoroughly understand the concepts below in Spark.

1. A job reads 2 files and performs a cartesian join (a minimal sketch of the kind of job I mean is included at the end of this post).
2. The input sizes are 55.7 MB and 67.1 MB.
3. After reading the input files, Spark performed a shuffle, and for both inputs the shuffle size was only in KB. I want to understand why this size is not the complete size of the file. My understanding is that only the records which need to move from one executor to another are shuffled, and it is not necessary to shuffle the whole file. Is this understanding correct?
4. What are shuffle spill (memory) and shuffle spill (disk)? Do these represent the same data, measured once in memory and once on disk? And how are these values calculated?
5. When does a shuffle need to spill? It needs to spill when the data does not fit in memory, but in which situations or scenarios can this happen?
6. On the SQL tab for the join above there are 2 exchanges, with the following data size totals:
   - Exchange 1: 190. MB (min 37.3 MB, med 51.0 MB, max 51.0 MB)
   - Exchange 2: 228.9 MB (min 46.9 MB, med 60.6 MB, max 60.6 MB)
   How are these figures calculated? For the Sort nodes the figures are:
   - Sort 1: peak memory total 524.2 MB (min 64 KB, med 64 KB, max 128 MB)
   - Sort 2: peak memory total 576.0 MB (min 144 MB, med 144 MB, max 144 MB)

I am trying to understand many things here. If you can point me to some kind of guide, link, or book where I can find answers to the questions above (along with other, related questions), that would be great.
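For reference, here is a minimal sketch of the kind of job I am describing. The file paths, the CSV format, and the use of count() to trigger execution are placeholders just for illustration, not my exact code:

```scala
import org.apache.spark.sql.SparkSession

object CartesianJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cartesian-join-example")
      .getOrCreate()

    // Placeholder paths; in my job these are the 55.7 MB and 67.1 MB inputs.
    val left  = spark.read.option("header", "true").csv("/data/input_a.csv")
    val right = spark.read.option("header", "true").csv("/data/input_b.csv")

    // Cartesian (cross) join of the two inputs.
    val joined = left.crossJoin(right)

    // Force execution so the job, its shuffle metrics and the SQL plan
    // (Exchange / Sort nodes) show up in the Spark UI.
    println(joined.count())

    spark.stop()
  }
}
```

The shuffle sizes, spill figures, and data size totals I quote above are what the Stages and SQL tabs of the Spark UI show for this kind of job.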