Hi, I am trying to thoroughly understand the concepts below in Spark.

1. A job reads 2 files and performs a cartesian join (a minimal sketch of the kind of job I mean is included at the end of this post).
2. The input sizes are 55.7 MB and 67.1 MB.
3. After reading the input files, Spark performed a shuffle, and for both inputs the shuffle size was only in KB. I want to understand why this size is not the complete size of the file. My understanding is that only the records which need to move from one executor to another are shuffled, and it is not necessary to shuffle the whole file. Is this understanding correct?
4. What are shuffle spill (memory) and shuffle spill (disk)? Do these represent the same data, measured once in memory and once on disk? And how are these values calculated?
5. When does a shuffle need to spill? It needs to spill when the data does not fit in memory, but in which situations or scenarios can this happen?
6. On the SQL tab for the join above there are 2 exchanges, with the following data size totals:
   - Exchange 1: 190. MB (min 37.3 MB, med 51.0 MB, max 51.0 MB)
   - Exchange 2: 228.9 MB (min 46.9 MB, med 60.6 MB, max 60.6 MB)
   How are these figures calculated? For the Sort nodes the figures are:
   - Sort 1: peak memory total 524.2 MB (min 64 KB, med 64 KB, max 128 MB)
   - Sort 2: peak memory total 576.0 MB (min 144 MB, med 144 MB, max 144 MB)

I am trying to understand many things here. If you can point me to some kind of guide, link, or book where I can find answers to the questions above (along with other, related questions), that would be great.
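For reference, here is a minimal sketch of the kind of job I am describing. The file paths, the CSV format, and the use of count() to trigger execution are placeholders just for illustration, not my exact code:

```scala
import org.apache.spark.sql.SparkSession

object CartesianJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cartesian-join-example")
      .getOrCreate()

    // Placeholder paths; in my job these are the 55.7 MB and 67.1 MB inputs.
    val left  = spark.read.option("header", "true").csv("/data/input_a.csv")
    val right = spark.read.option("header", "true").csv("/data/input_b.csv")

    // Cartesian (cross) join of the two inputs.
    val joined = left.crossJoin(right)

    // Force execution so the job, its shuffle metrics and the SQL plan
    // (Exchange / Sort nodes) show up in the Spark UI.
    println(joined.count())

    spark.stop()
  }
}
```

The shuffle sizes, spill figures, and data size totals I quote above are what the Stages and SQL tabs of the Spark UI show for this kind of job.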