Hi Krexos,

If I understand correctly, you are asking: if Spark also involves disk I/O, how is it an advantage over MapReduce?
Basically, MapReduce writes every intermediate result to disk, so on average it involves around 6 disk I/O operations, whereas Spark (assuming it has enough memory to hold the intermediate results) involves roughly three times less, i.e. essentially only reading the input data from disk and writing the final results back to it. The shuffle files you point to are written only to the executors' local disks between stages; the intermediate results themselves are not materialized and replicated to HDFS the way each chained MapReduce job's output is. I've put a small sketch below your quoted message to illustrate.

Thanks,
Sid

On Sat, 2 Jul 2022, 17:58 krexos, <kre...@protonmail.com.invalid> wrote:

> Hello,
>
> One of the main "selling points" of Spark is that unlike Hadoop map-reduce,
> which persists intermediate results of its computation to HDFS (disk),
> Spark keeps all its results in memory. I don't understand this, as in
> reality, when a Spark stage finishes it writes all of the data into
> shuffle files stored on the disk
> <https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md>.
> How then is this an improvement on map-reduce?
>
> Image from https://youtu.be/7ooZ4S7Ay6Y
>
> thanks!
>
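
P.S. Here is the sketch I mentioned, a minimal word-count-style Spark job in Scala. The HDFS paths and the final filter are made up purely for illustration; the point is only that HDFS is touched twice (read input, write output), while everything in between stays in memory except the shuffle's map output, which goes to the executors' local disks.

import org.apache.spark.sql.SparkSession

// Minimal sketch: one Spark job, two stages, HDFS touched only at the ends.
// Paths and the filter threshold are invented for illustration.
object DiskIoSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("disk-io-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("hdfs:///tmp/input.txt")   // disk read (HDFS)
      .flatMap(_.split("\\s+"))        // narrow transformations are pipelined
      .map(word => (word, 1))          // in memory, nothing is materialized
      .reduceByKey(_ + _)              // shuffle: map output is written to the
                                       // executors' local disks, not to HDFS
      .filter { case (_, n) => n > 1 } // post-shuffle stage keeps working in memory

    counts.saveAsTextFile("hdfs:///tmp/output")          // disk write (HDFS)

    spark.stop()
  }
}

The equivalent flow in classic MapReduce would typically be two chained jobs (count, then filter), with the first job's complete output materialized to HDFS before the second job reads it back, which is where the extra disk I/O comes from.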