Hi Krexos,

If I understand correctly, you are asking: if Spark also involves disk I/O, how is it an advantage over MapReduce?
Basically, MapReduce writes every intermediate result to disk, so on average it involves around 6 disk I/O operations, whereas Spark (assuming it has enough memory to hold the intermediate results) involves roughly three times less, i.e. essentially only reading the input data from disk and writing the final results back to it. The shuffle files you point to are written only to the executors' local disks between stages; the intermediate results themselves are not materialized and replicated to HDFS the way each chained MapReduce job's output is. I've put a small sketch below your quoted message to illustrate.

Thanks,
Sid

On Sat, 2 Jul 2022, 17:58 krexos, <kre...@protonmail.com.invalid> wrote:

> Hello,
>
> One of the main "selling points" of Spark is that unlike Hadoop map-reduce,
> which persists intermediate results of its computation to HDFS (disk),
> Spark keeps all its results in memory. I don't understand this, as in
> reality, when a Spark stage finishes it writes all of the data into
> shuffle files stored on the disk
> <https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md>.
> How then is this an improvement on map-reduce?
>
> Image from https://youtu.be/7ooZ4S7Ay6Y
>
> thanks!
>
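
P.S. Here is the sketch I mentioned, a minimal word-count-style Spark job in Scala. The HDFS paths and the final filter are made up purely for illustration; the point is only that HDFS is touched twice (read input, write output), while everything in between stays in memory except the shuffle's map output, which goes to the executors' local disks.

import org.apache.spark.sql.SparkSession

// Minimal sketch: one Spark job, two stages, HDFS touched only at the ends.
// Paths and the filter threshold are invented for illustration.
object DiskIoSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("disk-io-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("hdfs:///tmp/input.txt")   // disk read (HDFS)
      .flatMap(_.split("\\s+"))        // narrow transformations are pipelined
      .map(word => (word, 1))          // in memory, nothing is materialized
      .reduceByKey(_ + _)              // shuffle: map output is written to the
                                       // executors' local disks, not to HDFS
      .filter { case (_, n) => n > 1 } // post-shuffle stage keeps working in memory

    counts.saveAsTextFile("hdfs:///tmp/output")          // disk write (HDFS)

    spark.stop()
  }
}

The equivalent flow in classic MapReduce would typically be two chained jobs (count, then filter), with the first job's complete output materialized to HDFS before the second job reads it back, which is where the extra disk I/O comes from.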