I am new to Spark and I understand that Spark divides the executor memory into the following fractions:
*RDD storage:* the fraction Spark uses to store RDDs persisted with .persist() or .cache(); it can be set via spark.storage.memoryFraction (default 0.6).

*Shuffle and aggregation buffers:* the fraction Spark uses to hold shuffle outputs; it can be set via spark.shuffle.memoryFraction (default 0.2). If the shuffle output exceeds this fraction, Spark spills the data to disk.

*User code:* the remaining fraction Spark uses to execute arbitrary user code (default 0.2).

I am leaving out the storage and shuffle safety fractions for simplicity.

My question is: which memory fraction does Spark use to compute and transform RDDs that are not going to be persisted? For example:

    from operator import add

    lines = sc.textFile("i am a big file.txt")
    count = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
    count.saveAsTextFile("output")

Here Spark will not load the whole file at once; it will partition the input file and run all of these transformations per partition in a single stage. But which memory fraction does Spark use to load the partitioned lines and compute flatMap() and map()?

Thanks
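To make the configuration concrete, here is a minimal sketch of how the two fractions above can be set explicitly via SparkConf, and of an RDD that actually does occupy the storage fraction because it is cached. The values are just the defaults, the app name and file path are made up, and this is only an illustration of the settings, not a recommendation:

    from pyspark import SparkConf, SparkContext

    # Legacy (pre-unified-memory) settings; the values shown are the defaults.
    conf = (SparkConf()
            .setAppName("memory-fraction-example")
            .set("spark.storage.memoryFraction", "0.6")   # cached/persisted RDDs
            .set("spark.shuffle.memoryFraction", "0.2"))  # shuffle/aggregation buffers
    sc = SparkContext(conf=conf)

    lines = sc.textFile("i am a big file.txt")
    # Only an explicitly persisted RDD like this one lives in the storage
    # fraction; the un-persisted transformations in my question do not.
    words = lines.flatMap(lambda x: x.split(' ')).cache()
    print(words.count())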