I am new to Spark and I understand that Spark divides the executor memory
into the following fractions:

*RDD Storage:* Which Spark uses to store RDDs persisted with .persist() or
.cache(). It can be set via spark.storage.memoryFraction (default 0.6)

*Shuffle and aggregation buffers:* Which Spark uses to store shuffle
outputs. It can be set via spark.shuffle.memoryFraction (default 0.2). If
the shuffle output exceeds this fraction, Spark spills data to disk.

*User code:* Spark uses this fraction to execute arbitrary user code
(default 0.2)

I am not mentioning the storage and shuffle safety fractions for simplicity.
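For reference, here is a minimal sketch of how I am setting these fractions
(the app name and the values are just illustrative, not recommendations):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("memory-fraction-example")  # illustrative name
        # fraction of executor memory for cached/persisted RDDs
        .set("spark.storage.memoryFraction", "0.6")
        # fraction of executor memory for shuffle/aggregation buffers
        .set("spark.shuffle.memoryFraction", "0.2"))
sc = SparkContext(conf=conf)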

My question is: which memory fraction does Spark use to compute and
transform RDDs that are not going to be persisted? For example:

from operator import add

lines = sc.textFile("i am a big file.txt")
count = lines.flatMap(lambda x: x.split(' ')) \
             .map(lambda x: (x, 1)) \
             .reduceByKey(add)
count.saveAsTextFile("output")

Here Spark will not load the whole file at once; it will partition the input
file and apply all of these transformations per partition in a single stage.
Which memory fraction, then, does Spark use to load the partitioned lines and
to compute flatMap() and map()?
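For contrast, my understanding is that if I explicitly persisted the
intermediate RDD, it would land in the RDD storage fraction instead,
something like this (the variable name and output path are just
illustrative):

words = lines.flatMap(lambda x: x.split(' ')).persist()  # cached in the storage fraction
count = words.map(lambda x: (x, 1)).reduceByKey(add)
count.saveAsTextFile("output_cached")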

Thanks


