As I've mentioned before, I am currently writing my master's thesis on storage and memory usage in Spark. Right now I am looking specifically at the different memory fractions:
I was able to find three memory regions, but they seem to leave some memory unaccounted for:

1. spark.shuffle.memoryFraction: 20%
2. spark.storage.memoryFraction: 60%
3. spark.storage.unrollFraction: 20% of spark.storage.memoryFraction = 12%
4a. Unaccounted: 100 - (20 + 60 + 12) = 8%
    or, if the unroll fraction not only is proportional to, but also resides within, spark.storage.memoryFraction:
4b. Unaccounted: 100 - (20 + 0.8*60 + 0.2*60) = 20%

Question 1: How big is the unaccounted fraction, and what is it used for? (Expected answer: the Spark environment itself.)

Question 2: What is stored in spark.storage.memoryFraction? From the log messages, with all RDDs cached:

14/09/23 10:56:56 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 184.7 KB, free 47.1 GB)
14/09/23 13:13:11 INFO MemoryStore: Block rdd_1_1 stored as values in memory (estimated size 1458.0 MB, free 47.1 GB)

Expected answer: broadcast variables, cached RDDs, and potentially unrolled blocks (although the latter are not seen here, and not noticeable in the size reduction in these log messages).

Remark: If nothing else resides in this area, then whenever the user does not call .cache() or .persist(MEMORY), a lot of memory is left unused, since broadcasts are comparatively small and unrolling, if it is stored here, takes at most 20% of the 60%, right?

Question 3: Which RDDs are not only instantiated, but also actually filled with data? I am trying to estimate the size of my dataset. I know that because of lazy evaluation we can never be certain, but it should be possible to estimate a minimum. Is it safe to assume that at least the RDDs that are the output of a sort/shuffle stage, and the ones on which the user calls cache(), persist(MEMORY), or collect(), are not only instantiated but also filled with data? And are there any other assumptions we can make, for instance about the remaining RDDs?

Question 4a: Where is intermediate data between stages stored?
Question 4b: Where is intermediate data during stages stored?
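To make the two interpretations in 4a/4b concrete, here is the arithmetic as a small self-contained Scala sketch (no Spark needed; the 0.2/0.6/0.2 values are the defaults quoted above, and the object/field names are my own, not Spark APIs):

```scala
object MemoryFractions {
  val shuffleFraction = 0.20 // spark.shuffle.memoryFraction (default)
  val storageFraction = 0.60 // spark.storage.memoryFraction (default)
  val unrollFraction  = 0.20 // spark.storage.unrollFraction (default)

  // Interpretation 4a: unroll space counted on top of shuffle + storage
  val unaccounted4a =
    1.0 - (shuffleFraction + storageFraction + unrollFraction * storageFraction)

  // Interpretation 4b: unroll space carved out of the storage fraction
  val unaccounted4b =
    1.0 - (shuffleFraction
           + (1.0 - unrollFraction) * storageFraction
           + unrollFraction * storageFraction)

  def main(args: Array[String]): Unit = {
    println(f"4a unaccounted: ${unaccounted4a * 100}%.0f%%") // 8%
    println(f"4b unaccounted: ${unaccounted4b * 100}%.0f%%") // 20%
  }
}
```

Note that under interpretation 4b the unroll term cancels out (0.8*60 + 0.2*60 = 60), so the unaccounted share is simply 100 - (20 + 60) = 20%.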
When I do not use rdd.cache(), I do not see the memory in storage.memoryFraction go up, so I think we can eliminate this fraction. The intermediate data from a sort/shuffle goes through the ExternalSorter or the ExternalAppendOnlyMap, which relates to the shuffle portion. Is this data moved and removed at the end of the stage, or does the next stage retrieve it from there? Is there any other intermediate data? If only RDDs that relate to a sort/shuffle are filled, then I expect the data to live in this area, but it might also be that it is moved out once the particular shuffle finishes?

Question 5: If I have sufficient memory (256 GB), will there be a difference in execution time between caching no RDDs and caching all RDDs? I did not expect one, but my intermediate results show a 1.5 to 2x difference.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-memory-regions-tp8577.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.