Hi all, in order to understand how Spark uses memory, I ran the following test:
    val size = 1024 * 1024
    val array = new Array[Int](size)
    for (i <- 0 until size) {
      array(i) = i
    }

    val a = sc.parallelize(array).cache()     // 1M Ints, 4 MB of raw data

    val b = a.mapPartitions { c =>
      val d = c.toArray                       // NB: indexing below assumes all `size` elements landed in one partition
      val e = new Array[Int](2 * size)        // 8 MB
      val f = new Array[Int](2 * size)        // 8 MB
      for (i <- 0 until 2 * size) {
        e(i) = d(i % size)
        f(i) = d((i + 1) % size)
      }
      (e ++ f).toIterator                     // 4M Ints, 16 MB of raw data
    }.cache()

When I compile and run this with sbt, the estimated sizes of a and b are exactly 7 times the raw data sizes (a holds 2^20 Ints at 4 bytes each = 4 MB, b holds 4 * 2^20 Ints = 16 MB; 28.0 / 4 = 112 / 16 = 7):

14/04/15 09:10:55 INFO storage.MemoryStore: Block rdd_0_0 stored as values to memory (estimated size 28.0 MB, free 862.9 MB)
14/04/15 09:10:55 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Added rdd_0_0 in memory on ubuntu.local:59962 (size: 28.0 MB, free: 862.9 MB)
14/04/15 09:10:56 INFO storage.MemoryStore: Block rdd_1_0 stored as values to memory (estimated size 112.0 MB, free 750.9 MB)
14/04/15 09:10:56 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Added rdd_1_0 in memory on ubuntu.local:59962 (size: 112.0 MB, free: 750.9 MB)

But when I try the same code in the spark-shell, the estimated size is almost equal to the raw size:

14/04/15 09:23:27 INFO MemoryStore: Block rdd_0_0 stored as values to memory (estimated size 4.2 MB, free 292.7 MB)
14/04/15 09:23:27 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_0_0 in memory on ubuntu.local:54071 (size: 4.2 MB, free: 292.7 MB)
14/04/15 09:27:40 INFO MemoryStore: Block rdd_1_0 stored as values to memory (estimated size 17.0 MB, free 275.8 MB)
14/04/15 09:27:40 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_1_0 in memory on ubuntu.local:54071 (size: 17.0 MB, free: 275.8 MB)

Does anyone know the reason? I'm really confused about memory use in Spark. (One probe I have been using to investigate is sketched at the end of this mail.)

Here is a picture of my current understanding:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n4251/memory.png>

The JVM heap and Spark's storage memory sit in different parts of system memory, and Spark code executes in JVM memory, so an allocation like val e = new Array[Int](2 * size) /* 8 MB */ uses JVM memory. If an RDD is not cached, the generated partitions are written back to disk; if it is cached, they are copied into Spark's storage memory. Is that right? (A second sketch at the end shows how I read the persist API.)
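One guess I wanted to rule out is boxing: if the Ints somehow end up stored as java.lang.Integer objects on the sbt path, each element costs an object header plus a reference instead of 4 bytes, which could plausibly produce a factor in this range. A minimal sketch of that probe, assuming SizeEstimator.estimate is reachable from your code (it is a public @DeveloperApi in newer Spark releases, but private[spark] in older ones, so this may not compile everywhere):

    import org.apache.spark.util.SizeEstimator

    val n = 1024 * 1024
    val primitive = new Array[Int](n)                   // unboxed: ~4 bytes per element
    val boxed: Array[Any] = primitive.map(x => x: Any)  // forces each Int into a java.lang.Integer

    println(SizeEstimator.estimate(primitive))  // roughly 4 MB
    println(SizeEstimator.estimate(boxed))      // several times larger: object header plus reference per element

If the two estimates differ by roughly the factor seen in the logs, that would point at the elements being boxed in the compiled job but not in the shell.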
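And on the cache question at the end, here is how I currently read the persist API; this is a sketch of my understanding, not a definitive statement:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(array)  // fresh RDD, no storage level assigned yet

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY): blocks are kept as
    // deserialized objects on the JVM heap, inside the fraction the BlockManager manages.
    // With MEMORY_AND_DISK, blocks that do not fit in that budget spill to disk instead.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)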