Hi all, in order to understand how Spark uses memory, I ran the following test:
    val size = 1024 * 1024
    val array = new Array[Int](size)
    for (i <- 0 until size) {
      array(i) = i
    }

    val a = sc.parallelize(array).cache()     // 1M Ints, 4 MB of raw data

    val b = a.mapPartitions { c =>
      val d = c.toArray                       // NB: indexing below assumes all `size` elements landed in one partition
      val e = new Array[Int](2 * size)        // 8 MB
      val f = new Array[Int](2 * size)        // 8 MB
      for (i <- 0 until 2 * size) {
        e(i) = d(i % size)
        f(i) = d((i + 1) % size)
      }
      (e ++ f).toIterator                     // 4M Ints, 16 MB of raw data
    }.cache()

When I compile and run this with sbt, the estimated sizes of a and b are exactly 7 times the raw data sizes (a holds 2^20 Ints at 4 bytes each = 4 MB, b holds 4 * 2^20 Ints = 16 MB; 28.0 / 4 = 112 / 16 = 7):

14/04/15 09:10:55 INFO storage.MemoryStore: Block rdd_0_0 stored as values to memory (estimated size 28.0 MB, free 862.9 MB)
14/04/15 09:10:55 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Added rdd_0_0 in memory on ubuntu.local:59962 (size: 28.0 MB, free: 862.9 MB)
14/04/15 09:10:56 INFO storage.MemoryStore: Block rdd_1_0 stored as values to memory (estimated size 112.0 MB, free 750.9 MB)
14/04/15 09:10:56 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Added rdd_1_0 in memory on ubuntu.local:59962 (size: 112.0 MB, free: 750.9 MB)

But when I try the same code in the spark-shell, the estimated size is almost equal to the raw size:

14/04/15 09:23:27 INFO MemoryStore: Block rdd_0_0 stored as values to memory (estimated size 4.2 MB, free 292.7 MB)
14/04/15 09:23:27 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_0_0 in memory on ubuntu.local:54071 (size: 4.2 MB, free: 292.7 MB)
14/04/15 09:27:40 INFO MemoryStore: Block rdd_1_0 stored as values to memory (estimated size 17.0 MB, free 275.8 MB)
14/04/15 09:27:40 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_1_0 in memory on ubuntu.local:54071 (size: 17.0 MB, free: 275.8 MB)

Does anyone know the reason? I'm really confused about memory use in Spark. (One probe I have been using to investigate is sketched at the end of this mail.)

Here is a picture of my current understanding:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n4251/memory.png>

The JVM heap and Spark's storage memory sit in different parts of system memory, and Spark code executes in JVM memory, so an allocation like val e = new Array[Int](2 * size) /* 8 MB */ uses JVM memory. If an RDD is not cached, the generated partitions are written back to disk; if it is cached, they are copied into Spark's storage memory. Is that right? (A second sketch at the end shows how I read the persist API.)
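One guess I wanted to rule out is boxing: if the Ints somehow end up stored as java.lang.Integer objects on the sbt path, each element costs an object header plus a reference instead of 4 bytes, which could plausibly produce a factor in this range. A minimal sketch of that probe, assuming SizeEstimator.estimate is reachable from your code (it is a public @DeveloperApi in newer Spark releases, but private[spark] in older ones, so this may not compile everywhere):

    import org.apache.spark.util.SizeEstimator

    val n = 1024 * 1024
    val primitive = new Array[Int](n)                   // unboxed: ~4 bytes per element
    val boxed: Array[Any] = primitive.map(x => x: Any)  // forces each Int into a java.lang.Integer

    println(SizeEstimator.estimate(primitive))  // roughly 4 MB
    println(SizeEstimator.estimate(boxed))      // several times larger: object header plus reference per element

If the two estimates differ by roughly the factor seen in the logs, that would point at the elements being boxed in the compiled job but not in the shell.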
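And on the cache question at the end, here is how I currently read the persist API; this is a sketch of my understanding, not a definitive statement:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(array)  // fresh RDD, no storage level assigned yet

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY): blocks are kept as
    // deserialized objects on the JVM heap, inside the fraction the BlockManager manages.
    // With MEMORY_AND_DISK, blocks that do not fit in that budget spill to disk instead.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)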