If all RDD elements within a partition contain pointers to a single shared object, Spark persists the RDD as expected when it is small. However, once the RDD has more than 200 elements, Spark reports needing far more memory than it actually uses. This becomes a problem for large RDDs, because Spark refuses to persist the block even though it could. Is this a bug, or is there a feature I'm missing?

Cheers,
Luke
    val n = ???                                    // number of elements per partition
    class Elem(val s: Array[Int])

    val rdd = sc.parallelize(Seq(1)).mapPartitions { _ =>
      val sharedArray = Array.ofDim[Int](10000000) // should require ~40MB
      (1 to n).toIterator.map(_ => new Elem(sharedArray))
    }.cache()
    rdd.count()                                    // force computation

For n = 100:  MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 38.1 MB, free 898.7 MB)
For n = 200:  MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 38.2 MB, free 898.7 MB)
For n = 201:  MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 76.7 MB, free 860.2 MB)
For n = 5000: MemoryStore: Not enough space to cache rdd_1_0 in memory! (computed 781.3 MB so far)

Note: for medium-sized n (n > 200, but small enough that Spark still caches the block), the application's actual memory usage stays where it should; Spark just seems to vastly over-report how much memory the cached block takes.
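For reference, the claim in the note can be sanity-checked by wrapping the forcing count() as below. This is only a sketch: it assumes local mode (so the executor's MemoryStore lives in the same JVM as the driver), System.gc() is merely a hint to the JVM, and the helper name is just illustrative.

    // Rough check of actual heap growth around the count() that caches rdd_1_0.
    def usedHeapMB(): Long = {
      System.gc()                                  // only a hint, so figures are approximate
      val rt = Runtime.getRuntime
      (rt.totalMemory - rt.freeMemory) / (1024 * 1024)
    }

    val before = usedHeapMB()
    rdd.count()                                    // force computation and caching
    val after = usedHeapMB()
    println(s"Heap grew by ~${after - before} MB") // per the note above, stays near ~40 MB even for n > 200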