Hi folks,

My Spark cluster has 8 machines, each with 377GB of physical memory, so the total memory available to Spark is more than 2400GB. In my program I have to deal with 1 billion (key, value) pairs, where the key is an integer and the value is an integer array with 43 elements. The memory cost of this raw dataset is therefore [(1+43) * 1000000000 * 4] / (1024 * 1024 * 1024) = 164GB.
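For reference, here is the same back-of-the-envelope estimate as a quick Scala snippet (it assumes 4 bytes per Int and ignores all JVM object and pointer overhead, which is exactly what turns out to matter below):

    val pairs = 1000000000L
    val bytesPerPair = (1 + 43) * 4L                      // one Int key + 43 Int values, 4 bytes each
    val rawGB = pairs * bytesPerPair / math.pow(1024, 3)
    println(f"$rawGB%.0f GB")                             // ~164 GB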
Since I have to use this dataset repeatedly, I have to cache it in memory. The key parameter settings are: spark.storage.memoryFraction=0.6, spark.driver.memory=30GB, spark.executor.memory=310GB.

However, the job failed while running a simple countByKey(), with the error "java.lang.OutOfMemoryError: Java heap space...". Does this mean a Spark cluster with 2400+GB of memory cannot keep 164GB of raw data in memory?

The code of my program is as follows:

    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf())
      val rdd = sc.parallelize(0 until 1000000000, 25600)
        .map(i => (i, new Array[Int](43)))
        .cache()
      println("The number of keys is " + rdd.countByKey())
      // some other operations follow here ...
    }

To figure out the issue, I measured the memory cost of a single key-value pair using SizeOf.jar:

    val arr = new Array[Int](43)
    println(SizeOf.humanReadable(SizeOf.deepSizeOf(arr)))

    val tuple = (1, arr.clone)
    println(SizeOf.humanReadable(SizeOf.deepSizeOf(tuple)))

The output is:

    192.0b
    992.0b

Hard to believe, but it is true! This means that storing one key-value pair as a Tuple2 takes more than 5 times the memory of the simplest flat-array layout. Even at 5+ times the raw size, though, the total is under 1000GB, which is still much less than the 2400+GB of my cluster. I really do not understand why this happens.

BTW, if the number of pairs is 1 million, everything works well. And if arr contains only 1 integer, storing a pair takes around 10 times the memory of the raw data.

So I have two questions:

1. Why does Spark use such a memory-hungry data structure (Tuple2) for key-value pairs? Is there a better data structure for storing (key, value) pairs with a lower memory cost?

2. Given a dataset of size M, how many times M of memory does Spark need, in general, to handle it?

Best,
Landmark
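P.S. For concreteness, this is the kind of alternative I have in mind for question 1: keeping the cached RDD in serialized form instead of as live JVM objects. It is only a sketch under my assumptions (MEMORY_ONLY_SER plus Kryo should shrink the per-record object overhead at the cost of extra CPU for deserialization); I have not verified it on the full dataset.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._          // pair-RDD functions on older Spark versions
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Same data as before, but cached as serialized bytes rather than as
    // one Tuple2 + boxed key + Array[Int] object per record.
    val rdd = sc.parallelize(0 until 1000000000, 25600)
      .map(i => (i, new Array[Int](43)))
      .persist(StorageLevel.MEMORY_ONLY_SER)

    println("The number of keys is " + rdd.countByKey())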