A number of comments: 310GB is probably too large a heap for a single executor; you generally want several smaller executors per machine, not least because very large JVM heaps tend to suffer from long GC pauses. But that is not your problem here.
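For what it's worth, on a standalone cluster one way to do that is to run several worker instances per machine instead of one huge one. A rough sketch for conf/spark-env.sh; the numbers are illustrative, not tuned for your workload:

# Run 8 workers of ~37GB each per machine instead of one 310GB executor.
# Cap cores per worker too, or each instance will try to use them all.
export SPARK_WORKER_INSTANCES=8
export SPARK_WORKER_MEMORY=37g
export SPARK_WORKER_CORES=6   # depends on your hardware

You would then set spark.executor.memory to 37g or less in the application so that each executor fits in one worker.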
You didn't say where the OutOfMemoryError occurred: executor or driver?

Tuple2 is a Scala type, and a general one; it is the appropriate representation for arbitrary pairs. You're asking about optimizing for the case of a primitive array, yes, but of course Spark has to handle other types as well.

I don't quite understand your test result. An array doesn't change size because it is referred to from a Tuple2; you are still dealing with a primitive array. There is no general answer to your question. You usually have to account for the overhead of JVM object headers and references, which can matter significantly, but there is no constant multiplier, of course. If it matters enough, you can implement more memory-efficient data structures yourself; here, however, you are already using just about the most efficient representation of an array of integers there is. (A sketch of one way to cut the cached footprint anyway follows below the quoted message.)

I think you have plenty of memory in general, so the question is what exactly was throwing the memory error. I'd also confirm that the configuration your executors actually run with is what you expect, to rule out a configuration problem.

On Fri, Feb 13, 2015 at 6:26 AM, Landmark <fangyixiang...@gmail.com> wrote:
> Hi folks,
>
> My Spark cluster has 8 machines, each of which has 377GB of physical
> memory, so the total memory available to Spark is more than 2400GB. In my
> program I have to deal with 1 billion (key, value) pairs, where the key is
> an integer and the value is an integer array with 43 elements. The memory
> cost of this raw dataset is therefore [(1+43) * 1000000000 * 4] / (1024 *
> 1024 * 1024) = 164GB.
>
> Since I have to use this dataset repeatedly, I have to cache it in memory.
> Some key parameter settings are:
> spark.storage.fraction=0.6
> spark.driver.memory=30GB
> spark.executor.memory=310GB
>
> But it fails on a simple countByKey() with the error message
> "java.lang.OutOfMemoryError: Java heap space...". Does this mean a Spark
> cluster with 2400+GB of memory cannot keep 164GB of raw data in memory?
>
> The code of my program is as follows:
>
> def main(args: Array[String]): Unit = {
>   val sc = new SparkContext(new SparkConf())
>
>   val rdd = sc.parallelize(0 until 1000000000, 25600)
>     .map(i => (i, new Array[Int](43)))
>     .cache()
>   println("The number of keys is " + rdd.countByKey())
>
>   // some other operations follow here ...
> }
>
> To figure out the issue, I evaluated the memory cost of the key-value
> pairs using SizeOf.jar. The code is as follows:
>
> val arr = new Array[Int](43)
> println(SizeOf.humanReadable(SizeOf.deepSizeOf(arr)))
>
> val tuple = (1, arr.clone)
> println(SizeOf.humanReadable(SizeOf.deepSizeOf(tuple)))
>
> The output is:
> 192.0b
> 992.0b
>
> *Hard to believe, but it is true!! This result means that to store one
> key-value pair, Tuple2 needs more than 5 times the memory of the plain
> array representation. Yet even at 5+ times the memory, the total is less
> than 1000GB, which is still much less than the total memory of my cluster,
> i.e., 2400+GB. I really do not understand why this happens.*
>
> BTW, if the number of pairs is 1 million, it works well. And if arr
> contains only 1 integer, storing a pair takes around 10 times the memory
> of the raw data.
>
> So I have some questions:
> 1. Why does Spark choose such a memory-hungry data structure, Tuple2, for
> key-value pairs? Is there a better data structure for storing (key, value)
> pairs with less memory cost?
> 2. Given a dataset of size M, how many times M does Spark in general need
> in memory to handle it?
>
> Best,
> Landmark
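On the two questions at the end: Tuple2 is unavoidable for a Scala pair RDD, but you can stop paying the per-object JVM overhead for the cached copy by storing it serialized. Below is a rough sketch of what I would try, assuming Spark 1.x (MEMORY_ONLY_SER and Kryo are standard, but the actual savings depend on your data). Note also that countByKey() returns a driver-side Map with one entry per distinct key; with a billion distinct keys that alone can exhaust a 30GB driver heap, so the sketch uses count() for the sanity check instead:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CountTest {
  def main(args: Array[String]): Unit = {
    // Kryo keeps the serialized form of each (Int, Array[Int](43)) pair
    // close to its raw ~180 bytes, rather than the ~1KB object graph that
    // SizeOf reports for the deserialized Tuple2.
    val conf = new SparkConf()
      .setAppName("CountTest")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(0 until 1000000000, 25600)
      .map(i => (i, new Array[Int](43)))
      .persist(StorageLevel.MEMORY_ONLY_SER) // cache serialized bytes, not objects

    // countByKey() would materialize all 1e9 distinct keys on the driver;
    // count() answers the same sanity check without collecting anything large.
    println("The number of pairs is " + rdd.count())

    sc.stop()
  }
}

The trade-off is extra CPU to deserialize the pairs on each access, so whether this is a win depends on how often the cached data is re-read.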