Hi folks,

My Spark cluster has 8 machines, each with 377GB of physical memory, so the
total memory available to Spark is more than 2400GB. In my program, I have to
deal with 1 billion (key, value) pairs, where the key is an integer and the
value is an integer array with 43 elements. Therefore, the memory cost of this
raw dataset is [(1+43) * 1000000000 * 4] / (1024 * 1024 * 1024) ≈ 164GB.
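
Spelled out (assuming 4 bytes per Int and ignoring any JVM object or pointer
overhead), that back-of-the-envelope is:

// 44 Ints of 4 bytes each per pair, 1 billion pairs, no object overhead counted
val bytesPerPair = (1 + 43) * 4L                                // 176 bytes
val rawSizeGB = bytesPerPair * 1000000000L / math.pow(1024, 3)  // ~163.9 GB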

Since I have to use this dataset repeatedly, I cache it in memory.
Some key parameter settings are:
spark.storage.memoryFraction=0.6
spark.driver.memory=30GB
spark.executor.memory=310GB
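
For reference, these keys map onto SparkConf roughly as in the sketch below;
the app name is just a placeholder, and spark.driver.memory /
spark.executor.memory normally have to be set before the JVM starts (e.g. via
spark-submit), so the SparkConf form here is only illustrative:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("cache-billion-pairs")            // placeholder app name
  .set("spark.storage.memoryFraction", "0.6")   // Spark 1.x fraction of the heap used for cached blocks
  .set("spark.executor.memory", "310g")
  .set("spark.driver.memory", "30g")            // only takes effect if set before the driver JVM launches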

But it fails when running a simple countByKey(), with the error message
"java.lang.OutOfMemoryError: Java heap space...". Does this mean a Spark
cluster with 2400+GB of memory cannot keep 164GB of raw data in memory?

The code of my program is as follows:

import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf())

    // 1 billion (Int, Array[Int](43)) pairs in 25600 partitions, cached in memory
    val rdd = sc.parallelize(0 until 1000000000, 25600)
      .map(i => (i, new Array[Int](43)))
      .cache()
    println("The number of keys is " + rdd.countByKey().size)

    // some other operations following here ...
}




To figure out the issue, I measured the memory cost of a single key-value pair
using SizeOf.jar. The code is as follows:

val arr = new Array[Int](43)
println(SizeOf.humanReadable(SizeOf.deepSizeOf(arr)))     // the bare 43-element int array

val tuple = (1, arr.clone)
println(SizeOf.humanReadable(SizeOf.deepSizeOf(tuple)))   // the same array wrapped in a (key, value) Tuple2

The output is:
192.0b
992.0b
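
The same SizeOf calls can be applied piecewise to see how much of the 992 bytes
comes from the boxed Int key and the Tuple2 wrapper rather than from the array
itself (just a sketch, no numbers shown):

val boxedKey: java.lang.Integer = 1
println(SizeOf.humanReadable(SizeOf.deepSizeOf(boxedKey)))                // the boxed key on its own
println(SizeOf.humanReadable(SizeOf.deepSizeOf((1, new Array[Int](0)))))  // a Tuple2 holding an empty array, i.e. mostly wrapper overhead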


*Hard to believe, but it is true! This means that storing one key-value pair in
a Tuple2 takes more than 5 times the memory of the simplest array-based layout
(992 bytes vs. 192 bytes). Even at that overhead, the whole dataset is about
992 bytes * 1,000,000,000 ≈ 924GB, which is still much less than the total
memory of my cluster, i.e., 2400+GB. I really do not understand why this
happens.*

BTW, if the number of pairs is 1 million, everything works well. And if arr
contains only 1 integer, storing a pair as a Tuple2 takes around 10 times the
memory of the raw data.

So I have some questions:
1. Why does Spark use such a memory-hungry data structure, Tuple2, for
key-value pairs? Is there a better data structure for storing (key, value)
pairs with less memory cost?
2. Given a dataset of size M, how many times M of memory does Spark generally
need to handle it?


Best,
Landmark




