Need help getting around these errors. I have a program that runs fine on smaller input sizes, but as the input grows, Spark has increasing trouble staying efficient and finishing without errors. We have about 46 GB free on each node, and the workers and executors are configured to use all of it (the only way to avoid Heap Space or GC overhead limit errors). On the driver, the data only uses about 1.2 GB of RAM and has the form matrix: RDD[(Integer, Array[Float])]. It is a column-major matrix with 15k rows and 20k columns. Each column takes about 4 bytes * 15k = 60 KB, and 60 KB * 20k columns = 1.2 GB, so the data is not even that large. Eventually I want to test 60k x 70k.
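In case it helps, here is roughly what that layout looks like in Scala. This is a simplified sketch written for this message, not my actual code; the object/method names and the placeholder column contents are made up.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Column-major matrix stored as (column index, column values) pairs.
object MatrixLayout {
  val numRows = 15000  // entries per column
  val numCols = 20000  // number of columns

  def buildMatrix(sc: SparkContext): RDD[(Integer, Array[Float])] =
    sc.parallelize(0 until numCols).map { j =>
      (Integer.valueOf(j), new Array[Float](numRows))  // placeholder column data
    }

  // Back-of-the-envelope size: 4 bytes * 15k rows ~= 60 KB per column,
  // and 60 KB * 20k columns ~= 1.2 GB for the whole matrix.
  val bytesPerColumn: Long = 4L * numRows
  val totalBytes: Long = bytesPerColumn * numCols
}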
The covariance matrix algorithm we are using is basically O(N^3); at minimum, the outer loop needs to be parallelized:

for each column i in matrix
  for each column j in matrix
    get the covariance between columns i and j

The covariance itself is practically just "for the two columns, compute the sum of products", which is O(N). There is no need to parallelize that inner part, since the pair-level loop already gives us enough work and this piece is small.

Since I can't figure out any other way to do a permutation or a nested for loop over an RDD, I had to call matrix.cartesian(matrix).map{ pair => ... }. (I'll paste a simplified sketch of this at the end of the message.) For comparison, I could do 5k x 5k (1/4th of the work) with a HashMap instead of an RDD and finish in 10 seconds. With the RDD version, partitioning at 3000 takes 18 hours, 300 takes 12 hours, 200 fails (error #1), and 16, which would be ideal, fails as well (error #2). Note that I set the Akka frame size (spark.akka.frameSize in spark-defaults.conf) to 15 to get around some of the other Akka errors.

This is error #1:

This is error #2:
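Here is the rough sketch of the cartesian-based pair computation I mentioned above. It is simplified, not my exact code; the covariance helper and the ((i, j), value) result type are just for illustration, and the exact normalization in my real code may differ.

import org.apache.spark.rdd.RDD

object CovarianceSketch {

  // Plain O(N) sample covariance of two equally sized columns.
  def covariance(x: Array[Float], y: Array[Float]): Double = {
    val n = x.length
    val meanX = x.map(_.toDouble).sum / n
    val meanY = y.map(_.toDouble).sum / n
    var acc = 0.0
    var k = 0
    while (k < n) {
      acc += (x(k) - meanX) * (y(k) - meanY)
      k += 1
    }
    acc / (n - 1)
  }

  // All-pairs covariance via cartesian: every (column i, column j) pair is
  // materialized and mapped to one entry of the covariance matrix.
  def covarianceEntries(matrix: RDD[(Integer, Array[Float])])
      : RDD[((Integer, Integer), Double)] =
    matrix.cartesian(matrix).map { case ((i, colI), (j, colJ)) =>
      ((i, j), covariance(colI, colJ))
    }
}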