Hi,

I'm trying out the DIMSUM item similarity from GitHub master, commit 69c3f441a9b6e942d6c08afecd59a0349d61cc7b.

My matrix is:

  Num items: 8860
  Num users: 5138702
  Implicit 1.0 values

I'm running item similarity with threshold 0.5.
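In case it's useful, the job boils down to something like the sketch below. The object name and sample input are placeholders; the real rows RDD is built from our interaction logs for the ~5.1M users.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object ItemSimilaritySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("item-similarity"))
    val numItems = 8860

    // Placeholder input: (userId, indices of items the user interacted with).
    // The real RDD comes from our logs and has ~5.1M users.
    val userItems = sc.parallelize(Seq(
      (1L, Seq(0, 5, 42)),
      (2L, Seq(5, 7))
    ))

    // One sparse row per user, with an implicit 1.0 for each interacted item.
    val rows = userItems.map { case (_, items) =>
      Vectors.sparse(numItems, items.map(i => (i, 1.0)))
    }

    val mat = new RowMatrix(rows)

    // DIMSUM: approximate pairwise cosine similarities between item columns;
    // pairs below the 0.5 threshold may be dropped by the sampling.
    val similarities = mat.columnSimilarities(0.5)
    println(s"similar item pairs: ${similarities.entries.count()}")

    sc.stop()
  }
}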
I have a 2-slave Spark cluster on EC2 with m3.xlarge instances (13G each). I'm running out of heap space:

Exception in thread "handle-read-write-executor-1" java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
        at org.apache.spark.network.nio.Message$.create(Message.scala:90)

while Spark is doing:

  org.apache.spark.rdd.RDD.reduce(RDD.scala:865)
  org.apache.spark.mllib.rdd.RDDFunctions.treeAggregate(RDDFunctions.scala:111)
  org.apache.spark.mllib.linalg.distributed.RowMatrix.computeColumnSummaryStatistics(RowMatrix.scala:379)
  org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilarities(RowMatrix.scala:483)

At that point the Spark UI showed 162.6 MB of shuffle read on this task.

I run spark-submit from the master like this:

  ./spark/bin/spark-submit --executor-memory 13G .... --master spark://ec2....

I just wanted to check whether this is expected, since the matrix doesn't seem excessively big. Is there some memory setting I'm missing?

Thanks,
Clive
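P.S. In case the exact flags matter, the command is along these lines. The class and jar names are placeholders, and --driver-memory / spark.driver.maxResultSize are guesses at settings I might be missing, not things I've confirmed help:

# Only --executor-memory is set today; the last two flags are candidates
# I'm considering, since treeAggregate pulls the final result to the driver.
./spark/bin/spark-submit \
  --master spark://ec2.... \
  --class com.example.ItemSimilarity \
  --executor-memory 13G \
  --driver-memory 8G \
  --conf spark.driver.maxResultSize=2g \
  item-similarity.jar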