Hi,

I'm trying out the DIMSUM item similarity from GitHub master commit
69c3f441a9b6e942d6c08afecd59a0349d61cc7b. My matrix is:

Number of items: 8,860
Number of users: 5,138,702
Values: implicit 1.0s
Running item similarity with threshold: 0.5
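
In case it helps, the relevant part of my job is essentially the
following (a simplified sketch; the input path, parsing, and app name
are stand-ins for my real code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val sc = new SparkContext(new SparkConf().setAppName("ItemSimilarity"))

// One row per user, one column per item (8860 columns).
// Entries are implicit 1.0s for the items that user interacted with.
val rows = sc.textFile("hdfs:///path/to/interactions")  // placeholder path
  .map { line =>
    val itemIds = line.split(",").map(_.toInt).distinct
    Vectors.sparse(8860, itemIds.map(i => (i, 1.0)).toSeq)
  }

val mat = new RowMatrix(rows)  // 5,138,702 x 8,860

// DIMSUM: approximate all item-item cosine similarities above 0.5
val sims = mat.columnSimilarities(0.5)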

I have a 2-slave Spark cluster on EC2 with m3.xlarge instances (13 GB each).

I'm running out of heap space:

Exception in thread "handle-read-write-executor-1"
java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
    at org.apache.spark.network.nio.Message$.create(Message.scala:90)

while Spark is doing:

org.apache.spark.rdd.RDD.reduce(RDD.scala:865)
org.apache.spark.mllib.rdd.RDDFunctions.treeAggregate(RDDFunctions.scala:111)
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeColumnSummaryStatistics(RowMatrix.scala:379)
org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilarities(RowMatrix.scala:483)
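
If I'm reading the stack right, it's still in DIMSUM's first step,
computing the per-column L2 norms it samples with, i.e. roughly the
equivalent of this (using the names from the sketch above):

// columnSimilarities(0.5) first aggregates column statistics over
// all 5,138,702 rows to get the column norms
val colNorms = mat.computeColumnSummaryStatistics().normL2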

The Spark UI showed that the shuffle read for this task had reached
162.6 MB at that point.

I run spark-submit from the master node like this:

./spark/bin/spark-submit --executor-memory 13G .... --master spark://ec2....

I just wanted to check whether this is expected, since the matrix
doesn't seem excessively big. Is there a memory setting I'm missing?

Thanks,

 Clive
