Hi, I am trying to build a simple recommendation engine using Spark item similarity (e.g. with org.apache.mahout.math.cf.SimilarityAnalysis.cooccurrencesIDSs).
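For context, this is roughly what my setup looks like. It is a simplified sketch: the HDFS path and CSV column positions are placeholders for my actual parsing code, and I am calling cooccurrencesIDSs with its default downsampling parameters.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.sparkbindings._
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark

implicit val sc: SparkContext = new SparkContext(new SparkConf().setAppName("item-similarity"))

// Each CSV line is one view event; keep only the (userID, itemID) pair.
// Path and column positions are placeholders for my real data.
val viewEvents = sc.textFile("hdfs:///path/to/view-events/*.csv")
  .map(_.split(","))
  .map(fields => (fields(0), fields(1)))

// Build the user x item interaction matrix with string ID dictionaries.
val indexedViews = IndexedDatasetSpark(viewEvents)

// Single action type, so the result list holds one item-item similarity
// IndexedDataset (LLR-weighted cooccurrence), computed with defaults.
val itemSimilarities = SimilarityAnalysis.cooccurrencesIDSs(Array(indexedViews)).head
```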
Things work fine on a comparatively small dataset, but I am having difficulty scaling up. The input is CSV data containing 19,988,422 view-item events produced by 1,384,107 users across 5,135,845 distinct products. The CSV data is stored on HDFS and is split over 15 files, so the resulting RDD has 15 partitions.

After tweaking some parameters I did manage to get the job to run without going out of memory, but it takes a very, very long time. After running for 15 hours it is still stuck on:

    org.apache.spark.rdd.RDD.flatMap(RDD.scala:332)
    org.apache.mahout.sparkbindings.blas.AtA$.at_a_nongraph_mmul(AtA.scala:254)
    org.apache.mahout.sparkbindings.blas.AtA$.at_a(AtA.scala:61)
    org.apache.mahout.sparkbindings.SparkEngine$.tr2phys(SparkEngine.scala:325)
    org.apache.mahout.sparkbindings.SparkEngine$.tr2phys(SparkEngine.scala:339)
    org.apache.mahout.sparkbindings.SparkEngine$.toPhysical(SparkEngine.scala:123)
    org.apache.mahout.math.drm.logical.CheckpointAction.checkpoint(CheckpointAction.scala:41)
    org.apache.mahout.math.drm.package$.drm2Checkpointed(package.scala:95)
    org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:145)
    org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:143)
    scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    scala.collection.Iterator$class.foreach(Iterator.scala:727)
    scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
    scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
    scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    scala.collection.AbstractIterator.to(Iterator.scala:1157)
    scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
    scala.collection.AbstractIterator.toList(Iterator.scala:1157)

I am using Spark on YARN, and containers cannot use more than 16 GB. I figured I would be able to speed things up by throwing a larger number of executors at the problem, but so far that is not working out very well: I tried assigning 500 executors and repartitioning the input data to 500 partitions, and even setting spark.yarn.driver.memoryOverhead to crazy values (half of the heap) did not resolve this.

Could someone offer any guidance on how to best speed up item similarity jobs?
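Concretely, the knobs I have been turning look like the following. The exact memory values are only examples of what I tried; the 16 GB YARN container cap is the real constraint.

```scala
import org.apache.spark.SparkConf

// Settings I've been experimenting with; exact values are illustrative.
// YARN caps containers at 16 GB, so executor memory + overhead must fit under that.
val tunedConf = new SparkConf()
  .setAppName("item-similarity")
  .set("spark.executor.instances", "500")              // throw more executors at it
  .set("spark.executor.memory", "12g")
  .set("spark.yarn.executor.memoryOverhead", "4096")   // MB of off-heap headroom
  .set("spark.yarn.driver.memoryOverhead", "8192")     // the "half of the heap" value

// More partitions before handing the data to Mahout, so the 15 input
// splits don't limit parallelism:
val repartitionedEvents = viewEvents.repartition(500)
```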
