You should simply use a snapshot built from HEAD of github.com/apache/spark if you can. The key change is in MLlib, and with any luck you can just replace that bit. See the PR I referenced.
Sure, with enough memory you can get it to run even with the memory issue, but it could be hundreds of GB at your scale. Not sure I take the point about the JVM; you can give it 64GB of heap and executors can use that much, sure. You could reduce the number of features a lot to work around it too, or reduce the input size. (If anyone saw my blog post about StackOverflow and ALS -- that's why I snuck in a relatively paltry 40 features and pruned questions with <4 tags :) )

I don't think jblas has anything to do with it per se, and the allocation fails in Java code, not native code. This should be exactly what that PR I mentioned fixes.

--
Sean Owen | Director, Data Science | London


On Sun, Mar 16, 2014 at 11:48 AM, Debasish Das <debasish.da...@gmail.com> wrote:

> Thanks Sean... let me get the latest code. Do you know which PR it was?
>
> But will the executors run fine with, say, 32 gb or 64 gb of memory? Doesn't the JVM show issues when the max memory goes beyond a certain limit?
>
> Also, the failure is due to GC limits from jblas... and I was thinking that jblas would call native malloc, right? Maybe 64 gb is not a big deal then... I will try increasing to 32 and then 64...
>
> java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)
> org.jblas.DoubleMatrix.<init>(DoubleMatrix.java:323)
> org.jblas.DoubleMatrix.zeros(DoubleMatrix.java:471)
> org.jblas.DoubleMatrix.zeros(DoubleMatrix.java:476)
> com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$17.apply(ALSQR.scala:366)
> com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$17.apply(ALSQR.scala:366)
> scala.Array$.fill(Array.scala:267)
> com.verizon.bigdata.mllib.recommendation.ALSQR.updateBlock(ALSQR.scala:366)
> com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$com$verizon$bigdata$mllib$recommendation$ALSQR$$updateFeatures$2.apply(ALSQR.scala:346)
> com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$com$verizon$bigdata$mllib$recommendation$ALSQR$$updateFeatures$2.apply(ALSQR.scala:345)
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:32)
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:32)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:149)
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:147)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:147)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:32)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:32)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
> org.apache.spark.scheduler.Task.run(Task.scala:53)
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>
>
> On Sun, Mar 16, 2014 at 11:42 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> Are you using HEAD or 0.9.0? I know there was a memory issue fixed a few weeks ago that made ALS need a lot more memory than necessary.
>>
>> https://github.com/apache/incubator-spark/pull/629
>>
>> Try the latest code.
>>
>> --
>> Sean Owen | Director, Data Science | London
>>
>>
>> On Sun, Mar 16, 2014 at 11:40 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I gave my Spark job 16 gb of memory and it is running on 8 executors.
>>>
>>> The job needs more memory due to the ALS requirements (a 20M x 1M matrix).
>>>
>>> On each node I have 96 gb of memory and I am using 16 gb of it. I want to increase the memory, but I am not sure what the right way to do that is...
>>>
>>> On 8 executors, if I give 96 gb it might be an issue due to GC...
>>>
>>> Ideally, on 8 nodes I would run with 48 executors and each executor would get 16 gb of memory... 48 JVMs in total...
>>>
>>> Is it possible to increase the number of executors per node?
>>>
>>> Thanks.
>>> Deb
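
For anyone hitting the same GC overhead limit, here is a minimal sketch of the two workarounds discussed above (fewer latent features and a larger executor heap), assuming the Spark 0.9-era MLlib ALS API and a standalone cluster. The master URL, input path, rank, iteration count, and lambda are placeholders, not values taken from this thread. In standalone mode, the number of workers per node is controlled by SPARK_WORKER_INSTANCES in conf/spark-env.sh rather than from application code.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    object ALSMemorySketch {
      def main(args: Array[String]): Unit = {
        // Placeholder master URL and an example heap size; tune both to the cluster.
        val conf = new SparkConf()
          .setAppName("ALS with a reduced rank")
          .setMaster("spark://master:7077")
          .set("spark.executor.memory", "32g")
        val sc = new SparkContext(conf)

        // Placeholder input: one "user,product,rating" triple per line.
        val ratings = sc.textFile("hdfs:///path/to/ratings.csv").map { line =>
          val Array(user, product, rating) = line.split(',')
          Rating(user.toInt, product.toInt, rating.toDouble)
        }

        // Fewer latent features shrinks the factor blocks that DoubleMatrix.zeros
        // was allocating in the trace above; 40 echoes the rank Sean mentions.
        val model = ALS.train(ratings, 40, 10, 0.01)

        println("user factor rows: " + model.userFeatures.count())
        sc.stop()
      }
    }

Note that DoubleMatrix.zeros in the trace allocates a plain Java double[] on the JVM heap (jblas only drops to native code for the BLAS calls), which is why raising spark.executor.memory, and ultimately picking up the fix in the PR linked above, is the relevant knob rather than native memory.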