Sean - was this merged into the 0.9 branch as well? (It seems so based
on the message from rxin.) If so, it might make sense to try out the
head of branch-0.9 as well, unless there are *also* other changes
relevant to this in master.

- Patrick

On Sun, Mar 16, 2014 at 12:24 PM, Sean Owen <so...@cloudera.com> wrote:
> You should simply use a snapshot built from HEAD of github.com/apache/spark
> if you can. The key change is in MLlib and with any luck you can just
> replace that bit. See the PR I referenced.
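
(As an aside, a hedged sketch of what "replace that bit" could look like in an
sbt build, assuming a snapshot of apache/spark master has been published to the
local repository, e.g. with sbt publish-local; the version strings below are
placeholders, not details from this thread.)

    // build.sbt -- a sketch only; versions are placeholders and assume a
    // snapshot of apache/spark master has been published locally.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "0.9.0-incubating",
      // swap just the MLlib artifact for the locally published snapshot
      "org.apache.spark" %% "spark-mllib" % "1.0.0-SNAPSHOT"
    )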
>
> Sure with enough memory you can get it to run even with the memory issue,
> but it could be hundreds of GB at your scale. Not sure I take the point
> about the JVM; you can give it 64GB of heap and executors can use that much,
> sure.
>
> You could reduce the number of features a lot to work around it too, or
> reduce the input size. (If anyone saw my blog post about StackOverflow and
> ALS -- that's why I snuck in a relatively paltry 40 features and pruned
> questions with <4 tags :) )
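
(For illustration, a rough sketch of what dialing the feature count down looks
like against MLlib's ALS API; the ratings RDD and the parameter values below
are placeholders, not numbers from this thread.)

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // ratings is a placeholder: an RDD of (user, product, rating) triples
    // built elsewhere in the job.
    def trainWithFewerFeatures(ratings: RDD[Rating]) = {
      val rank = 40          // fewer latent features => smaller factor matrices
      val iterations = 10
      val lambda = 0.01
      // Each factor matrix holds (numUsers or numProducts) x rank doubles,
      // so shrinking rank shrinks memory roughly linearly.
      ALS.train(ratings, rank, iterations, lambda)
    }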
>
> I don't think jblas has anything to do with it per se, and the allocation
> fails in Java code, not native code. This should be exactly what that PR I
> mentioned fixes.
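
(To make the Java-versus-native point concrete: jblas backs each DoubleMatrix
with a plain Java double[] on the JVM heap and only calls native BLAS for the
arithmetic, which is why the failure surfaces as a GC error rather than a
failed malloc. A tiny illustration, with arbitrary sizes.)

    import org.jblas.DoubleMatrix

    // DoubleMatrix.zeros allocates rows*cols doubles on the JVM heap,
    // not in native memory, so large or frequent allocations pressure the GC.
    val m = DoubleMatrix.zeros(1000, 1000)
    println(m.data.length)   // 1000000 -- a plain Java double[] (~8 MB of heap)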
>
> --
> Sean Owen | Director, Data Science | London
>
>
> On Sun, Mar 16, 2014 at 11:48 AM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>>
>> Thanks Sean...let me get the latest code...do you know which PR it was?
>>
>> But will the executors run fine with, say, 32 GB or 64 GB of memory? Doesn't
>> the JVM show issues when the max memory goes beyond a certain limit...
>>
>> Also the failure is due to GC limits from jblas...and I was thinking that
>> jblas is going to call native malloc, right? Maybe 64 GB is not a big deal
>> then...I will try increasing to 32 and then 64...
>>
>> java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit
>> exceeded)
>>
>> org.jblas.DoubleMatrix.<init>(DoubleMatrix.java:323)
>> org.jblas.DoubleMatrix.zeros(DoubleMatrix.java:471)
>> org.jblas.DoubleMatrix.zeros(DoubleMatrix.java:476)
>> com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$17.apply(ALSQR.scala:366)
>> com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$17.apply(ALSQR.scala:366)
>> scala.Array$.fill(Array.scala:267)
>> com.verizon.bigdata.mllib.recommendation.ALSQR.updateBlock(ALSQR.scala:366)
>> com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$com$verizon$bigdata$mllib$recommendation$ALSQR$$updateFeatures$2.apply(ALSQR.scala:346)
>> com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$com$verizon$bigdata$mllib$recommendation$ALSQR$$updateFeatures$2.apply(ALSQR.scala:345)
>> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:32)
>> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:32)
>> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:149)
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:147)
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:147)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
>> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:32)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
>> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:32)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
>> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>> org.apache.spark.scheduler.Task.run(Task.scala:53)
>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>>
>>
>>
>> On Sun, Mar 16, 2014 at 11:42 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>> Are you using HEAD or 0.9.0? I know there was a memory issue fixed a few
>>> weeks ago that made ALS need a lot more memory than necessary.
>>>
>>> https://github.com/apache/incubator-spark/pull/629
>>>
>>> Try the latest code.
>>>
>>> --
>>> Sean Owen | Director, Data Science | London
>>>
>>>
>>> On Sun, Mar 16, 2014 at 11:40 AM, Debasish Das <debasish.da...@gmail.com>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I gave my Spark job 16 GB of memory and it is running on 8 executors.
>>>>
>>>> The job needs more memory due to ALS requirements (a 20M x 1M matrix).
>>>>
>>>> On each node I have 96 GB of memory and I am only using 16 GB of it. I
>>>> want to increase the memory, but I am not sure what the right way to do
>>>> that is...
>>>>
>>>> With 8 executors, if I give each one 96 GB it might be an issue due to GC...
>>>>
>>>> Ideally, on 8 nodes I would run 48 executors, with each executor getting
>>>> 16 GB of memory...48 JVMs in total...
>>>>
>>>> Is it possible to increase the number of executors per node?
>>>>
>>>> Thanks.
>>>> Deb
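
(For reference, a hedged sketch of the two knobs that usually control this in
Spark 0.9 standalone mode; the master URL, app name, and memory value below are
placeholders, not values from this thread.)

    import org.apache.spark.{SparkConf, SparkContext}

    // Raise the heap of each executor JVM (placeholder values).
    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("ALS")
      .set("spark.executor.memory", "32g")
    val sc = new SparkContext(conf)

    // To run several smaller executors per node rather than one big one,
    // standalone mode reads SPARK_WORKER_INSTANCES and SPARK_WORKER_MEMORY
    // from conf/spark-env.sh on each machine before the workers are started.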
>>>
>>>
>>
>
