Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

Debasish Das Thu, 27 Mar 2014 15:22:00 -0700

Hi Matei,

I am hitting similar problems with 10 ALS iterations. I am running with 24 GB
of executor memory on 10 nodes, on a 20M x 3M matrix with rank = 50.


The first iteration of flatMaps runs fine, which suggests that the
per-iteration memory requirements are satisfied.

If I checkpoint the RDDs, the remaining 9 iterations will most likely also
run fine and I will get the results.
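
To illustrate why checkpointing should help, here is a toy Python sketch. It
is not Spark code: `Node`, `build_chain`, `lineage_depth` and `walk_overflows`
are hypothetical names, and the recursive walk only stands in for a recursive
scheduler. In Spark itself the real calls would be
`SparkContext.setCheckpointDir` followed by `RDD.checkpoint()`.

```python
class Node:
    """Toy stand-in for an RDD: it only remembers its parent (its lineage)."""

    def __init__(self, parent=None):
        self.parent = parent

    def checkpoint(self):
        # Pretend the data was written to reliable storage. The parent link
        # is no longer needed, so the lineage chain is cut at this node.
        self.parent = None


def lineage_depth(node):
    """Walk the lineage recursively, as a recursive scheduler would."""
    if node.parent is None:
        return 1
    return 1 + lineage_depth(node.parent)


def build_chain(steps, checkpoint_every=None):
    """Apply `steps` transformations, optionally checkpointing every N steps."""
    node = Node()
    for i in range(1, steps + 1):
        node = Node(parent=node)
        if checkpoint_every and i % checkpoint_every == 0:
            node.checkpoint()
    return node


def walk_overflows(node):
    """Return True if the recursive walk exceeds Python's recursion limit."""
    try:
        lineage_depth(node)
        return False
    except RecursionError:
        return True
```

In this model an unbroken chain longer than the interpreter's recursion limit
makes `walk_overflows` return True, while the same number of steps with
`checkpoint_every=100` keeps the walk at most about 100 levels deep.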

Is there a plan to add a checkpoint option to ALS for such large
factorization jobs?

Thanks.
Deb

On Tue, Jan 28, 2014 at 11:10 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> That would be great to add. Right now it would be easy to change it to use
> another Hadoop FileSystem implementation at the very least (I think you can
> just pass the URL for that), but for Cassandra you'd have to use a
> different InputFormat or some direct Cassandra access API.
>
> Matei
>
> On Jan 28, 2014, at 5:02 PM, Evan Chan <e...@ooyala.com> wrote:
>
> > By the way, is there any plan to make a pluggable backend for
> > checkpointing? We might be interested in writing, for example, a
> > Cassandra backend.
> >
> > On Sat, Jan 25, 2014 at 9:49 PM, Xia, Junluan <junluan....@intel.com> wrote:
> >> Hi all
> >>
> >> The description of this bug, as submitted by Matei, is as follows:
> >>
> >>
> >> The tipping point seems to be around 50. We should fix this by
> >> checkpointing the RDDs every 10-20 iterations to break the lineage chain,
> >> but checkpointing currently requires HDFS installed, which not all users
> >> will have.
> >>
> >> We might also be able to fix DAGScheduler to not be recursive.
> >>
> >>
> >> regards,
> >> Andrew
> >>
> >
> >
> >
> > --
> > Evan Chan
> > Staff Engineer
> > e...@ooyala.com  |
>
>
