I think having the option of seeding the factors from HDFS rather than initializing them randomly is a good one (or, more precisely, providing additional optional arguments initialUserFactors and initialItemFactors as RDD[(Int, Array[Double])]).
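For what it's worth, a rough sketch of how seeded factors might be loaded from HDFS and handed to ALS (the file layout, the loadFactors helper, and the extra ALS arguments are illustrative assumptions, not the current MLlib API):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: read previously written factors back from HDFS as
// RDD[(Int, Array[Double])]. Assumes one "id<TAB>v1,v2,..." record per line.
def loadFactors(sc: SparkContext, path: String): RDD[(Int, Array[Double])] = {
  sc.textFile(path).map { line =>
    val Array(id, values) = line.split('\t')
    (id.toInt, values.split(',').map(_.toDouble))
  }
}

// The proposal would then amount to something like (signature illustrative only):
//   ALS.train(ratings, rank, iterations, lambda,
//     initialUserFactors = Some(loadFactors(sc, "hdfs://.../userFactors")),
//     initialItemFactors = Some(loadFactors(sc, "hdfs://.../itemFactors")))
// with ALS falling back to the existing random initialization when these are None.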
On Mon, Apr 7, 2014 at 8:09 AM, Debasish Das <debasish.da...@gmail.com> wrote:

> Sorry, not persist...I meant adding a user parameter k which does a checkpoint
> after every k iterations...out of N ALS iterations...We have HDFS installed,
> so it's not a big deal...is there an issue with adding this user parameter in
> ALS.scala? If there is, then I can add it to our internal branch...
>
> For me the tipping point for k seems to be 4...With 4 iterations I can write
> out the factors...if I run with 10 iterations, after 4 I can see that it
> restarts the sparse matrix partition...tries to run all the iterations over
> again and fails due to an array index out of bounds, which does not seem like
> a real bug...
>
> Not sure if it can be reproduced on MovieLens, as the dataset I have is
> 25M x 3M (and counting)...while MovieLens is tall and thin...
>
> Another idea would be to give an option to restart ALS with previous
> factors...that way the ALS core algorithm does not need to change and it
> might be more useful...and that way we can point to a location from where the
> old factors can be loaded...I think @sean used a similar idea in Oryx
> generations...
>
> Let me know which way you guys prefer...I can add it in...
>
>
> On Sun, Apr 6, 2014 at 9:15 PM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> Btw, explicit ALS doesn't need persist because each intermediate factor is
>> only used once. -Xiangrui
>>
>> On Sun, Apr 6, 2014 at 9:13 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>>> The persist used in implicit ALS doesn't help the StackOverflow problem.
>>> Persist doesn't cut lineage. We need to call count() and then checkpoint()
>>> to cut the lineage. Did you try the workaround mentioned in
>>> https://issues.apache.org/jira/browse/SPARK-958:
>>>
>>> "I tune JVM thread stack size to 512k via option -Xss512k and it works."
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Sun, Apr 6, 2014 at 10:21 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>
>>>> At the head I see the persist option in implicitPrefs, but given more
>>>> cases like the ones mentioned above, why don't we use a similar technique
>>>> and take an input for which iterations we should persist in explicit runs
>>>> as well?
>>>>
>>>> for (iter <- 1 to iterations) {
>>>>   // perform ALS update
>>>>   logInfo("Re-computing I given U (Iteration %d/%d)".format(iter, iterations))
>>>>   products = updateFeatures(users, userOutLinks, productInLinks,
>>>>     partitioner, rank, lambda, alpha, YtY = None)
>>>>   logInfo("Re-computing U given I (Iteration %d/%d)".format(iter, iterations))
>>>>   users = updateFeatures(products, productOutLinks, userInLinks,
>>>>     partitioner, rank, lambda, alpha, YtY = None)
>>>> }
>>>>
>>>> Say I want to persist every k iterations out of N iterations of explicit
>>>> ALS; there should be an option to do that...implicit right now uses
>>>> persist at each iteration...
>>>>
>>>> Does this option make sense, or do you guys want this issue to be fixed
>>>> in a different way?
>>>>
>>>> I definitely see that for my 25M x 3M run, with 64 gb executor memory,
>>>> something is going wrong after the 5th iteration, and I wanted to run
>>>> for 10 iterations...
>>>>
>>>> So my k is 4/5 for this particular problem...
>>>>
>>>> I can ask for the PR after testing the fix on the dataset I have...I will
>>>> also try to see if we can make such datasets public for more research...
>>>>
>>>> For the LDA problem mentioned earlier in this email chain, k is 10...NMF
>>>> can generate topics similar to LDA as well...the Carrot2 project uses it...
>>>>
>>>> On Thu, Mar 27, 2014 at 3:20 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>
>>>>> Hi Matei,
>>>>>
>>>>> I am hitting similar problems with 10 ALS iterations...I am running with
>>>>> 24 gb executor memory on 10 nodes for a 20M x 3M matrix with rank = 50
>>>>>
>>>>> The first iteration of flatMaps runs fine, which means that the memory
>>>>> requirements are fine per iteration...
>>>>>
>>>>> If I do checkpointing on the RDD, most likely the remaining 9 iterations
>>>>> will also run fine and I will get the results...
>>>>>
>>>>> Is there a plan to add a checkpoint option to ALS for such large
>>>>> factorization jobs?
>>>>>
>>>>> Thanks.
>>>>> Deb
>>>>>
>>>>> On Tue, Jan 28, 2014 at 11:10 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>
>>>>>> That would be great to add. Right now it would be easy to change it to
>>>>>> use another Hadoop FileSystem implementation at the very least (I think
>>>>>> you can just pass the URL for that), but for Cassandra you'd have to
>>>>>> use a different InputFormat or some direct Cassandra access API.
>>>>>>
>>>>>> Matei
>>>>>>
>>>>>> On Jan 28, 2014, at 5:02 PM, Evan Chan <e...@ooyala.com> wrote:
>>>>>>
>>>>>>> By the way, is there any plan to make a pluggable backend for
>>>>>>> checkpointing? We might be interested in writing, for example, a
>>>>>>> Cassandra backend.
>>>>>>>
>>>>>>> On Sat, Jan 25, 2014 at 9:49 PM, Xia, Junluan <junluan....@intel.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> The description of this bug, submitted by Matei, is as follows:
>>>>>>>>
>>>>>>>> The tipping point seems to be around 50. We should fix this by
>>>>>>>> checkpointing the RDDs every 10-20 iterations to break the lineage
>>>>>>>> chain, but checkpointing currently requires HDFS installed, which not
>>>>>>>> all users will have.
>>>>>>>>
>>>>>>>> We might also be able to fix DAGScheduler to not be recursive.
>>>>>>>>
>>>>>>>> regards,
>>>>>>>> Andrew
>>>>>>>
>>>>>>> --
>>>>>>> Evan Chan
>>>>>>> Staff Engineer
>>>>>>> e...@ooyala.com |
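To make the checkpoint-every-k-iterations idea from the thread concrete, here is a minimal sketch of the lineage-cutting pattern (a generic iterative update in a spark-shell session, not the actual ALS.scala patch; the checkpoint directory and interval are placeholder assumptions):

import org.apache.spark.rdd.RDD

// Assumes the spark-shell `sc`; the checkpoint dir should live on HDFS (or another reliable FS).
sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")

val checkpointInterval = 4  // the "k" discussed above
var factors: RDD[(Int, Array[Double])] =
  sc.parallelize(0 until 1000).map(i => (i, Array.fill(50)(0.1)))

for (iter <- 1 to 10) {
  // Stand-in for the per-iteration ALS update (the updateFeatures calls in the loop quoted above).
  factors = factors.map { case (id, v) => (id, v.map(_ * 0.99)) }
  if (iter % checkpointInterval == 0) {
    factors.checkpoint()  // checkpoint() alone is lazy...
    factors.count()       // ...an action is needed to materialize it and truncate the lineage
  }
}

In ALS.scala the same guard would presumably go at the end of the for-loop body, after the two updateFeatures calls, so that both users and products get checkpointed and materialized every k iterations.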