I think having the option of seeding the factors from HDFS rather than initializing them randomly is a good one (or, more precisely, providing additional optional arguments initialUserFactors and initialItemFactors as RDD[(Int, Array[Double])]).
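For what it's worth, a rough sketch of how seeded factors might be loaded from HDFS and handed to ALS (the file layout, the loadFactors helper, and the extra ALS arguments are illustrative assumptions, not the current MLlib API):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: read previously written factors back from HDFS as
// RDD[(Int, Array[Double])]. Assumes one "id<TAB>v1,v2,..." record per line.
def loadFactors(sc: SparkContext, path: String): RDD[(Int, Array[Double])] = {
  sc.textFile(path).map { line =>
    val Array(id, values) = line.split('\t')
    (id.toInt, values.split(',').map(_.toDouble))
  }
}

// The proposal would then amount to something like (signature illustrative only):
//   ALS.train(ratings, rank, iterations, lambda,
//     initialUserFactors = Some(loadFactors(sc, "hdfs://.../userFactors")),
//     initialItemFactors = Some(loadFactors(sc, "hdfs://.../itemFactors")))
// with ALS falling back to the existing random initialization when these are None.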
On Mon, Apr 7, 2014 at 8:09 AM, Debasish Das <debasish.da...@gmail.com> wrote:

> Sorry, not persist...I meant adding a user parameter k which does a checkpoint
> after every k iterations...out of N ALS iterations...We have HDFS installed,
> so it's not a big deal...is there an issue with adding this user parameter in
> ALS.scala? If there is, then I can add it to our internal branch...
>
> For me the tipping point for k seems to be 4...With 4 iterations I can write
> out the factors...if I run with 10 iterations, after 4 I can see that it
> restarts the sparse matrix partition...tries to run all the iterations over
> again and fails due to an array index out of bounds, which does not seem like
> a real bug...
>
> Not sure if it can be reproduced on MovieLens, as the dataset I have is
> 25M x 3M (and counting)...while MovieLens is tall and thin...
>
> Another idea would be to give an option to restart ALS with previous
> factors...that way the ALS core algorithm does not need to change and it
> might be more useful...and that way we can point to a location from where the
> old factors can be loaded...I think @sean used a similar idea in Oryx
> generations...
>
> Let me know which way you guys prefer...I can add it in...
>
>
> On Sun, Apr 6, 2014 at 9:15 PM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> Btw, explicit ALS doesn't need persist because each intermediate factor is
>> only used once. -Xiangrui
>>
>> On Sun, Apr 6, 2014 at 9:13 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>>> The persist used in implicit ALS doesn't help the StackOverflow problem.
>>> Persist doesn't cut lineage. We need to call count() and then checkpoint()
>>> to cut the lineage. Did you try the workaround mentioned in
>>> https://issues.apache.org/jira/browse/SPARK-958:
>>>
>>> "I tune JVM thread stack size to 512k via option -Xss512k and it works."
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Sun, Apr 6, 2014 at 10:21 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>
>>>> At the head I see the persist option in implicitPrefs, but given more
>>>> cases like the ones mentioned above, why don't we use a similar technique
>>>> and take an input for which iterations we should persist in explicit runs
>>>> as well?
>>>>
>>>> for (iter <- 1 to iterations) {
>>>>   // perform ALS update
>>>>   logInfo("Re-computing I given U (Iteration %d/%d)".format(iter, iterations))
>>>>   products = updateFeatures(users, userOutLinks, productInLinks,
>>>>     partitioner, rank, lambda, alpha, YtY = None)
>>>>   logInfo("Re-computing U given I (Iteration %d/%d)".format(iter, iterations))
>>>>   users = updateFeatures(products, productOutLinks, userInLinks,
>>>>     partitioner, rank, lambda, alpha, YtY = None)
>>>> }
>>>>
>>>> Say I want to persist every k iterations out of N iterations of explicit
>>>> ALS; there should be an option to do that...implicit right now uses
>>>> persist at each iteration...
>>>>
>>>> Does this option make sense, or do you guys want this issue to be fixed
>>>> in a different way?
>>>>
>>>> I definitely see that for my 25M x 3M run, with 64 gb executor memory,
>>>> something is going wrong after the 5th iteration, and I wanted to run
>>>> for 10 iterations...
>>>>
>>>> So my k is 4/5 for this particular problem...
>>>>
>>>> I can ask for the PR after testing the fix on the dataset I have...I will
>>>> also try to see if we can make such datasets public for more research...
>>>>
>>>> For the LDA problem mentioned earlier in this email chain, k is 10...NMF
>>>> can generate topics similar to LDA as well...the Carrot2 project uses it...
>>>>
>>>> On Thu, Mar 27, 2014 at 3:20 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>
>>>>> Hi Matei,
>>>>>
>>>>> I am hitting similar problems with 10 ALS iterations...I am running with
>>>>> 24 gb executor memory on 10 nodes for a 20M x 3M matrix with rank = 50
>>>>>
>>>>> The first iteration of flatMaps runs fine, which means that the memory
>>>>> requirements are fine per iteration...
>>>>>
>>>>> If I do checkpointing on the RDD, most likely the remaining 9 iterations
>>>>> will also run fine and I will get the results...
>>>>>
>>>>> Is there a plan to add a checkpoint option to ALS for such large
>>>>> factorization jobs?
>>>>>
>>>>> Thanks.
>>>>> Deb
>>>>>
>>>>> On Tue, Jan 28, 2014 at 11:10 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>
>>>>>> That would be great to add. Right now it would be easy to change it to
>>>>>> use another Hadoop FileSystem implementation at the very least (I think
>>>>>> you can just pass the URL for that), but for Cassandra you'd have to
>>>>>> use a different InputFormat or some direct Cassandra access API.
>>>>>>
>>>>>> Matei
>>>>>>
>>>>>> On Jan 28, 2014, at 5:02 PM, Evan Chan <e...@ooyala.com> wrote:
>>>>>>
>>>>>>> By the way, is there any plan to make a pluggable backend for
>>>>>>> checkpointing? We might be interested in writing, for example, a
>>>>>>> Cassandra backend.
>>>>>>>
>>>>>>> On Sat, Jan 25, 2014 at 9:49 PM, Xia, Junluan <junluan....@intel.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> The description of this bug, submitted by Matei, is as follows:
>>>>>>>>
>>>>>>>> The tipping point seems to be around 50. We should fix this by
>>>>>>>> checkpointing the RDDs every 10-20 iterations to break the lineage
>>>>>>>> chain, but checkpointing currently requires HDFS installed, which not
>>>>>>>> all users will have.
>>>>>>>>
>>>>>>>> We might also be able to fix DAGScheduler to not be recursive.
>>>>>>>>
>>>>>>>> regards,
>>>>>>>> Andrew
>>>>>>>
>>>>>>> --
>>>>>>> Evan Chan
>>>>>>> Staff Engineer
>>>>>>> e...@ooyala.com |
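To make the checkpoint-every-k-iterations idea from the thread concrete, here is a minimal sketch of the lineage-cutting pattern (a generic iterative update in a spark-shell session, not the actual ALS.scala patch; the checkpoint directory and interval are placeholder assumptions):

import org.apache.spark.rdd.RDD

// Assumes the spark-shell `sc`; the checkpoint dir should live on HDFS (or another reliable FS).
sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")

val checkpointInterval = 4  // the "k" discussed above
var factors: RDD[(Int, Array[Double])] =
  sc.parallelize(0 until 1000).map(i => (i, Array.fill(50)(0.1)))

for (iter <- 1 to 10) {
  // Stand-in for the per-iteration ALS update (the updateFeatures calls in the loop quoted above).
  factors = factors.map { case (id, v) => (id, v.map(_ * 0.99)) }
  if (iter % checkpointInterval == 0) {
    factors.checkpoint()  // checkpoint() alone is lazy...
    factors.count()       // ...an action is needed to materialize it and truncate the lineage
  }
}

In ALS.scala the same guard would presumably go at the end of the for-loop body, after the two updateFeatures calls, so that both users and products get checkpointed and materialized every k iterations.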