There is a general movement toward allowing initial models to be specified for Spark ML algorithms, so I'll add a JIRA to that task set. I should be able to work on this along with other ALS improvements.
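To make that concrete, here is a rough sketch of how a warm-start flow might look. Note that `setInitialModel` is the proposed (currently hypothetical) addition under discussion, not an existing Spark ML method, and the column names and iteration counts are illustrative:

```scala
import org.apache.spark.ml.recommendation.{ALS, ALSModel}
import org.apache.spark.sql.DataFrame

// Cold start: a full run needs many iterations to converge from random factors.
def coldStart(ratings: DataFrame): ALSModel =
  new ALS()
    .setRank(50)
    .setMaxIter(30)
    .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
    .fit(ratings)

// Warm start: seed from the previous model's user/item factors and run only
// a few iterations on the grown dataset. setInitialModel is the hypothetical
// parameter this thread is proposing.
def warmStart(previous: ALSModel, updatedRatings: DataFrame): ALSModel =
  new ALS()
    .setRank(50)
    .setMaxIter(5)
    .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
    .setInitialModel(previous)  // hypothetical parameter
    .fit(updatedRatings)
```

Presumably the rank would have to match between the two runs, and users/items absent from the previous model would fall back to random initialization.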
Oh, another reason fold-in is typically not done in Spark is that for models of any reasonable size, it is not really feasible (or at best very inefficient) to update a row (or a few rows) of a DataFrame, so it's better done in the serving layer, either in memory and/or using some database (often a NoSQL store of some kind). Though I've often thought about doing this in a Streaming job, and with the new state management it could work much better.

On Fri, 11 Mar 2016 at 14:21 Sean Owen <so...@cloudera.com> wrote:

> On Fri, Mar 11, 2016 at 12:18 PM, Nick Pentreath
> <nick.pentre...@gmail.com> wrote:
> > In general, for serving situations MF models are stored in some other
> > serving system, so that system may be better suited to do the actual
> > fold-in. Sean's Oryx project does that, though I'm not sure offhand if
> > that part is done in Spark or not.
>
> (No, this part isn't Spark; it's just manipulating arrays in memory.
> Making the model is done in Spark, as is marshalling the input from a
> Kafka topic.)
>
> > I know Sean's old Myrrix project also used to support computing ALS
> > with an initial set of input factors, so you could in theory
> > incrementally compute on new data. I'm not sure if the newer Oryx
> > project supports it though.
>
> (Yes, exactly the same thing exists in Oryx.)
>
> > @Sean, what are your thoughts on supporting an initial model (factors)
> > in ALS? I personally have always just recomputed the model, but for
> > very large scale stuff it can make a lot of sense obviously. What I'm
> > not sure on is whether it gives good solutions (relative to
> > recomputing) - I'd imagine it will tend to find a slightly better
> > local minimum given a previous local minimum starting point... with
> > the advantage that new users / items are incorporated. But of course
> > users can do a full recompute periodically.
>
> I'd prefer to be able to specify a model, since typically the initial
> model takes 20-40 iterations to converge to a reasonable state, and
> only needs a few more to converge to the same threshold given a
> relatively small number of additional inputs. The difference can be a
> lot of compute time.
>
> This is one of the few things that got worse when I moved to Spark
> since this capability was lost.
>
> I had been too lazy to actually implement it though. But that'd be cool.
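For reference, the fold-in discussed above reduces to a small per-user ridge-regression solve against the fixed item factors: x_u = (Y_u^T Y_u + lambda*I)^-1 Y_u^T r_u, where Y_u holds the factor vectors of the items the user rated. A minimal Scala sketch using Breeze; the function name and the assumption that factors are fetched from a serving store are illustrative, not code from Oryx or Spark:

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

/**
 * Fold in one user's new ratings against fixed item factors:
 *   x_u = (Y_u^T Y_u + lambda * I)^-1 Y_u^T r_u
 *
 * @param itemFactors factor vectors (each of length `rank`) for the items
 *                    this user rated, looked up from the serving store
 * @param ratings     the corresponding rating values
 */
def foldInUser(itemFactors: Array[Array[Double]],
               ratings: Array[Double],
               rank: Int,
               lambda: Double): Array[Double] = {
  val n = itemFactors.length
  val y = DenseMatrix.zeros[Double](n, rank)   // n x rank item-factor matrix
  for (i <- 0 until n; j <- 0 until rank) y(i, j) = itemFactors(i)(j)
  val r = DenseVector(ratings)                 // length-n rating vector
  val gram = (y.t * y) + (DenseMatrix.eye[Double](rank) * lambda)
  val rhs = y.t * r
  (gram \ rhs).toArray                         // solve the rank x rank system
}
```

Since this is only a rank-by-rank solve per user, it is cheap enough to run per-request in the serving layer, which is exactly why updating rows of a Spark DataFrame is unnecessary for this step.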