There is a general movement toward allowing initial models to be specified for
Spark ML algorithms, so I'll add a JIRA to that task set. I should be able
to work on this as well as other ALS improvements.

Oh, another reason fold-in is typically not done in Spark is that, for
models of any reasonable size, it is not really possible (or at best very
inefficient) to update a row (or a few rows) of a DataFrame, so it's better
done in the serving layer, either in memory and/or using some database
(often a NoSQL store of some kind). Though I've often thought about doing
this in a Streaming job, and with the new state management it could work
much better.
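To make the fold-in idea concrete, here's a minimal numpy sketch (my own toy code, not Spark or Oryx; the function name and regularization value are illustrative): for a new user, hold the item factors fixed and solve a single regularized least-squares problem, i.e. one half-step of ALS.

```python
import numpy as np

def fold_in_user(item_factors, rated_items, ratings, reg=0.1):
    """Compute a factor vector for a new user by solving one regularized
    least-squares problem against the fixed item factors."""
    Y = item_factors[rated_items]          # factors of the items this user rated
    r = np.asarray(ratings, dtype=float)
    k = item_factors.shape[1]
    # x = (Y^T Y + reg * I)^-1 Y^T r
    A = Y.T @ Y + reg * np.eye(k)
    b = Y.T @ r
    return np.linalg.solve(A, b)

# toy example: 4 items with rank-2 factors; a new user rated items 0 and 2
rng = np.random.default_rng(0)
item_factors = rng.normal(size=(4, 2))
x_new = fold_in_user(item_factors, [0, 2], [5.0, 3.0])
scores = item_factors @ x_new              # predicted scores for all items
```

The whole computation is a small dense solve over a handful of rows, which is exactly why it fits naturally in a serving layer rather than a distributed DataFrame update.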

On Fri, 11 Mar 2016 at 14:21 Sean Owen <so...@cloudera.com> wrote:

> On Fri, Mar 11, 2016 at 12:18 PM, Nick Pentreath
> <nick.pentre...@gmail.com> wrote:
> > In general, for serving situations MF models are stored in some other
> > serving system, so that system may be better suited to do the actual
> > fold-in. Sean's Oryx project does that, though I'm not sure offhand if
> > that part is done in Spark or not.
>
> (No this part isn't Spark; it's just manipulating arrays in memory.
> Making the model is done in Spark, as is marshalling the input from a
> Kafka topic.)
>
>
> > I know Sean's old Myrrix project also used to support computing ALS with
> > an initial set of input factors, so you could in theory incrementally
> > compute on new data. I'm not sure if the newer Oryx project supports it
> > though.
>
> (Yes, exactly the same thing exists in oryx)
>
>
> > @Sean, what are your thoughts on supporting an initial model (factors) in
> > ALS? I personally have always just recomputed the model, but for very
> > large scale stuff it can make a lot of sense obviously. What I'm not sure
> > on is whether it gives good solutions (relative to recomputing) - I'd
> > imagine it will tend to find a slightly better local minimum given a
> > previous local minimum starting point... with the advantage that new
> > users / items are incorporated. But of course users can do a full
> > recompute periodically.
>
> I'd prefer to be able to specify a model, since typically the initial
> model takes 20-40 iterations to converge to a reasonable state, and
> only needs a few more to converge to the same threshold given a
> relatively small number of additional inputs. The difference can be a
> lot of compute time.
>
> This is one of the few things that got worse when I moved to Spark
> since this capability was lost.
>
> I had been too lazy to actually implement it though. But that'd be cool.
>
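The warm-start argument above is easy to see in a toy dense ALS (a hypothetical sketch of my own, nothing like Spark's blocked implementation): passing previous factors as the starting point lets a couple of cheap iterations pick up where a fully converged run left off.

```python
import numpy as np

def als(R, rank=2, reg=0.1, iters=10, init_factors=None, seed=0):
    """Minimal dense ALS. `init_factors` optionally warm-starts from a
    previous (U, V) pair instead of random initialization."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    if init_factors is not None:
        U, V = (f.copy() for f in init_factors)
    else:
        U = rng.normal(scale=0.1, size=(n_users, rank))
        V = rng.normal(scale=0.1, size=(n_items, rank))
    I = reg * np.eye(rank)
    for _ in range(iters):
        # each half-step is a batched ridge regression with the other side fixed
        U = np.linalg.solve(V.T @ V + I, V.T @ R.T).T
        V = np.linalg.solve(U.T @ U + I, U.T @ R).T
    return U, V

# converge once from scratch, then refresh with a couple of cheap iterations
rng = np.random.default_rng(1)
R = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 6))   # rank-2 toy matrix
U0, V0 = als(R, iters=20)
U1, V1 = als(R, iters=2, init_factors=(U0, V0))         # warm start
```

Because each half-step minimizes the regularized objective exactly with the other side fixed, the objective is non-increasing, so the warm-started run can only continue improving on the converged model rather than starting its 20-40 iterations over.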
