@Colin: you're asking the million-dollar question, one that a lot of people
are trying to answer.  This was literally the most-asked question in every
city on my recent worldwide meetup tour.

I've been pointing people to the streaming-matrix-factorization project
written by Burak, a former Databricks co-worker of mine:
https://github.com/brkyvz/streaming-matrix-factorization

He got tired of everyone asking about this and cranked it out over a
weekend.  Love that guy!  :)

I've attempted (unsuccessfully, so far) to deploy exactly what you're
trying to do here:
https://github.com/fluxcapacitor/pipeline/blob/master/myapps/streaming/src/main/scala/com/advancedspark/streaming/rating/ml/TrainMFIncremental.scala

We're a couple of pull requests away from making this happen.  You can see
my comments and the open GitHub issues for the remaining bits.

And this will be my focus in the next week or so as I prepare for an
upcoming conference.  Keep an eye on this repo if you'd like.

@Sean:  thanks for the link.  I knew Oryx was doing this somehow - and I
kept meaning to see how you were doing it.  I'll likely incorporate some of
your stuff into my final solution.


On Thu, Mar 10, 2016 at 3:35 PM, Sean Owen <so...@cloudera.com> wrote:

> While it isn't crazy, I am not sure how valid it is to build a model
> off of only a chunk of recent data and then merge it into another
> model in any direct way. They're not really sharing a basis, so you
> can't just average them.
>
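
To make the "not sharing a basis" point concrete, here is a minimal sketch
in plain Python. It is not Spark or Oryx code; the vectors, the 90-degree
rotation, and every helper name are made up for illustration. Rotating both
factor matrices by the same rotation leaves every prediction unchanged, so
two independently trained models can agree on predictions while their
vectors point in unrelated directions. Averaging such vectors then changes
the predictions:

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rotate(vec, theta):
    """Rotate a 2-D vector counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = vec
    return [c * x - s * y, s * x + c * y]


def average(a, b):
    """Element-wise mean of two equal-length vectors."""
    return [(p + q) / 2.0 for p, q in zip(a, b)]


# Hypothetical "model A": one user vector and one item vector.
user_a, item_a = [2.0, 1.0], [1.0, 3.0]

# Hypothetical "model B": the same model expressed in a rotated basis.
theta = math.pi / 2
user_b, item_b = rotate(user_a, theta), rotate(item_a, theta)

pred_a = dot(user_a, item_a)          # prediction from model A
pred_b = dot(user_b, item_b)          # same prediction from model B
pred_avg = dot(average(user_a, user_b), average(item_a, item_b))

print(pred_a, pred_b, pred_avg)
```

In this toy case both models predict about 5.0 for the pair, while the
averaged vectors predict about 2.5.
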
> My experience with this aspect suggests you should try to update the
> existing model in place on the fly. In short, you figure out how much
> the new input ought to change your estimate of the (user,item)
> association. Positive interactions should increase it a bit, etc. Then
> you work out how the item vector would change if the user vector were
> fixed in order to accomplish that change, with a bit of linear
> algebra. Vice versa for user vector. Of course, those changes affect
> the rest of the matrix too but that's the 'approximate' bit.
>
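
A rough sketch of the in-place update described above, in plain Python.
This is a simplification under stated assumptions, not the Oryx
implementation: it applies the smallest (L2) change to one vector that hits
a target dot product while the other vector stays fixed. The names
nudge_toward, fold_in, and learning_rate are hypothetical, and the rule for
choosing the target is just one plausible choice.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nudge_toward(vec, fixed, target):
    """Return a copy of vec, changed by the smallest L2 amount such that
    dot(new_vec, fixed) == target, with `fixed` held constant."""
    norm_sq = dot(fixed, fixed)
    if norm_sq == 0.0:
        # A zero vector cannot move the dot product; leave vec unchanged.
        return list(vec)
    step = (target - dot(vec, fixed)) / norm_sq
    return [v + step * f for v, f in zip(vec, fixed)]


def fold_in(user_vec, item_vec, strength, learning_rate=0.1):
    """Approximate update for one new (user, item) interaction.

    `strength` is the value the interaction suggests for the association;
    the current estimate is moved a fraction `learning_rate` toward it.
    (Assumed rule, for illustration only.)
    """
    current = dot(user_vec, item_vec)
    target = current + learning_rate * (strength - current)
    # Each vector is updated as if the other one were fixed.
    new_item = nudge_toward(item_vec, user_vec, target)
    new_user = nudge_toward(user_vec, item_vec, target)
    return new_user, new_item


if __name__ == "__main__":
    new_user, new_item = fold_in([1.0, 0.0], [0.5, 0.5], strength=1.0)
    print(new_user, new_item)
```

Each updated vector hits the target only against the other's old value;
the effect on the rest of the matrix is ignored, which is the
"approximate" part.
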
> I happen to have an implementation of this in the context of a
> Spark ALS model, though the raw source code may be hard to read. If it's
> of interest we can discuss offline (or online here to the extent it's
> relevant to Spark users).
>
>
> https://github.com/OryxProject/oryx/blob/91004a03413eef0fdfd6e75a61b68248d11db0e5/app/oryx-app/src/main/java/com/cloudera/oryx/app/speed/als/ALSSpeedModelManager.java#L192
>
> On Thu, Mar 10, 2016 at 8:01 PM, Colin Woodbury <coli...@gmail.com> wrote:
> > Hi there, I'm wondering if it's possible (or feasible) to combine the
> > feature matrices of two MatrixFactorizationModels that share a user and
> > product set.
> >
> > Specifically, one model would be the "on-going" model, and the other is
> > one trained only on the most recent aggregation of some event data. My
> > overall goal is to try to approximate "online" training, as ALS doesn't
> > support streaming, and it also isn't possible to "seed" the ALS training
> > process with an already trained model.
> >
> > Since the two Models would share a user/product ID space, can their
> > feature matrices be merged? For instance via:
> >
> > 1. Adding feature vectors together for user/product vectors that appear
> >    in both models
> > 2. Averaging said vectors instead
> > 3. Some other linear algebra operation
> >
> > Unfortunately, I'm fairly ignorant as to the internal mechanics of ALS
> > itself. Is what I'm asking possible?
> >
> > Thank you,
> > Colin
>


-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
