RowSimilarityJob is the guts of the work, but ItemSimilarityJob is usually easier packaging for users.
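The weighting RowSimilarityJob applies is the log-likelihood ratio (LLR) discussed later in the thread. A minimal sketch of that score, using the standard 2x2 contingency-table form (a simplified stand-in for Mahout's LogLikelihood class, not its actual code):

```python
import math

def xlogx(x):
    # x * log(x), with the 0 * log(0) = 0 convention
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # unnormalized Shannon entropy of a list of counts
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table:
    k11 = cooccurrences of items A and B,
    k12 = occurrences of A without B,
    k21 = occurrences of B without A,
    k22 = neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

print(llr(100, 10, 10, 100))  # strongly associated pair -> large score
print(llr(10, 10, 10, 10))    # independent counts -> ~0
```

Items whose LLR score falls below a threshold are dropped, which is what sparsifies the indicator matrix.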
On Mon, Apr 21, 2014 at 1:00 PM, Pat Ferrel <[email protected]> wrote:

> Yes, the cooccurrence item similarity matrix is calculated with LLR using Mahout's RowSimilarityJob. I guess we are calling this an indicator matrix these days.
>
> The indicator matrix is then translated from a SequenceFile into a CSV (or other text-delimited file), which looks like a list of itemIDs (tokens or terms in Solr parlance) for each item. These documents are indexed by Solr, and the query is the user history.
>
> [B'B] is pre-calculated by RowSimilarityJob in Mahout. The user history is "multiplied" by the indicator matrix by using it as the Solr query against the indicator matrix, actually producing a cosine-similarity-ranked list of items.
>
> You have to squint a little to see the math. Any matrix product can be substituted with a row-to-column similarity metric, assuming dimensionality is correct, so the product in all the equations should be interpreted as such. To get recs for a user, [B'B]h is done in two phases: one calculates [B'B] and one is a Solr query that adds the h to the equation.
>
> In this project, https://github.com/pferrel/solr-recommender, both [B'B] and [A'B] are calculated; the latter uses an actual matrix multiply, since we did not have a cross-RSJ at the time. Now that we have a cross-cooccurrence in the Spark Scala Mahout 2 stuff, I'll rewrite the code to use it.
>
> The cross-indicator matrix allows you to use two different actions to predict a target action. So, for example, views that are similar to purchases can be used to recommend purchases. Take a look at the readme on github; it has a quick review of the theory.
>
> BTW, there is a video recommender site that demos some interesting uses of Solr to blend collaborative filtering recs with metadata. It even makes recs based off of your most recent detail views on the site.
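The two-phase [B'B]h computation Pat describes can be sketched on a toy user-by-item matrix (all data below is hypothetical; Mahout sparsifies [B'B] with LLR, which is skipped here for brevity):

```python
# B is a user-by-item interaction matrix (rows = users, cols = items).
B = [
    [1, 1, 0, 0],  # user 0: iphone, ipad
    [1, 1, 1, 0],  # user 1: iphone, ipad, galaxy
    [0, 0, 1, 1],  # user 2: galaxy, nexus
]

n_items = len(B[0])

# Phase 1 (offline, RowSimilarityJob): item-item cooccurrence matrix B'B.
BtB = [[sum(row[i] * row[j] for row in B) for j in range(n_items)]
       for i in range(n_items)]

# Phase 2 (query time): "multiply" the current user's history vector h
# by [B'B]. In the thread this step is a Solr query over the indicator
# fields, which returns a cosine-ranked list rather than the exact product.
h = [1, 1, 0, 0]  # this user has touched iphone and ipad
scores = [sum(BtB[i][j] * h[j] for j in range(n_items)) for i in range(n_items)]

print(scores)  # [4, 4, 2, 0]; already-seen items would then be filtered out
```

Solr's TF-IDF/cosine scoring reorders and rescales these values, but the ranking it produces stands in for the same product.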
> That last doesn't work all that well because it is really a cross-recommendation and that isn't built into the site yet. https://guide.finderbots.com
>
>
> On Apr 21, 2014, at 12:11 PM, Frank Scholten <[email protected]> wrote:
>
> Pat and Ted: I am late to the party, but this is very interesting!
>
> I am not sure I understand all the steps, though. Do you still create a cooccurrence matrix and compute LLR scores during this process, or do you only compute the matrix multiplication with the history vector: B'B * h and B'A * h?
>
> Cheers,
>
> Frank
>
>
> On Tue, Aug 13, 2013 at 7:49 PM, Pat Ferrel <[email protected]> wrote:
>
>> I finally got some time to work on this and have a first cut at output to Solr working on the github repo. It only works on 2-action input, but I'll have that cleaned up soon so it will work with one action. Solr indexing has not been tested yet, and the field names and/or types may need tweaking.
>>
>> It takes the result of the previous drop:
>> 1) DRMs for B (user history of B items, action1) and A (user history of A items, action2)
>> 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
>>
>> There are two final outputs created using mapreduce but requiring 2 in-memory hashmaps. I think this will work on a cluster (the hashmaps are instantiated on each node) but haven't tried yet. It orders items in #2 fields by strength of "link", which is the similarity value used in [B'B] or [B'A]. It would be nice to order #1 by recency, but there is no provision for passing through timestamps at present, so they are ordered by strength of preference. This is probably not useful and so can be ignored. Ordering by recency might be useful for truncating queries by recency while leaving the training data containing 100% of available history.
>> 1) It joins #1 DRMs to produce a single set of docs in CSV form, which looks like this:
>> id,history_b,history_a
>> user1,iphone ipad,iphone ipad galaxy
>> ...
>>
>> 2) It joins #2 DRMs to produce a single set of docs in CSV form, which looks like this:
>> id,b_b_links,b_a_links
>> u1,iphone ipad,iphone ipad galaxy
>> ...
>>
>> It may work on a cluster; I haven't tried yet. As soon as someone has some large-ish sample log files I'll give them a try. Check the sample input files in the resources dir for the format.
>>
>> https://github.com/pferrel/solr-recommender
>>
>>
>> On Aug 13, 2013, at 10:17 AM, Pat Ferrel <[email protected]> wrote:
>>
>> When I started looking at this I was a bit skeptical. As a search engine Solr may be peerless, but as yet another NoSQL db?
>>
>> However, getting further into this I see one very large benefit. It has one feature that sets it completely apart from the typical NoSQL db: the type of queries you do return fuzzy results, in the very best sense of that word. The most interesting queries are based on similarity to some exemplar. Results are returned in order of similarity strength, not ordered by a sort field.
>>
>> Wherever similarity-based queries are important I'll look at Solr first. SolrJ looks like an interesting way to get Solr queries on POJOs. It's probably at least an alternative to using docs and CSVs to import the data from Mahout.
>>
>>
>> On Aug 12, 2013, at 2:32 PM, Ted Dunning <[email protected]> wrote:
>>
>> Yes. That would be interesting.
>>
>>
>> On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan <[email protected]> wrote:
>>
>>> A little digression: might a Matrix implementation backed by a Solr index that uses SolrJ for querying help at all for the Solr recommendation approach?
>>>
>>> It supports multiple fields of String, Text, or boolean flags.
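The joined CSV docs shown above (one doc per user, space-delimited item tokens per field) could be produced with a sketch like the following. The field names and the single toy user come from the example; everything else is hypothetical:

```python
import csv
import io

# Hypothetical per-action user histories; in the real pipeline these
# come from joining the B and A DRMs by user ID.
history_b = {"user1": ["iphone", "ipad"]}           # action 1 items
history_a = {"user1": ["iphone", "ipad", "galaxy"]}  # action 2 items

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["id", "history_b", "history_a"])
for user in sorted(set(history_b) | set(history_a)):
    writer.writerow([
        user,
        " ".join(history_b.get(user, [])),  # tokens for the Solr field
        " ".join(history_a.get(user, [])),
    ])

print(out.getvalue())
```

The #2 join (b_b_links / b_a_links) has the same shape, except the tokens come from the similarity-matrix rows and are ordered by link strength.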
>>> Best,
>>> Gokhan
>>>
>>>
>>> On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel <[email protected]> wrote:
>>>
>>>> Also a question about user history.
>>>>
>>>> I was planning to write these into separate directories so Solr could fetch them from different sources, but it occurs to me that it would be better to join A and B by user ID and output a doc per user ID with three fields: id, A item history, and B item history. Other fields could be added for user metadata.
>>>>
>>>> Sound correct? This is what I'll do unless someone stops me.
>>>>
>>>> On Aug 7, 2013, at 11:25 AM, Pat Ferrel <[email protected]> wrote:
>>>>
>>>> Once you have a sample or example of what you think the "log file" version will look like, can you post it? It would be great to have example lines for two actions with or without the same item IDs. I'll make sure we can digest it.
>>>>
>>>> I thought more about the ingest part, and I don't think the one-item-space is actually a problem. It just means one item dictionary. A and B will have the right content; all I have to do is make sure the right ranks are input to the MM, Transpose, and RSJ. This in turn is only one extra count of the # of items in A's item space. This should be a very easy change if my thinking is correct.
>>>>
>>>>
>>>> On Aug 7, 2013, at 8:09 AM, Ted Dunning <[email protected]> wrote:
>>>>
>>>> On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel <[email protected]> wrote:
>>>>
>>>>> 4) To add more metadata to the Solr output will be left to the consumer for now. If there is a good data set to use, we can illustrate how to do it in the project. Ted may have some data for this from musicbrainz.
>>>>
>>>> I am working on this issue now.
>>>>
>>>> The current state is that I can bring in a bunch of track names and links to artist names and so on.
>>>> This would provide the basic set of items (artists, genres, tracks, and tags).
>>>>
>>>> There is a hitch in bringing in the data needed to generate the logs, since that part of MB is not Apache-compatible. I am working on that issue.
>>>>
>>>> Technically, the data is in a massively normalized relational form right now, but it isn't terribly hard to denormalize into the form that we need.
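To close the loop on the query side of the thread: the user's history becomes the Solr query against the indicator field. A sketch of building such a query URL, where the host, core name, and the b_b_links field are assumptions taken from the CSV format earlier in the thread, not a real deployment:

```python
from urllib.parse import urlencode

# The current user's recent history for the target action (hypothetical).
user_history = ["iphone", "ipad"]

# Query the b_b_links indicator field with the history items as free
# terms; Solr's default OR semantics plus TF-IDF/cosine scoring give a
# similarity-ranked approximation of the [B'B]h product.
params = urlencode({
    "q": "b_b_links:(%s)" % " ".join(user_history),
    "fl": "id,score",
    "rows": "10",
})
query_url = "http://localhost:8983/solr/indicators/select?" + params
print(query_url)
```

A cross-recommendation query would simply add the second action's history against the b_a_links field in the same request.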
