Re: User similarity in Mahout

Pat Ferrel Sun, 03 Jan 2016 09:34:32 -0800

Your problem will be that there isn’t enough cooccurrence between users since, 
well, how many jobs can any one user apply for and how likely is another user 
to apply for the same or overlapping jobs? The JDs have a short lifetime and so 
don’t lend themselves to the older single action recommenders. The 
cooccurrences you show below are probably optimistic. I know this from public 
statements made by CareerBuilder (CB), not to mention direct experience with a 
similar use case.

I’d expect collaborative filtering based on any one action, like "applying for 
a job", to give very poor results for you. CB tried this and got some decent 
results only for people with a large number of applications, but this was a 
small % of cases.

So their solution was a content-based recommender that basically matched 
resumes to job descriptions based on content similarity. To get this to work 
well you may need things like NLP to get named entities or at least a robust 
gazetteer that knows a large number of brand and technology names. There are 
also parsing services that will extract info from resumes. This is a long and 
somewhat complicated path and has little to do with Mahout.
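
A minimal sketch of the content-similarity idea, assuming plain token counts 
and cosine similarity stand in for real NLP or a gazetteer. The stop-word list, 
sample texts, and function names are hypothetical and not taken from CB's 
system:

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list and sample texts, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "with", "to"}

def tokenize(text):
    """Lowercase, keep alphanumeric runs (plus + and #), drop stop words."""
    return [t for t in re.findall(r"[a-z0-9+#]+", text.lower())
            if t not in STOP_WORDS]

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_jobs(resume, jobs):
    """Return job ids sorted by descending similarity to the resume."""
    r = Counter(tokenize(resume))
    scores = {jid: cosine(r, Counter(tokenize(text)))
              for jid, text in jobs.items()}
    return sorted(scores, key=scores.get, reverse=True)

resume = "Java developer with Hadoop and Spark experience"
jobs = {
    "J1": "Senior Java engineer for Spark and Hadoop pipelines",
    "J2": "Registered nurse for pediatric ward",
}
print(rank_jobs(resume, jobs))  # J1 shares three tokens, J2 shares none
```

A real system would likely replace the raw tokens with extracted entities 
(skills, brands, technologies), which is where most of the effort goes.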

A much simpler path is to use cross-cooccurrence with the newer 
SimilarityAnalysis.cooccurrence part of Mahout-Samsara that runs on Spark. It 
will allow you to use many more user actions, ones that may give more overlap 
between user activity. This is collaborative filtering but can ingest user 
actions that are different from “apply”, and whose targets are not restricted 
to Job Descriptions.
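
To make cross-cooccurrence concrete, here is a small sketch in plain Python 
rather than Mahout itself. It counts how often users who "apply" to one JD also 
"view" another, then scores a pair with a log-likelihood ratio (LLR), which is, 
as I understand it, the kind of test Mahout uses to keep only the significant 
pairs. The data and all function names are hypothetical:

```python
import math
from collections import defaultdict
from itertools import product

# Hypothetical logs: user -> set of JD ids, one dict per action type.
applies = {"u1": {"J1", "J2"}, "u2": {"J1"}, "u3": {"J2"}}
views = {"u1": {"J1", "J3"}, "u2": {"J3", "J4"}, "u3": {"J2"}, "u4": {"J3"}}

def cross_cooccurrence(primary, secondary):
    """Count users with the primary action on p and the secondary action on s."""
    counts = defaultdict(int)
    for user, p_items in primary.items():
        for p, s in product(p_items, secondary.get(user, set())):
            counts[(p, s)] += 1
    return dict(counts)

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def llr_score(p, s, primary, secondary):
    """Build the 2x2 table for (primary on p, secondary on s) and score it."""
    users = set(primary) | set(secondary)
    did_p = {u for u in users if p in primary.get(u, ())}
    did_s = {u for u in users if s in secondary.get(u, ())}
    k11 = len(did_p & did_s)
    k12 = len(did_p - did_s)
    k21 = len(did_s - did_p)
    k22 = len(users) - k11 - k12 - k21
    return llr(k11, k12, k21, k22)

apply_view = cross_cooccurrence(applies, views)
print(apply_view[("J1", "J3")])  # 2 users applied to J1 and viewed J3
print(llr_score("J1", "J3", applies, views))
```

Passing the same dict twice, cross_cooccurrence(applies, applies), gives the 
ordinary single-action cooccurrence that the older recommenders rely on.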

In this case you have or may be able to collect the following indicators of 
user preference: 
1) user-id, “apply”, job-description-id: from an actual application. “Apply” is 
what you want people to do, so it’s the closest indicator of user preference, 
assuming you don’t have information about whether they were accepted for a job, 
which might be even better.
2) user-id, “view”, job-description-id: from when a user reads the details of a 
JD
3) user-id, “category-preference”, category-id: again taken when a user “view”s 
a JD but the target of the action is the category of the JD, not the JD itself
4) user-id, “job-title-preference”, job-title-token: Take the job title and 
tokenize it, then feed in each token (minus stop words) as if they were “tags”. 
This could be taken when a user “view”s a JD
5) user-id, “other-JD-meta”, metadata-id: this could be anything about the JD 
that you know and is collected for users that “view” the JD. If you have tags, 
this would be a good way to use them.

You may also have user profile info taken from their resume, for instance their 
current job title. These can be encoded:
6) user-id, “current-title”, job-title: here it might be necessary to tokenize 
and feed each token in unless you have some standardized list of titles. This 
is taken when a user enters their information into your app.
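
As a rough illustration of the encoding, the sketch below expands a single 
“view” of a JD into the (user-id, action, target) triples of indicators 2-5. 
The JD record shape, the stop-word list, and the function names are 
assumptions made up for this example:

```python
# Hypothetical stop words and JD record shape, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for"}

def title_tokens(title):
    """Split a job title on whitespace and drop stop words."""
    return [t for t in title.lower().split() if t not in STOP_WORDS]

def events_for_view(user_id, jd):
    """Expand one "view" of a JD into (user-id, action, target) triples."""
    events = [
        (user_id, "view", jd["id"]),
        (user_id, "category-preference", jd["category"]),
    ]
    events += [(user_id, "job-title-preference", tok)
               for tok in title_tokens(jd["title"])]
    events += [(user_id, "other-JD-meta", tag) for tag in jd.get("tags", [])]
    return events

jd = {
    "id": "J1",
    "category": "engineering",
    "title": "Senior Engineer of Data",
    "tags": ["spark"],
}
for event in events_for_view("u1", jd):
    print(event)
```

Each action name then becomes its own input to the cooccurrence analysis, with 
“apply” kept as the primary action.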

The idea is to find many ways that users of your system can have data that is 
in common with other users. Then the recommender (I’ll describe next) will use 
a signal like “job-title-preference” or “view” even in cases where the user has 
never applied for a job and so would have none of the data you mention.
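
A simplified sketch of how that plays out for a user with no applications. It 
assumes a precomputed model that lists, for each JD, the targets of each action 
found to correlate with applying to that JD, and it scores JDs by plain overlap 
with the user's history. This is an illustrative stand-in, not the Universal 
Recommender's actual scoring, and every name and value in it is hypothetical:

```python
# Hypothetical model: for each JD, the targets of each action that were found
# to correlate with applying to that JD.
model = {
    "J1": {"view": {"J1", "J3"}, "job-title-preference": {"java", "engineer"}},
    "J2": {"view": {"J2"}, "job-title-preference": {"nurse"}},
}

# A user who has never applied, only viewed JDs.
history = {"view": {"J3"}, "job-title-preference": {"engineer", "data"}}

def recommend(history, model):
    """Rank JDs by how many of the user's signals match each JD's indicators."""
    scores = {}
    for jd, indicators in model.items():
        score = sum(len(history.get(action, set()) & targets)
                    for action, targets in indicators.items())
        if score:
            scores[jd] = score
    return sorted(scores, key=scores.get, reverse=True)

print(recommend(history, model))  # only J1 matches this user's signals
```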

As far as I know the only end-to-end, mostly off-the-shelf implementation of 
this that uses Mahout is the Universal Recommender here: 
https://github.com/actionml/template-scala-parallel-universal-recommendation 
It is built on the PredictionIO Framework described here: https://prediction.io 
It supports any number of the “secondary” indicators (things like #2-#6) and is 
integrated with an event store and recommendation server. The Mahout docs for 
the command line version of cooccurrence analysis are here (in case you want to 
build your own framework): 
http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html

I seriously doubt the older Hadoop-based Mahout recommenders will help since 
they can only use one indicator.

> On Jan 3, 2016, at 7:01 AM, Peter K <[email protected]> wrote:
> 
> Hi all,
> 
> I'm trying to implement a recommender based 
> on Mahout to recommend jobs for users. 
> There are 2 actions - an user applied for a job or 
> viewed a job. In terms of weight I'm using 5 for 
> an apply and 2 for a view.
> 
> Now I'm trying to find best user similarity to capture 
> these relations.
> For example:
> User1 applied to jobs: J1,J2,J3,J4,J5
> User2 applied to jobs: J1,J2,J3,J4,J6
> User3 applied to jobs: J1, J7
> 
> When using Euclidean distance similarity if I'm not mistaken 
> users 2 and 3 are equal (when 
> calculating similarity to User1). But I feel User2 is more similar 
> and thus J6 should be 
> higher in the recommendations than J7.
> 
> Generally, I'm looking into more suggestions what algorithms 
> might be the best for this 
> case.
> 
> Thank you very much for any suggestions.
> 
> P.
>
