Hi again,
Seeing the answers to this question and to the other one I posted ("adjusted
cosine similarity for item-based recommender?"), I think I should clarify what
I'm trying to achieve and why I (believe I should) do things the way I do.
I'm taking a class called "Learning from User-Generated Data". Our first
assignment deals with analysing the results of various types of recommenders;
I'll go as far as to say "old-school" recommenders, given the content of your
answers.
We have been introduced to:
* Memory-based:
- user-based
- item-based (*with* adjusted cosine similarity!)
- slope-one
- graph-based transitivity
* Model-based:
- preprocessed item/user-based (this is unclear to me, but I haven't reached
that part of the assignment yet, so I'll search for information before I ask
questions; I also found an article that listed slope-one among the model-based
methods, so I guess I'll need to do more research on this)
- matrix factorization-based (I saw that SVD is available in Mahout; my
project partner is looking into that right now; see the setup sketch after
this list)
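
To make sure we're at least using the library as intended, here is roughly how
I'm wiring things up so far. This is a rough, untested sketch: the file name,
the neighborhood size and the factorization parameters are only placeholders
I picked.

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSetup {
  public static void main(String[] args) throws Exception {
    // "train.csv" is a placeholder for the userId,movieId,preference training file
    DataModel model = new FileDataModel(new File("train.csv"));

    // memory-based: classic user-based recommender
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(25, similarity, model);
    Recommender userBased = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // model-based: matrix factorization via ALS (10 features, lambda 0.05, 10 iterations)
    Recommender svdBased = new SVDRecommender(model, new ALSWRFactorizer(model, 10, 0.05, 10));

    // placeholder IDs, just to show the call we use for the second dataset
    System.out.println(userBased.estimatePreference(1L, 10L));
    System.out.println(svdBased.estimatePreference(1L, 10L));
  }
}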
We have a *static* training dataset (800.000 <user,movie,preference> triples)
and another static dataset for which we have to extract the predicted
preferences (200.000 <user,movie> tuples) and write them back to a file (i.e.
recompose the <user,movie,preference> triples). Note that this will never go
into a production environment, as it is merely a university requirement. For
the same reason, I would prefer not to mix things up too much, and I'd rather
learn step by step (i.e. focus on Mahout for now, before I dig deeper and
check the search-based approach, which uses DB-Mahout-Solr-Spark... maybe a
bit too much to handle at once with the deadline we were given).
So let me get back to my original questions (again, I'm sorry for being
stubborn, but I'm under specific constraints; I'll really try to understand
the search-based approach when I have more time) ;)
1. I'm guessing that to implement an adjusted cosine similarity I should extend
AbstractSimilarity (or maybe even AbstractRecommender?). Is this right?
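
In case it helps to answer: here is the kind of thing I had in mind (an
untested draft, all names and details below are just mine). Instead of
extending AbstractSimilarity it implements the ItemSimilarity interface
directly, since adjusted cosine centers each rating on the *user's* mean
rating:

import java.util.Collection;
import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class AdjustedCosineSimilarity implements ItemSimilarity {

  private final DataModel model;
  private final FastByIDMap<Double> userMeans;

  public AdjustedCosineSimilarity(DataModel model) throws TasteException {
    this.model = model;
    this.userMeans = new FastByIDMap<>(model.getNumUsers());
    // precompute each user's mean rating once: adjusted cosine centers on it
    LongPrimitiveIterator users = model.getUserIDs();
    while (users.hasNext()) {
      long userID = users.nextLong();
      PreferenceArray prefs = model.getPreferencesFromUser(userID);
      double sum = 0.0;
      for (int i = 0; i < prefs.length(); i++) {
        sum += prefs.getValue(i);
      }
      userMeans.put(userID, sum / prefs.length());
    }
  }

  @Override
  public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
    double dot = 0.0;
    double norm1 = 0.0;
    double norm2 = 0.0;
    // iterate over users who rated item1 and keep those who also rated item2
    PreferenceArray prefs1 = model.getPreferencesForItem(itemID1);
    for (int i = 0; i < prefs1.length(); i++) {
      long userID = prefs1.getUserID(i);
      Float value2 = model.getPreferenceValue(userID, itemID2);
      if (value2 == null) {
        continue; // this user did not rate both items
      }
      double mean = userMeans.get(userID);
      double x = prefs1.getValue(i) - mean;
      double y = value2 - mean;
      dot += x * y;
      norm1 += x * x;
      norm2 += y * y;
    }
    if (norm1 == 0.0 || norm2 == 0.0) {
      return Double.NaN; // no overlap, or zero variance around the user means
    }
    return dot / Math.sqrt(norm1 * norm2);
  }

  @Override
  public double[] itemSimilarities(long itemID1, long[] itemID2s) throws TasteException {
    double[] result = new double[itemID2s.length];
    for (int i = 0; i < itemID2s.length; i++) {
      result[i] = itemSimilarity(itemID1, itemID2s[i]);
    }
    return result;
  }

  @Override
  public long[] allSimilarItemIDs(long itemID) {
    throw new UnsupportedOperationException(); // not needed for estimatePreference
  }

  @Override
  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    // no-op in this draft; a real implementation should recompute the user means
  }
}

If extending AbstractSimilarity is the more idiomatic route, I'd be happy to
hear how the per-user mean centering fits in there.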
2. I still can't believe that it should take more than, at most, a few minutes
to go through my 200.000 lines and look up the already-calculated preferences.
What am I doing wrong? :/ Should I store my whole DataModel in a file (how?)
and then read through the file? I don't see how that could be faster than just
reading the exact value I'm searching for...
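
For completeness, here is a stripped-down version of the kind of loop I mean
(a sketch with placeholder file names, not my exact code). The writing itself
is just a BufferedWriter, which is why I suspect all the time goes into
estimatePreference:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class WriteBackPreferences {

  // reads userId,movieId tuples from inPath, appends the estimated
  // preference to each line, and writes the triples to outPath
  public static void writeEstimates(Recommender recommender, String inPath,
      String outPath) throws Exception {
    try (BufferedReader in =
             Files.newBufferedReader(Paths.get(inPath), StandardCharsets.UTF_8);
         BufferedWriter out =
             Files.newBufferedWriter(Paths.get(outPath), StandardCharsets.UTF_8)) {
      String line;
      int count = 0;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split(",");
        long userID = Long.parseLong(parts[0]);
        long movieID = Long.parseLong(parts[1]);
        float estimate = recommender.estimatePreference(userID, movieID);
        out.write(line + "," + estimate);
        out.newLine();
        if (++count % 10000 == 0) {
          System.out.println(count + " lines processed");
        }
      }
    }
  }
}

If the user-based estimatePreference really is the bottleneck, would wrapping
the similarity in a CachingUserSimilarity (so repeated user pairs are not
recomputed) be a reasonable mitigation, short of switching approaches?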
Thanks again for your answers! Regards,
Pier Lorenzo
--------------------------------------------
On Fri, 4/3/15, Ted Dunning <[email protected]> wrote:
Subject: Re: fast performance way of writing preferences to file?
To: "[email protected]" <[email protected]>
Date: Friday, April 3, 2015, 5:52 PM
Are you sure that the problem is writing the results? It seems to me that the
real problem is the use of a user-based recommender.

For such a small data set, for instance, a search-based recommender will be
able to make recommendations in less than a millisecond, with multiple
recommendations possible in parallel. This should allow you to do 200,000
recommendations in a few minutes on a single machine.

With such a small dataset, indicator-based methods may not be the best option.
To improve that, try using something larger such as the million song dataset.
See http://labrosa.ee.columbia.edu/millionsong/

Also, using and estimating ratings is not a particularly good thing to be
doing if you want to build a real recommender.
On Fri, Apr 3, 2015 at 3:26 AM, PierLorenzo Bianchini
<[email protected]> wrote:

> Hello everyone,
> I'm new to Mahout, to recommender systems and to the mailing list.
>
> I'm trying to find a (fast) way to write back preferences to a file. I
> tried a few methods but I'm sure there must be a better approach.
> Here's the deal (you can find the same post on Stack Overflow [1]).
> I have a training dataset of 800.000 records from 6000 users rating
> 3900 movies. These are stored in a comma-separated file like:
> userId,movieId,preference. I have another dataset (200.000 records) in
> the format: userId,movieId. My goal is to use the first dataset as a
> training set, in order to determine the missing preferences of the
> second set.
>
> So far, I managed to load the training dataset and I generated
> user-based recommendations. This is pretty smooth and doesn't take too
> much time. But I'm struggling when it comes to writing back the
> recommendations.
>
> The first method I tried is:
> * read a line from the file and get the userId,movieId tuple
> * retrieve the calculated preference with
>   estimatePreference(userId, movieId)
> * append the preference to the line and save it in a new file
>
> This works, but it's incredibly slow (I added a counter to print every
> 10.000th iteration: after a couple of minutes it had only printed once.
> I have 8GB RAM with an i7 core... how long can it take to process
> 200.000 lines?!)
>
> My second choice was:
> * create a new FileDataModel with the second dataset
> * do something like this:
>   newDataModel.setPreference(userId, movieId,
>       recommender.estimatePreference(userId, movieId));
>
> Here I get several problems:
> * at runtime: java.lang.UnsupportedOperationException (as I found out
>   in [2], FileDataModel actually can't be updated. I don't understand
>   why the function setPreference exists in the first place...)
> * the API doc of FileDataModel#setPreference states "This method should
>   also be considered relatively slow."
>
> I read around that a solution would be to use delta files, but I
> couldn't find out what that actually means. Any suggestion on how I
> could speed up my writing-the-preferences process?
> Thank you!
>
> Pier Lorenzo
>
> [1] http://stackoverflow.com/questions/29423824/mahout-fast-performance-how-to-write-preferences-to-file
> [2] http://comments.gmane.org/gmane.comp.apache.mahout.user/11330