hing to
find similar documents.
Best,
Sebastian
On 13.07.2012 18:03, Pat Ferrel wrote:
I increased the timeout to 100 minutes and added another machine (does
the new machine matter in this case?). The job completed successfully.
You say the algorithm is non-scalable--did you mean it's not
know
what we are dealing with.
Jeff
On 7/13/12 3:58 PM, Pat Ferrel wrote:
OK but I can't find it. It doesn't seem to be listed in the "mahout"
CLI help. Maybe there's some way to tell the script to execute an
arbitrary driver?
Anyway I just wrote a few lines to execut
ere a
better venue than the mahout list?
On 7/13/12 9:41 PM, Ken Krugler wrote:
Hi Pat,
On Jul 13, 2012, at 12:47pm, Pat Ferrel wrote:
I also do clustering so that's an obvious optimization I just haven't gotten
to yet (doing similar only on docs clustered together). I'm al
'll look deeper into MoreLikeThis. In our use case we'll be taking the
TFIDF terms weights from a doc and reweighting some terms based on a
user gesture.
On 7/17/12 8:22 PM, Ken Krugler wrote:
Hi Pat,
On Jul 14, 2012, at 8:17am, Pat Ferrel wrote:
Interesting.
I have another requirement, which
Here is a quick walkthrough for doing kmeans clustering and looking at
the input and output.
https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
Be aware that some command line params have changed since it was written
for 0.6. For instance
I have the same requirement. The distances are scalars/magnitudes.
Without considering their direction you cannot assume what you say below.
Take a look at RowSimilarity. This calculates the distance from each
document to the others and you can specify how many close ones to find. It
is not a nicely
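The idea RowSimilarityJob implements can be sketched in a few lines of plain Python (an illustration only, not the Mahout map/reduce code; the function names and the choice of cosine as the measure are mine):

```python
import math

def cosine(u, v):
    # Cosine similarity of two dense vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def row_similarity(rows, max_per_row=2):
    # For every row (document vector), rank all other rows by similarity
    # and keep only the max_per_row closest ones, as (row index, score).
    result = {}
    for i, u in enumerate(rows):
        sims = [(j, cosine(u, v)) for j, v in enumerate(rows) if j != i]
        sims.sort(key=lambda p: p[1], reverse=True)
        result[i] = sims[:max_per_row]
    return result
```

The real job does the same pairwise work distributed over the matrix rows, with a pluggable similarity class and a --maxSimilaritiesPerRow-style cap.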
I've been changing the hadoop/conf/mapred-site.xml :
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
    <description>map heap size for child task</description>
  </property>
This ups the task heap to 2G.
On 7/26/12 7:12 PM, Lance Norskog wrote:
Increase the memory size or split the file!
On Thu, Jul 26, 2012 at 5:37 AM, pricila r
You may want to look at Bixo (openbixo.org), which is a web crawler
built on hadoop.
There is a little extension to it that parses into plain text using
boilerpipe (removes boilerplate text from pages) and Tika. The crawler
will take a list of URLs and filter them with regex's (in or out). It
I need to create groups of items that are similar to a seed item. This
seed item may be a synthetic vector or may be based on a real document
but it is known before the group is created. It may also contain
weighted features that are not terms. There are several ways to do this
mentioned below.
We do what Ted describes by tossing frequently used terms with the IDF max,
tossing stop words and stemming with a lucene analyzer. The stemming makes the
tags less readable for sure but without it the near duplicate terms make for a
strange looking tag list. With or without stemming the top TFI
The way back from stem to tag is interesting from the standpoint of making tags
human readable. I had assumed a lookup but this seems much more satisfying and
flexible. In order to keep frequencies it will take something like a dictionary
creation step in the analyzer. This in turn seems to impl
Quoth Grant Ingersoll:
> To put this in bin/mahout speak, this would look like, munging some names and
> taking liberties with the actual argument to be passed in:
>
> bin/mahout svd (original -> svdOut)
> bin/mahout cleansvd ...
> bin/mahout transpose svdOut -> svdT
> bin/mahout transpose o
Stochastic+Singular+Value+Decomposition
should help to clarify outputs and usage.
On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov wrote:
> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel wrote:
>> Quoth Grant Ingersoll:
>>> To put this in bin/mahout speak, this would look like, mungin
astic effects, but that is not what it is for really.
Assuming your input is m x n, can you tell me please what your m, n, k
and p are?
thanks.
-D
On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel wrote:
> There seems to be some internal constraint on k and/or p, which is making a
> test diffic
actually is
> before it actually iterates over it and runs into block size
> deficiency. So if you know m as external knowledge, it is easy to
> avoid being trapped by block height deficiency.
>
>
> On Fri, Aug 10, 2012 at 11:32 AM, Pat Ferrel wrote:
>> This is only
.e. essentially
run a PCA space transformation on your data rather than just SVD)
-d
On Fri, Aug 10, 2012 at 11:57 AM, Pat Ferrel wrote:
> Got it. Well on to some real and much larger data sets then…
>
> On Aug 10, 2012, at 11:53 AM, Dmitriy Lyubimov wrote:
>
> i think actu
Switching from API to CLI
the parameter -t is described in the PDF
--reduceTasks optional. The number of reducers to use (where
applicable): depends on the size of the hadoop cluster. At this point it could
also be overwritten by a standard hadoop property using -D option
4. Probably always
d be worth noticing, even if we don't know the actual error in it. To say
that your estimate of VR is valueless would require that we have some
experience with it, no?
On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov wrote:
On Aug 18, 2012 8:32 AM, "Pat Ferrel" wrote:
>
> Swi
between how I ran it from the API and from the CLI.
On Aug 18, 2012, at 7:29 PM, Pat Ferrel wrote:
-t Param
I'm no hadoop expert but there are a couple parameters for each node in a
cluster that specify the default number of mappers and reducers for that
node. There is a rule of thumb
if it is using compression).
>
> I am not quite sure what you mean by "rowid" processing.
>
>
>
> On Sun, Aug 19, 2012 at 7:40 PM, Pat Ferrel wrote:
>> Getting an odd error on SSVD.
>>
>> Starting with the QJob I get 9 map tasks for the data set
ug 20, 2012, at 8:23 AM, Dmitriy Lyubimov wrote:
On Aug 19, 2012 1:06 AM, "Pat Ferrel" wrote:
>
> -t Param
>
> I'm no hadoop expert but there are a couple parameters for each node in a
cluster that specify the default number of mappers and reducers for that
node. Ther
3:42:45 PM PDT
To: Pat Ferrel
Bottom line, external tools need to arrive at the offset and the
solver just accept any offset (mean or otherwise) as a parameter with
this api: solver.setPcaMeanPath(xiPath)
xi is the mean (usually denoted by mu but in my working notes i had a
conflict with
Running on the local file system inside IDEA with MAHOUT_LOCAL set and
performing an SSVD I get the error below. Notice that R-m-0 exists in the
local file system and running it outside the debugger in pseudo-cluster mode
with HDFS works. Does SSVD work in local mode?
java.io.FileNotFoundEx
ot
> require Hadoop dependencies but it is a different api with no PCA
> option (not sure about power iterations).
>
> I am not sure why this very particular error appears in your setup.
>
> On Fri, Aug 31, 2012 at 3:02 PM, Pat Ferrel wrote:
>> Running on the local file syst
hadoop tmp based files.
On Sep 1, 2012, at 7:53 AM, Ted Dunning wrote:
With 57 crawled docs, you can't reasonably set p > 57. That is your second
error.
On Sat, Sep 1, 2012 at 10:32 AM, Pat Ferrel wrote:
> I have a small data set that I am using in local mode for debugging
> purp
n your case.
On Sep 1, 2012 7:39 AM, "Pat Ferrel" wrote:
> I have a small data set that I am using in local mode for debugging
> purposes. The data is 57 crawled docs with something like 2200 terms. I run
> this through seq2sparse, then my own cloned version of rowid to get a
confusion between k and p (I was confused as well) you still
can't set the sum to more than the minimum size of your data. Here you
have set it larger. And it breaks.
On Sat, Sep 1, 2012 at 11:09 AM, Pat Ferrel wrote:
> Oh, sorry, below I meant to say k (the number to reduce to) not p.
>
11 AM, Dmitriy Lyubimov wrote:
Another guess i have is that perhaps you used relative paths when
specifying temp dir? Try to use absolute paths.
On Sep 1, 2012 10:09 AM, "Pat Ferrel" wrote:
> Yes, I understand why #2 failed. I guess I'm asking how to get this to
> succeed. Withou
Trying to do dimensionality reduction with SSVD then running the new doc matrix
through kmeans.
The Lanczos + ClusterDump test of SVD + kmeans uses A-hat = A^t V^t.
Unfortunately this results in anonymous vectors in clusteredPoints after A-hat
is run through kmeans. The doc ids are lost due to
--pca option.
As Ted suggests, you may also use US^0.5 which is already produced by
providing --uHalfSigma (or its embedded setter analog). the keys of
that output (produced by getUPath() call) will already contain your
Text document ids as sequence file keys.
-d
On Wed, Sep 5, 2012 at 5:20 PM
When using Lanczos the recommendation is to use clean eigenvectors as a
distributed row matrix--call it V.
A-hat = A^t V^t this per the clusterdump tests DSVD and DSVD2.
Dmitriy and Ted recommend when using SSVD to do:
A-hat = US
When using PCA it's also preferable to use --uHalfSigma to crea
To reiterate the situation. In local mode using the local file system SSVD dies
with a file not found. In pseudo-cluster mode using hdfs SSVD on the same data
it runs correctly. All the rest of the analysis pipeline works fine in either
mode. I am using local mode to debug my surrounding code.
rything
> else works there just the same.
>
> That said, you can disable DistributedCache in some cases using
> SSVDSolver#setBroadcast(false). (in spite of what javadoc says, it is
> enabled by default... my bad).
>
> On Thu, Sep 6, 2012 at 11:18 AM, Pat Ferrel wrote:
>
U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance
proportions for doing clustering and similarity (though not sure if this is
strictly required with cosine distance) I probably want to use U * Sigma
inste
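Since Sigma is diagonal, the "write your own multiply" above never needs a full matrix product; it is just a per-column scaling of U. A sketch (illustrative Python, not the Mahout API; function names are mine):

```python
import math

def scale_columns(U, sv):
    # (U * Sigma)[i][j] = U[i][j] * sv[j]: because Sigma is diagonal,
    # the product reduces to scaling column j of U by singular value j.
    return [[u_ij * sv[j] for j, u_ij in enumerate(row)] for row in U]

def scale_columns_half_sigma(U, sv):
    # U * Sigma^(1/2): the same idea using square roots of the values.
    return scale_columns(U, [math.sqrt(s) for s in sv])
```

Swapping between U * Sigma and U * Sigma^(1/2) is then a one-line change, which makes it easy to compare the two weightings for clustering.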
easy patch.
On Sep 7, 2012 9:11 AM, "Pat Ferrel" wrote:
> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
>
> WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance
> proportions for doing clustering and similarity (though n
ically, the way it works, Q matrix inherits keys of A rows
(BtJob line 137), and U inherits keys of Q (UJob line 128).
On Fri, Sep 7, 2012 at 1:19 PM, Dmitriy Lyubimov wrote:
> On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel wrote:
>> OK, U * Sigma seems to be working in the patch of SSVD
Regarding SSVD + clustering
I tried the command line version of kmeans on U*Sigma and don't get row IDs in
clusteredPoints there either. Using the command line kmeans on the input matrix
A does generate row IDs. There must be some difference in the two that causes
this to happen.
I used seq2s
Running kmeans on doc vectors turned into a DistributedRowMatrix works fine (no
surprise).
But when I do an SSVD on the above input, then create U * Sigma, a
DistributedRowMatrix (IntWritable, VectorWritable) I get clusters in
clusters-xx-final but in clusteredPoints the vectors have no IDs. Th
Maybe I should reword this since it has nothing to do with SSVD.
When doing clustering and asking the driver to cluster the input vectors after
the clusters are computed, it creates a file called clusteredPoints/part-m-xxx.
In it are cluster IDs and input vector pairs (IntWritable, VectorWritable)
200 iterations?
What is your convergence delta? If it is too small for your distance measure
you will perform all 200 iterations, every time you cluster.
--convergenceDelta (-cd) convergenceDelta    The convergence delta value. Default is 0.5
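The interaction between -cd and the iteration cap can be pictured as a check run after every kmeans iteration; clustering stops at whichever comes first, convergence or the max iteration count. A sketch (my own function, and the distance measure is assumed Euclidean here):

```python
import math

def converged(old_centroids, new_centroids, delta=0.5):
    # Converged when no centroid moved farther than delta in the last
    # iteration. Too small a delta for your distance measure means this
    # never fires and you run the full iteration budget every time.
    for old, new in zip(old_centroids, new_centroids):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(old, new)))
        if dist > delta:
            return False
    return True
```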
I wo
What distance measure?
On Sep 12, 2012, at 10:37 PM, Elaine Gan wrote:
My -cd was quite loose, set it at 0.1
Hmm.. maybe the data is too small, causing the low performance..?
> 200 iterations?
>
> What is your convergence delta? If it is too small for your distance measure
> you will perfor
then look at the data, tune your other parameters,
scrub your input etc. before tightening your delta. If it takes 6 hours to
cluster then tuning your other params will take too long so do them first.
On Sep 13, 2012, at 7:59 AM, Pat Ferrel wrote:
What distance measure?
On Sep 12, 2012, at 10
Very nice post. Thanks.
I wonder if another problem that could benefit from the same approach is
finding cluster names. Imagine finding the most important sentence of the cluster
instead of for a single doc using the same methods (break docs into sentences
etc). Then use parts of speech to conden
Let me go out on a limb and explain my understanding in layman's terms,
hopefully someone will correct me where I have erred...
What Dmitriy describes below creates a matrix "output". This is your original
matrix transformed into the new reduced dimensionality space. It will have a
row for eac
I wrote that doc and AFAIK you have to use the docIndex. The names are
preserved in the matrix file and are duped in the docIndex file so the issue is
not with the row id job. But the row similarity job strips the names from the
vectors it puts in named-similarity (using your dirs from below). A
The confusion here may be over the term "supervised"
Supervised classification assumes you know which group each user is in, and the
classifier builds a model to classify new users into the predefined groups.
Usually there is a classifier for each group that, when given a user vector,
return h
Using a boolean data model and log likelihood similarity I get recommendations
with strengths.
If I were using preference rating magnitudes the recommendation strength is
interpreted as the likely magnitude that a user would rate the recommendation.
Using the boolean model I get values approach
* however since it means results can't be ranked by preference value (all
are 1). So instead this returns a
* sum of similarities to any other user in the neighborhood who has also
rated the item.
*/
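What that javadoc describes amounts to the following (a sketch of the idea, not the Taste classes; the names are mine): since all boolean prefs are 1, a weighted average of ratings cannot rank anything, so the estimate becomes a sum of similarities.

```python
def estimate_boolean_preference(similarities, neighborhood, rated_by):
    # similarities: user -> similarity to the target user.
    # neighborhood: the target user's neighbors; rated_by: the set of
    # neighbors who rated the item. With boolean data every rating is 1,
    # so the "weighted average" collapses to a sum of similarities.
    return sum(similarities[u] for u in neighborhood if u in rated_by)
```

This is why the returned "strengths" are not bounded like ratings: an item rated by many similar neighbors can sum well past 1.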
On Nov 15, 2012, at 9:59 AM, Pat Ferrel wrote:
Using a boolean data model and log l
similarity, weighted by count -- which is to say, it's a
sum of similarities. This isn't terribly principled but works reasonably in
practice. A simple average tends to over-weight unpopular items, but there
are likely better ways to account for that.
On Thu, Nov 15, 2012 at 5:59 PM,
Trying to catch up.
Isn't the sum of similarities actually a globally comparable number for
strength of preference in a boolean model? I was thinking it wasn't but it is
really. It may not be ideal but as an ordinal it should work, right?
Is the logic behind the IDF idea that very popular items
I'm doing a very simple recommender based on binary data. Using
GenericRecommenderIRStatsEvaluator I get nDCG = NaN for each user. My data is
still very incomplete, which means an extremely low cooccurrence rate but there
are some since otherwise I'd expect P and R to be 0 and they are not. For
will have to decide what NaN
means.
I am happy to change that -- but would not pay attention to these
tests at this scale.
On Mon, Dec 3, 2012 at 7:55 PM, Pat Ferrel wrote:
> I'm doing a very simple recommender based on binary data. Using
> GenericRecommenderIRStatsEvaluator I g
does anyone know if mahout/examples/bin/factorize-movielens-1M.sh is still
working? CLI version of splitDataset is crashing in my build (latest trunk).
Even just running "mahout splitDataset" to get the params crashes. Wouldn't be the first time
I mucked up a build though.
it complete correctly. Not
exactly sure how this is supposed to be done; it doesn't look like the options
get parsed in the super class automatically.
This will cause any invocation of splitDataset or DatasetSplitter to crash
running the current trunk.
On Dec 5, 2012, at 1:58 PM, Pat Ferre
What is the intuition regarding the choice or tuning of the ALS params?
Job-Specific Options:
--lambda lambda regularization
parameter
--implicitFeedback implicitFeedback
+1 this, found the same problems, same fixes. Haven't seen your last problem
On Jan 11, 2013, at 1:41 PM, Ying Liao wrote:
I am trying factorize-movielens-1M.sh. I first found a bug in the sh file.
Then I found a bug in org.apache.mahout.cf.taste.hadoop.als.DatasetSplitter,
the argMap is not mapped
elter wrote:
Which version/distribution of Hadoop are you using?
On 17.01.2013 16:08, Pat Ferrel wrote:
> +1 this, found the same problems, same fixes. Haven't seen your last problem
>
> On Jan 11, 2013, at 1:41 PM, Ying Liao wrote:
>
> I am trying factorize-movielens-1M.sh.
RE: Temporal effects. In CF you are interested in similarities. For instance in
a User-based CF recommender you want to detect users similar to a given user.
The time decay of the similarities is likely to be very slow. In other words if
I bought an iPad 1 and you bought an iPad 1, the similarity
mporal dynamics.
On Sat, Feb 2, 2013 at 9:54 AM, Pat Ferrel wrote:
> RE: Temporal effects. In CF you are interested in similarities. For
> instance in a User-based CF recommender you want to detect users similar to
> a given user. The time decay of the similarities is likely to be ve
2013 at 1:03 PM, Pat Ferrel wrote:
> Indeed, please elaborate. Not sure what you mean by "this is an important
> effect"
>
> Do you disagree with what I said re temporal decay?
>
No. I agree with it. Human relatedness decays much more quickly than item
popularity.
I
:
On Tue, Feb 5, 2013 at 11:29 AM, Pat Ferrel wrote:
> I think you meant: "Human relatedness decays much slower than item
> popularity."
>
Yes. Oops.
> So to make sure I understand the implications of using IDF… For
> boolean/implicit preferences the sum of all pref
The effect of downweighting the popular items is very similar to removing them
from recommendations so I still suspect precision will go down using IDF.
Obviously this can pretty easily be tested, I just wondered if anyone had
already done it.
This brings up a problem with holdout based precisi
I'd like to experiment with using several types of implicit preference values
with recommenders. I have purchases as an implicit pref of high strength. I'd
like to see if add-to-cart, view-product-details, impressions-seen, etc. can
increase offline precision in holdout tests. These less than ob
nt for the effect: you looked at certain items
and eventually purchased one and I looked at the same items so I might like
what you purchased. It also seems to work better in the existing mahout
framework.
On Feb 9, 2013, at 11:50 AM, Pat Ferrel wrote:
I'd like to experiment with using s
together but not as strongly as ought to
> be obvious from the fact that they're the same. Still, interesting
thought.
There are several methods for recommending things given a shopping cart
contents. At the risk of using the same tool for every problem I was thinking
about a recommender's use here.
I'd do something like train on shopping cart purchases so row = cartID, column
= itemID.
Given cart contents I co
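The "row = cartID, column = itemID" framing boils down to item cooccurrence across carts. A toy sketch of the retrieval step (function names mine, and a plain cooccurrence count standing in for whatever similarity measure the recommender would actually use):

```python
from collections import Counter

def recommend_for_cart(historical_carts, cart_contents, how_many=2):
    # Count how often other items appear in historical carts that share
    # at least one item with the current cart, then return the most
    # frequent ones not already in the cart.
    counts = Counter()
    for cart in historical_carts:
        if cart_contents & set(cart):
            for item in cart:
                if item not in cart_contents:
                    counts[item] += 1
    return [item for item, _ in counts.most_common(how_many)]
```

At scale you would precompute the item-item counts rather than scan the carts per query, which is essentially what training the recommender on (cartID, itemID) pairs does.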
53 AM, Pat Ferrel wrote:
> There are several methods for recommending things given a shopping cart
> contents. At the risk of using the same tool for every problem I was
> thinking about a recommender's use here.
>
> I'd do something like train on shopping cart purch
eas you've mentioned here. Given N items in a cart,
which next item most frequently occurs in a purchased cart?
On Thu, Feb 14, 2013 at 6:30 PM, Pat Ferrel wrote:
> I thought you might say that but we don't have the add-to-cart action. We
> have to calculate cart purchases by ma
own version of it. Yes you are computing similarity
for k carted items by all N items, but is N so large? hundreds of
thousands of products? this is still likely pretty fast even if the
similarity is over millions of carts. Some smart precomputation and caching
goes a long way too.
On Thu, Feb 14
2013, at 6:09 PM, Ted Dunning wrote:
Do you see the contents of the cart?
Is the cart ID opaque? Does it persist as a surrogate for a user?
On Thu, Feb 14, 2013 at 10:30 AM, Pat Ferrel wrote:
> I thought you might say that but we don't have the add-to-cart action. We
> have t
Time splits are fine but may contain anomalies that bias the data. If you are
going to compare two recommenders based on time splits, make sure the data is
exactly the same for each recommender. One time split we did to create a 90-10
training to test set had a split date of 12/24! Some form of
be some 'signal' in this data, just a question of how much
> vs noise. A purchase means much more than a page view of course; it's not
> as subject to noise. Finding a means to use that info is probably going to
> help.
>
>
>
>
> On Sat, Feb 9, 2
My plan was to NOT use lucene to start with though I see the benefits. This is
because I want to experiment with weighting--doing idf, no weighting, and with
a non-log idf. Also I want to experiment with temporal decay of recommendability
and maybe blend item similarity based results in certain c
combined item recommendation matrix which is
roughly twice as much work as you need to do and it also doesn't let you
adjust weightings separately.
But it is probably the simplest way to get going with cross recommendation.
On Fri, Feb 22, 2013 at 9:48 AM, Pat Ferrel wrote:
> There
rm set of users to connect the items
together. When you compute the cooccurrence matrix you get A_1' A_1 + A_2'
A_2 which gives you recommendations from 1=>1 and from 2=>2, but no
recommendations 1=>2 or 2=>1. Thus, no cross recommendations.
On Sat, Feb 23, 2013 at 10
To pick up an old thread…
A = views items x users
B = purchases items x users
A cross recommender B'A h_v + B'B h_p = r_p
The B'B h_p is the basic boolean mahout recommender trained on purchases and
we'll use that implementation I assume.
B'A gives cooccurrences of views and purchases multiplyi
rong since view similarity unfiltered by
purchase is not ideal) or the cooccurrences in [B'A] and since this is not
symmetric it will matter whether I look at columns or rows. Either correspond
to item ids but similarities will be different.
Has anyone tried this sort of thing?
On Mar 19, 2
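The cross-recommender formula B'A h_v + B'B h_p = r_p can be checked on toy matrices. A sketch (my own helpers; note I orient the matrices as users x items, so that B'B and B'A come out item x item and r_p is a vector of item scores):

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    # Naive dense matrix product, fine for toy-sized examples.
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def cross_recommend(B, A, h_v, h_p):
    # r_p = [B'A] h_v + [B'B] h_p : purchase scores driven both by the
    # user's view history h_v and by the purchase history h_p.
    Bt = transpose(B)
    r_views = matvec(matmul(Bt, A), h_v)
    r_purch = matvec(matmul(Bt, B), h_p)
    return [x + y for x, y in zip(r_views, r_purch)]
```

The B'B term alone is the usual boolean item-based recommender; the B'A term is what adds the 1=>2 cross recommendations that A'A + B'B by itself cannot produce.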
Taking an idea from Ted, I'm working on a cross recommender starting from
mahout's m/r implementation of an item-based recommender. We have purchases and
views for items by user. It is straightforward to create a recommender on
purchases but using views as a predictor of purchases does not work
to each row of the
>> input matrix. You can think of it as computing A'A and sparsifying the
>> result afterwards. Furthermore it allows to plug in a similarity measure
>> of your choice.
>>
>> If you want to have a cooccurrence matrix, you can use
>>
>
ed it. I will need to pass in the size of the matrices as the size of the
> user and item space, correct?
On Apr 3, 2013, at 9:15 AM, Pat Ferrel wrote:
The non-symmetry of the [B'A] and the fact that it is calculated from two
models leads me to a rather heavy handed approach at least for a
I guess I don't understand this issue.
In my case both the item ids and user ids of the separate DistributedRowMatrix
will match and I know the size for the entire space from a previous step where
I create id maps. I suppose you are saying that the m/r code would be super
simple if a row of B'
I need to do the equivalent of the xrecommender.mostSimilarItems(long[]
itemIDs, int howMany)
To over simplify this, in the standard Item-Based Recommender this is
equivalent to looking at the item similarities from the preference matrix
(similarity of item purchases by user). In the xrecommen
like views and
purchases?
On Apr 8, 2013, at 2:31 PM, Ted Dunning wrote:
On Sat, Apr 6, 2013 at 3:26 PM, Pat Ferrel wrote:
> I guess I don't understand this issue.
>
> In my case both the item ids and user ids of the separate DistributedRow
> Matrix will match and I know th
to use Wikipedia articles (Myrrix, GraphLab).
Another idea is to use StackOverflow tags (Myrrix examples).
Although they are only good for emulating implicit feedback.
On Wed, Apr 10, 2013 at 6:48 PM, Ted Dunning wrote:
> On Wed, Apr 10, 2013 at 10:38 AM, Pat Ferrel
> wrote:
>
> >
Getting this running with co-occurrence rather than using a similarity calc on
user rows finally forced me to understand what is going on in the base
recommender. And the answer implies further work.
[B'B] is usually not calculated in the usual item based recommender. The matrix
that comes out
Or you may want to look at recording purchases by user ID. Then use the
standard recommender to train on (userID, itemsID, boolean). Then query the
trained recommender thus: recommender.mostSimilarItems(long itemID, int
howMany) This does what you want but uses more data than just what items wer
Do you not have a user ID? No matter (though if you do I'd use it) you can use
the item ID as a surrogate for a user ID in the recommender. And there will be
no filtering if you ask for recommender.mostSimilarItems(long itemID, int
howMany), which has no user ID in the call and so will not filte
That looks like the best shortcut. It is one of the few places where the rows
of one and the columns of the other are seen together. Now I know why you
transpose the first input :-)
But, I have begun to wonder whether it is the right thing to do for a cross
recommender because you are comparing
esource.
>
> Robin
>
>
> On 4/10/13 8:37 PM, "Pat Ferrel" wrote:
>
>> I have retail data but can't publish results from it. If I could get a
>> public sample I'd share how the technique worked out.
>>
>> Not sure how to simulate
om/api-profiles/products-api
http://www.kaggle.com/c/acm-sf-chapter-hackathon-big/data
On Mon, Apr 15, 2013 at 2:03 PM, Pat Ferrel wrote:
> MAJOR may be too tame a word.
>
> Furthermore there are several enhancements the community could make to
> support retail data and retail recommen
k to
> view.
>
>
> On Tue, Apr 16, 2013 at 4:53 PM, Pat Ferrel wrote:
>
>> For the cross-recommender we need some replacement for a primary
>> action--purchases and a secondary action--views, clicks, impressions,
>> something.
>>
>> To use this da
u can infer
the search from the data, just not all search results.
On Apr 16, 2013, at 1:24 PM, Pat Ferrel wrote:
I think Ted is talking about a different application of this idea:
http://www.slideshare.net/tdunning/search-as-recommendation
The IDs in my case must be in the same space, at very
You always will have a "cold start" problem for a subset of users--the new ones
to a site. Popularity doesn't always work either. Sometimes you have a flat
purchase frequency distribution, as I've seen. In these cases a metadata or
content based recommender is nice to fill in. If you have no met
I'm doing an experiment creating a recommender from a Pinterest crawl I have
going. I have at least three actions that relate to recommendations:
Goal: recommend people you (a pinterest user) might want to follow
Actions mined by crawling:
follows (user, user)
followed by (user, user)
repinned
Using a Hadoop version of a Mahout recommender will create some number of recs
for all users as its output. Sean is talking about Myrrix I think which uses
factorization to get much smaller models and so can calculate the recs at
runtime for fairly large user sets.
However if you are using Maho
On May 19, 2013 6:27 PM, "Pat Ferrel" wrote:
> Using a Hadoop version of a Mahout recommender will create some number of
> recs for all users as its output. Sean is talking about Myrrix I think
> which uses factorization to get much smaller models and so can calculate
> the
no user data in the matrix. Or are you talking about using the user history as
the query? in which case you have to remember somewhere all users' history and
look it up for the query, no?
On May 19, 2013, at 8:09 PM, Ted Dunning wrote:
On Sun, May 19, 2013 at 8:04 PM, Pat Ferrel wrote: