Re: RowSimilarity

2012-07-13 Thread Pat Ferrel
hing to find similar documents. Best, Sebastian On 13.07.2012 18:03, Pat Ferrel wrote: I increased the timeout to 100 minutes and added another machine (does the new machine matter in this case?). The job completed successfully. You say the algorithm is non-scalable--did you mean it's not

Re: Cluster Evaluation 0.8 style

2012-07-13 Thread Pat Ferrel
know what we are dealing with. Jeff On 7/13/12 3:58 PM, Pat Ferrel wrote: OK but I can't find it. It doesn't seem to be listed on the "mahout" CL help. Maybe there's some way to tell the script to execute an arbitrary driver? Anyway I just wrote a few lines to execut

Re: RowSimilarity

2012-07-14 Thread Pat Ferrel
ere a better venue than the mahout list? On 7/13/12 9:41 PM, Ken Krugler wrote: Hi Pat, On Jul 13, 2012, at 12:47pm, Pat Ferrel wrote: I also do clustering so that's an obvious optimization I just haven't gotten to it yet (doing similar only on docs clustered together). I'm al

Re: RowSimilarity

2012-07-18 Thread Pat Ferrel
'll look deeper into MoreLikeThis. In our use case we'll be taking the TFIDF term weights from a doc and reweighting some terms based on a user gesture. On 7/17/12 8:22 PM, Ken Krugler wrote: Hi Pat, On Jul 14, 2012, at 8:17am, Pat Ferrel wrote: Interesting. I have another re

Re: RowSimilarity

2012-07-19 Thread Pat Ferrel
ok deeper into MoreLikeThis. In our use case we'll be taking the TFIDF term weights from a doc and reweighting some terms based on a user gesture. On 7/17/12 8:22 PM, Ken Krugler wrote: Hi Pat, On Jul 14, 2012, at 8:17am, Pat Ferrel wrote: Interesting. I have another requirement, which

Re: k-means output missing some cluster centers coordinates

2012-07-20 Thread Pat Ferrel
Here is a quick walkthrough for doing kmeans clustering and looking at the input and output. https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line Be aware that some command line params have changed since it was written for 0.6. For instance

Re: Find closest documents in cluster

2012-07-28 Thread Pat Ferrel
I have the same requirement. The distances are scalars/magnitudes; without considering their direction you cannot assume what you say below. Take a look at RowSimilarity. This calculates the distance from each document to the others and you can specify how many close ones to find. It is not a nicely

Re: ERROR: OutOfMemoryError: Java heap space

2012-07-28 Thread Pat Ferrel
I've been changing the hadoop/conf/mapred-site.xml : mapred.child.java.opts -Xmx2048m map heap size for child task This ups the task heap to 2G. On 7/26/12 7:12 PM, Lance Norskog wrote: Increase the memory size or split the file! On Thu, Jul 26, 2012 at 5:37 AM, pricila r

Re: Extracting data from websites

2012-07-30 Thread Pat Ferrel
You may want to look at Bixo (openbixo.org), which is a web crawler built on Hadoop. There is a little extension to it that parses pages into plain text using boilerpipe (which removes boilerplate text from pages) and Tika. The crawler will take a list of URLs and filter them with regexes (in or out). It
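As a rough sketch of the boilerplate-removal step mentioned above (not Bixo's own extension, whose API I haven't checked), Boilerpipe's ArticleExtractor can be run directly on fetched HTML; the HTML string here is just a placeholder:

```java
import de.l3s.boilerpipe.extractors.ArticleExtractor;

// Sketch: strip navigation/boilerplate from a fetched page and keep the main text.
// The HTML literal stands in for whatever the crawler actually fetched.
public class ExtractMainText {
  public static void main(String[] args) throws Exception {
    String html = "<html><body><div id=\"nav\">menu menu menu</div>"
        + "<p>The article text we actually want to index.</p></body></html>";
    String text = ArticleExtractor.INSTANCE.getText(html);
    System.out.println(text);
  }
}
```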

RowSimilarity, Solr, or truncated clustering?

2012-07-30 Thread Pat Ferrel
I need to create groups of items that are similar to a seed item. This seed item may be a synthetic vector or may be based on a real document but it is known before the group is created. It may also contain weighted features that are not terms. There are several ways to do this mentioned below.

Re: Tags generation?

2012-08-03 Thread Pat Ferrel
We do what Ted describes by tossing frequently used terms with the IDF max, tossing stop words and stemming with a lucene analyzer. The stemming makes the tags less readable for sure but without it the near duplicate terms make for a strange looking tag list. With or without stemming the top TFI
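For illustration, a minimal sketch of the stemming step with a Lucene analyzer, assuming the Lucene 3.6-era API and EnglishAnalyzer (the thread doesn't say which analyzer was actually used):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Sketch: run a snippet through a stemming, stop-word-removing analyzer to see
// why the resulting "tags" (stems) differ from the readable surface forms.
public class StemDemo {
  public static void main(String[] args) throws Exception {
    EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_36);
    TokenStream ts = analyzer.tokenStream("text",
        new StringReader("Clustering the documents produces clusters"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString()); // stemmed tokens, stop words removed
    }
    ts.end();
    ts.close();
  }
}
```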

Re: Tags generation?

2012-08-05 Thread Pat Ferrel
The way back from stem to tag is interesting from the standpoint of making tags human readable. I had assumed a lookup but this seems much more satisfying and flexible. In order to keep frequencies it will take something like a dictionary creation step in the analyzer. This in turn seems to impl

SSVD for dimensional reduction + Kmeans

2012-08-09 Thread Pat Ferrel
Quoth Grant Ingersoll: > To put this in bin/mahout speak, this would look like, munging some names and > taking liberties with the actual argument to be passed in: > > bin/mahout svd (original -> svdOut) > bin/mahout cleansvd ... > bin/mahout transpose svdOut -> svdT > bin/mahout transpose o

Re: SSVD for dimensional reduction + Kmeans

2012-08-09 Thread Pat Ferrel
hastic+Singular+Value+Decomposition should help to clarify outputs and usage. On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov wrote: > On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel wrote: >> Quoth Grant Ingersoll: >>> To put this in bin/mahout speak, this would look like, m

Re: SSVD for dimensional reduction + Kmeans

2012-08-10 Thread Pat Ferrel
Stochastic+Singular+Value+Decomposition should help to clarify outputs and usage. On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov wrote: > On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel wrote: >> Quoth Grant Ingersoll: >>> To put this in bin/mahout speak, this would look like, mungin

Re: SSVD for dimensional reduction + Kmeans

2012-08-10 Thread Pat Ferrel
astic effects, but that is not what it is for really. Assuming your input is m x n, can you tell me please what your m, n, k and p are? thanks. -D On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel wrote: > There seems to be some internal constraint on k and/or p, which is making a > test diffic

Re: SSVD for dimensional reduction + Kmeans

2012-08-10 Thread Pat Ferrel
actually is > before it actually iterates over it and runs into block size > deficiency. So if you know m as external knowledge, it is easy to > avoid being trapped by block height deficiency. > > > On Fri, Aug 10, 2012 at 11:32 AM, Pat Ferrel wrote: >> This is only

Re: SSVD for dimensional reduction + Kmeans

2012-08-10 Thread Pat Ferrel
.e. essentially run a PCA space transformation on your data rather than just SVD) -d On Fri, Aug 10, 2012 at 11:57 AM, Pat Ferrel wrote: > Got it. Well on to some real and much larger data sets then… > > On Aug 10, 2012, at 11:53 AM, Dmitriy Lyubimov wrote: > > i think actu

SSVD + PCA

2012-08-18 Thread Pat Ferrel
Switching from API to CLI the parameter -t is described in the PDF --reduceTasks optional. The number of reducers to use (where applicable): depends on the size of the hadoop cluster. At this point it could also be overwritten by a standard hadoop property using -D option 4. Probably always

Re: SSVD + PCA

2012-08-19 Thread Pat Ferrel
d be worth noticing, even if we don't know the actual error in it. To say that your estimate of VR is valueless would require that we have some experience with it, no? On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov wrote: On Aug 18, 2012 8:32 AM, "Pat Ferrel" wrote: > > Swi

Re: SSVD + PCA

2012-08-20 Thread Pat Ferrel
between how I ran it from the API and from the CLI. On Aug 18, 2012, at 7:29 PM, Pat Ferrel wrote: -t Param I'm no hadoop expert but there are a couple parameters for each node in a cluster that specifies the default number of mappers and reducers for that node. There is a rule of thumb

Re: SSVD + PCA

2012-08-20 Thread Pat Ferrel
if it is using compression). > > I am not quite sure what you mean by "rowid" processing. > > > > On Sun, Aug 19, 2012 at 7:40 PM, Pat Ferrel wrote: >> Getting an odd error on SSVD. >> >> Starting with the QJob I get 9 map tasks for the data set

Re: SSVD + PCA

2012-08-21 Thread Pat Ferrel
ug 20, 2012, at 8:23 AM, Dmitriy Lyubimov wrote: On Aug 19, 2012 1:06 AM, "Pat Ferrel" wrote: > > -t Param > > I'm no hadoop expert but there are a couple parameters for each node in a cluster that specifies the default number of mappers and reducers for that node. Ther

Fwd: SSVD+PCA

2012-08-31 Thread Pat Ferrel
3:42:45 PM PDT To: Pat Ferrel Bottom line, external tools need to arrive at the offset and the solver just accept any offset (mean or otherwise) as a parameter with this api: solver.setPcaMeanPath(xiPath) xi is the mean (usually denoted by mu but in my working notes i had a conflict with

SSVD error

2012-08-31 Thread Pat Ferrel
Running on the local file system inside IDEA with MAHOUT_LOCAL set and performing an SSVD I get the error below. Notice that R-m-0 exists in the local file system and running it outside the debugger in pseudo-cluster mode with HDFS works. Does SSVD work in local mode? java.io.FileNotFoundEx

Re: SSVD error

2012-09-01 Thread Pat Ferrel
ot > require Hadoop dependencies but it is a different api with no PCA > option (not sure about power iterations). > > I am not sure why this very particular error appears in your setup. > > On Fri, Aug 31, 2012 at 3:02 PM, Pat Ferrel wrote: >> Running on the local file syst

Re: SSVD error

2012-09-01 Thread Pat Ferrel
hadoop tmp based files. On Sep 1, 2012, at 7:53 AM, Ted Dunning wrote: With 57 crawled docs, you can't reasonably set p > 57. That is your second error. On Sat, Sep 1, 2012 at 10:32 AM, Pat Ferrel wrote: > I have a small data set that I am using in local mode for debugging > purp

Re: SSVD error

2012-09-01 Thread Pat Ferrel
n your case. On Sep 1, 2012 7:39 AM, "Pat Ferrel" wrote: > I have a small data set that I am using in local mode for debugging > purposes. The data is 57 crawled docs with something like 2200 terms. I run > this through seq2sparse, then my own cloned version of rowid to get a

Re: SSVD error

2012-09-01 Thread Pat Ferrel
confusion between k and p (I was confused as well) you still can't set the sum to more than the minimum size of your data. Here you have set it larger. And it breaks. On Sat, Sep 1, 2012 at 11:09 AM, Pat Ferrel wrote: > Oh, sorry, below I meant to say k (the number to reduce to) not p. >

Re: SSVD error

2012-09-01 Thread Pat Ferrel
11 AM, Dmitriy Lyubimov wrote: Another guess I have is that perhaps you used relative paths when specifying the temp dir? Try to use absolute paths. On Sep 1, 2012 10:09 AM, "Pat Ferrel" wrote: > Yes, I understand why #2 failed. I guess I'm asking how to get this to > succeed. Withou

Re: PCA doc question for devs:

2012-09-05 Thread Pat Ferrel
Trying to do dimensionality reduction with SSVD then running the new doc matrix through kmeans. The Lanczos + ClusterDump test of SVD + kmeans uses A-hat = A^t V^t. Unfortunately this results in anonymous vectors in clusteredPoints after A-hat is run through kmeans. The doc ids are lost due to

Re: PCA doc question for devs:

2012-09-05 Thread Pat Ferrel
--pca option. As Ted suggests, you may also use US^0.5 which is already produced by providing --uHalfSigma (or its embedded setter analog). the keys of that output (produced by getUPath() call) will already contain your Text document ids as sequence file keys. -d On Wed, Sep 5, 2012 at 5:20 PM

Doing dimensionality reduction with SSVD and Lanczos

2012-09-06 Thread Pat Ferrel
When using Lanczos the recommendation is to use the clean eigenvectors as a distributed row matrix--call it V. A-hat = A^t V^t, per the clusterdump tests DSVD and DSVD2. Dmitriy and Ted recommend when using SSVD to do: A-hat = US. When using PCA it's also preferable to use --uHalfSigma to crea
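For reference, my reading of the two recipes as the same projection, assuming A holds one document per row:

```latex
% Rank-k SVD of the document-term matrix A (documents as rows):
A \;\approx\; U_k \Sigma_k V_k^{\top}, \qquad
\hat{A} \;=\; A V_k \;\approx\; U_k \Sigma_k
% The Lanczos path materializes V_k (eigenvectors of A^{\top}A) and multiplies it
% back into A; SSVD emits U keyed by document directly, so U\Sigma (or
% U\Sigma^{1/2} via --uHalfSigma for PCA-style scaling) is what goes to k-means.
```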

Re: SSVD error

2012-09-06 Thread Pat Ferrel
To reiterate the situation. In local mode using the local file system SSVD dies with a file not found. In pseudo-cluster mode using hdfs SSVD on the same data it runs correctly. All the rest of the analysis pipeline works fine in either mode. I am using local mode to debug my surrounding code.

Re: SSVD error

2012-09-06 Thread Pat Ferrel
rything > else works there just the same. > > That said, you can disable DistributedCache in some cases using > SSVDSolver#setBroadcast(false). (in spite of what javadoc says, it is > enabled by default... my bad). > > On Thu, Sep 6, 2012 at 11:18 AM, Pat Ferrel wrote: >

SSVD compute U * Sigma

2012-09-07 Thread Pat Ferrel
U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply". WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance proportions for doing clustering and similarity (though not sure if this is strictly required with cosine distance) I probably want to use U * Sigma inste
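A minimal sketch of that "write your own multiply" on plain arrays (names are illustrative; this is not the m/r job):

```java
// (U * Sigma)[i][j] = U[i][j] * sv[j]: scale column j of U by singular value sv[j].
// For U * Sigma^(1/2), use Math.sqrt(sv[j]) instead.
public class ScaleBySingularValues {
  static double[][] timesSigma(double[][] u, double[] sv) {
    double[][] scaled = new double[u.length][];
    for (int i = 0; i < u.length; i++) {
      scaled[i] = new double[sv.length];
      for (int j = 0; j < sv.length; j++) {
        scaled[i][j] = u[i][j] * sv[j];
      }
    }
    return scaled;
  }
}
```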

Re: SSVD compute U * Sigma

2012-09-07 Thread Pat Ferrel
easy patch. On Sep 7, 2012 9:11 AM, "Pat Ferrel" wrote: > U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply". > > WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance > proportions for doing clustering and similarity (though n

Re: SSVD compute U * Sigma

2012-09-07 Thread Pat Ferrel
ically, the way it works, Q matrix inherits keys of A rows (BtJob line 137), and U inherits keys of Q (UJob line 128). On Fri, Sep 7, 2012 at 1:19 PM, Dmitriy Lyubimov wrote: > On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel wrote: >> OK, U * Sigma seems to be working in the patch of SSVD

Anonymous rows in clusters after SSVD

2012-09-09 Thread Pat Ferrel
Regarding SSVD + clustering I tried the command line version of kmeans on U*Sigma and don't get row IDs in clusteredPoints there either. Using the command line kmeans on the input matrix A does generate row IDs. There must be some difference in the two that causes this to happen. I used seq2s

Kmeans on SSVD output

2012-09-11 Thread Pat Ferrel
Running kmeans on doc vectors turned into a DistributedRowMatrix works fine (no surprise). But when I do an SSVD on the above input, then create U * Sigma, a DistributedRowMatrix (IntWritable, VectorWritable) I get clusters in clusters-xx-final but in clusteredPoints the vectors have no IDs. Th

Mapping clusteredPoints to clusters

2012-09-11 Thread Pat Ferrel
Maybe I should reword this since it has nothing to do with SSVD. When doing clustering and asking the driver to cluster the input vectors after the clusters are computed it creates a file called clusteredPoints/part-m-xxx In it are cluster IDs and input vector pairs (IntWritable, VectorWritable)
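A hedged sketch of inspecting that file with the plain Hadoop SequenceFile API (reflection is used so the exact Mahout key/value classes don't matter; the path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch: walk clusteredPoints and print (clusterId, clustered vector) pairs.
// If the input vectors were NamedVectors, the value's toString() shows the name.
public class DumpClusteredPoints {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path part = new Path("output/clusteredPoints/part-m-00000"); // illustrative path
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(conf), part, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    while (reader.next(key, value)) {
      System.out.println(key + "\t" + value); // key is the cluster ID
    }
    reader.close();
  }
}
```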

Re: Is mahout kmeans slow ?

2012-09-12 Thread Pat Ferrel
200 iterations? What is your convergence delta? If it is too small for your distance measure you will perform all 200 iterations, every time you cluster. --convergenceDelta (-cd) convergenceDelta: The convergence delta value. Default is 0.5. I wo

Re: Is mahout kmeans slow ?

2012-09-13 Thread Pat Ferrel
What distance measure? On Sep 12, 2012, at 10:37 PM, Elaine Gan wrote: My -cd was quite loose, set it at 0.1 Hmm.. maybe the data is too small, causing the low performance..? > 200 iterations? > > What is your convergence delta? If it is too small for your distance measure > you will perfor

Re: Is mahout kmeans slow ?

2012-09-13 Thread Pat Ferrel
then look at the data, tune your other parameters, scrub your input, etc. before tightening your delta. If it takes 6 hours to cluster then tuning your other params will take too long, so do them first. On Sep 13, 2012, at 7:59 AM, Pat Ferrel wrote: What distance measure? On Sep 12, 2012, at 10

Re: Document summarization with lsa - blog post series

2012-09-17 Thread Pat Ferrel
Very nice post. Thanks. I wonder if another problem that could benefit from the same approach is finding cluster names. Imagine finding the most important sentence of the cluster instead of for a single doc, using the same methods (break docs into sentences etc.). Then use parts of speech to conden

Re: How to use ssvd for dimensionality reduction of tfidf-vectors?

2012-10-24 Thread Pat Ferrel
Let me go out on a limb and explain my understanding in layman's terms, hopefully someone will correct me where I have erred... What Dmitriy describes below creates a matrix "output". This is your original matrix transformed into the new reduced dimensionality space. It will have a row for eac

Re: Preserving named vectors during rowsimilarity

2012-10-25 Thread Pat Ferrel
I wrote that doc and AFAIK you have to use the docIndex. The names are preserved in the matrix file and are duped in the docIndex file so the issue is not with the row id job. But the row similarity job strips the names from the vectors it puts in named-similarity (using your dirs from below). A

Re: need help on mahout

2012-11-09 Thread Pat Ferrel
The confusion here may be over the term "supervised". Supervised classification assumes you know which group each user is in, and the classifier builds a model to classify new users into the predefined groups. Usually there is a classifier for each group that, when given a user vector, return h

How to interpret recommendation strength

2012-11-15 Thread Pat Ferrel
Using a boolean data model and log-likelihood similarity I get recommendations with strengths. If I were using preference rating magnitudes, the recommendation strength would be interpreted as the likely rating a user would give the recommended item. Using the boolean model I get values approach
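For context, a minimal sketch of the setup described here with the Taste API of that era (the file path, neighborhood size, and user ID are illustrative):

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

// Sketch: boolean preferences + log-likelihood similarity, user-based recommender.
public class BooleanRecommenderDemo {
  public static void main(String[] args) throws Exception {
    DataModel base = new FileDataModel(new File("prefs.csv")); // userID,itemID per line
    DataModel model = new GenericBooleanPrefDataModel(
        GenericBooleanPrefDataModel.toDataMap(base));
    LogLikelihoodSimilarity similarity = new LogLikelihoodSimilarity(model);
    NearestNUserNeighborhood neighborhood =
        new NearestNUserNeighborhood(50, similarity, model);
    GenericBooleanPrefUserBasedRecommender recommender =
        new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);
    for (RecommendedItem rec : recommender.recommend(123L, 10)) {
      System.out.println(rec.getItemID() + "\t" + rec.getValue());
    }
  }
}
```

As the replies below note, the printed value is a sum of similarities to neighbors who also have the item, not a predicted rating, so it is best read as an ordinal score.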

Re: How to interpret recommendation strength

2012-11-15 Thread Pat Ferrel
* however since it means results can't be ranked by preference value (all are 1). So instead this returns a * sum of similarities to any other user in the neighborhood who has also rated the item. */ On Nov 15, 2012, at 9:59 AM, Pat Ferrel wrote: Using a boolean data model and log l

Re: How to interpret recommendation strength

2012-11-15 Thread Pat Ferrel
ighted by count -- which is to say, it's a sum of similarities. This isn't terribly principled but works reasonably in practice. A simple average tends to over-weight unpopular items, but there are likely better ways to account for that. On Thu, Nov 15, 2012 at 5:59 PM, Pat Ferrel wrote

Re: How to interpret recommendation strength

2012-11-15 Thread Pat Ferrel
similarity, weighted by count -- which is to say, it's a sum of similarities. This isn't terribly principled but works reasonably in practice. A simple average tends to over-weight unpopular items, but there are likely better ways to account for that. On Thu, Nov 15, 2012 at 5:59 PM,

Re: How to interpret recommendation strength

2012-11-15 Thread Pat Ferrel
Trying to catch up. Isn't the sum of similarities actually a globally comparable number for strength of preference in a boolean model? I was thinking it wasn't but it is really. It may not be ideal but as an ordinal it should work, right? Is the logic behind the IDF idea that very popular items

Recommender Evaluator

2012-12-03 Thread Pat Ferrel
I'm doing a very simple recommender based on binary data. Using GenericRecommenderIRStatsEvaluator I get nDCG = NaN for each user. My data is still very incomplete, which means an extremely low cooccurrence rate, but there are some cooccurrences, since otherwise I'd expect P and R to be 0 and they are not. For
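A sketch of how GenericRecommenderIRStatsEvaluator is typically driven; the recommender builder, "at" value, and evaluation percentage below are illustrative choices, not the thread's actual settings:

```java
import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

// Sketch: precision/recall/nDCG at 10 over the whole data set.
public class IRStatsDemo {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("prefs.csv"));
    RecommenderBuilder builder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        LogLikelihoodSimilarity sim = new LogLikelihoodSimilarity(dataModel);
        return new GenericBooleanPrefUserBasedRecommender(
            dataModel, new NearestNUserNeighborhood(50, sim, dataModel), sim);
      }
    };
    RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
    IRStatistics stats = evaluator.evaluate(builder, null, model, null, 10,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
    System.out.println("P=" + stats.getPrecision() + " R=" + stats.getRecall()
        + " nDCG=" + stats.getNormalizedDiscountedCumulativeGain());
  }
}
```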

Re: Recommender Evaluator

2012-12-03 Thread Pat Ferrel
will have to decide what NaN means. I am happy to change that -- but would not pay attention to these tests at this scale. On Mon, Dec 3, 2012 at 7:55 PM, Pat Ferrel wrote: > I'm doing a very simple recommender based on binary data. Using > GenericRecommenderIRStatsEvaluator I g

splitDataset

2012-12-05 Thread Pat Ferrel
does anyone know if mahout/examples/bin/factorize-movielens-1M.sh is still working? CLI version of splitDataset is crashing in my build (latest trunk). Even as in "mahout splitDataset" to get the params. Wouldn't be the first time I mucked up a build though.

Re: splitDataset crashes

2012-12-07 Thread Pat Ferrel
it complete correctly. Not exactly sure how this is supposed to be done, it doesn't look like the options get parsed in the super class automatically. This will cause any invocation of splitDataset or DatasetSplitter to crash running the current trunk. On Dec 5, 2012, at 1:58 PM, Pat Ferre

Parameter choice and tuning parallelALS

2013-01-02 Thread Pat Ferrel
What is the intuition regarding the choice or tuning of the ALS params? Job-Specific Options: --lambda lambda (regularization parameter); --implicitFeedback implicitFeedback
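For orientation, the λ supplied by --lambda enters the usual regularized least-squares objective that ALS alternates over (a sketch of the standard formulation, not Mahout's exact weighting):

```latex
\min_{U,M} \sum_{(u,i)\ \mathrm{observed}} \bigl(r_{ui} - \mathbf{u}_u^{\top}\mathbf{m}_i\bigr)^2
  \;+\; \lambda \Bigl(\sum_u \lVert \mathbf{u}_u \rVert^2 + \sum_i \lVert \mathbf{m}_i \rVert^2\Bigr)
```

Larger λ trades fit for generalization; as I understand it, --implicitFeedback switches to the confidence-weighted variant used for implicit data.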

Re: is Hadoop based SVD_ALS a complete feature?

2013-01-17 Thread Pat Ferrel
+1 this, found the same problems, same fixes. Haven't seen your last problem. On Jan 11, 2013, at 1:41 PM, Ying Liao wrote: I am trying factorize-movielens-1M.sh. I first find a bug in the sh file. Then I find a bug in org.apache.mahout.cf.taste.hadoop.als.DatasetSplitter, the argMap is not mapped

Re: is Hadoop based SVD_ALS a complete feature?

2013-01-17 Thread Pat Ferrel
elter wrote: Which version/distribution of Hadoop are you using? On 17.01.2013 16:08, Pat Ferrel wrote: > +1 this, found the same problems, same fixes. Haven't seen your last problem > > On Jan 11, 2013, at 1:41 PM, Ying Liao wrote: > > I am trying factorize-movielens-1M.

Re: (near) real time recommender/predictor

2013-02-02 Thread Pat Ferrel
RE: Temporal effects. In CF you are interested in similarities. For instance in a User-based CF recommender you want to detect users similar to a given user. The time decay of the similarities is likely to be very slow. In other words, if I bought an iPad 1 and you bought an iPad 1, the similarity

Re: (near) real time recommender/predictor

2013-02-02 Thread Pat Ferrel
mporal dynamics. On Sat, Feb 2, 2013 at 9:54 AM, Pat Ferrel wrote: > RE: Temporal effects. In CF you are interested in similarities. For > instance in a User-based CF recommender you want to detect users similar to > a given user. The time decay of the similarities is likely to be ve

Using IDF in CF recommender

2013-02-05 Thread Pat Ferrel
2013 at 1:03 PM, Pat Ferrel wrote: > Indeed, please elaborate. Not sure what you mean by "this is an important > effect" > > Do you disagree with what I said re temporal decay? > No. I agree with it. Human relatedness decays much more quickly than item popularity. I

Re: Using IDF in CF recommender

2013-02-06 Thread Pat Ferrel
: On Tue, Feb 5, 2013 at 11:29 AM, Pat Ferrel wrote: > I think you meant: "Human relatedness decays much slower than item > popularity." > Yes. Oops. > So to make sure I understand the implications of using IDF… For > boolean/implicit preferences the sum of all pref

Re: Using IDF in CF recommender

2013-02-06 Thread Pat Ferrel
The effect of downweighting the popular items is very similar to removing them from recommendations, so I still suspect precision will go down using IDF. Obviously this can pretty easily be tested; I just wondered if anyone had already done it. This brings up a problem with holdout-based precisi

Implicit preferences

2013-02-09 Thread Pat Ferrel
I'd like to experiment with using several types of implicit preference values with recommenders. I have purchases as an implicit pref of high strength. I'd like to see if add-to-cart, view-product-details, impressions-seen, etc. can increase offline precision in holdout tests. These less than ob
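One hedged way to set up such an experiment with the in-memory Taste classes is to encode each action type as a preference weight before building the DataModel; the weights below are arbitrary placeholders:

```java
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;

// Sketch: map action types to preference strengths (weights are placeholders
// for experimentation) and build an in-memory DataModel from them.
public class WeightedImplicitPrefs {
  static final float PURCHASE = 4.0f, ADD_TO_CART = 2.0f, VIEW_DETAILS = 1.0f;

  public static void main(String[] args) {
    FastByIDMap<PreferenceArray> data = new FastByIDMap<PreferenceArray>();
    PreferenceArray userPrefs = new GenericUserPreferenceArray(2);
    userPrefs.setUserID(0, 1L);
    userPrefs.setItemID(0, 101L);
    userPrefs.setValue(0, PURCHASE);     // user 1 purchased item 101
    userPrefs.setUserID(1, 1L);
    userPrefs.setItemID(1, 102L);
    userPrefs.setValue(1, VIEW_DETAILS); // user 1 only viewed item 102
    data.put(1L, userPrefs);
    DataModel model = new GenericDataModel(data);
    System.out.println(model);
  }
}
```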

Re: Implicit preferences

2013-02-09 Thread Pat Ferrel
nt for the effect: you looked at certain items and eventually purchased one, and I looked at the same items, so I might like what you purchased. It also seems to work better in the existing Mahout framework. On Feb 9, 2013, at 11:50 AM, Pat Ferrel wrote: I'd like to experiment with using s

Re: Implicit preferences

2013-02-12 Thread Pat Ferrel
together but not as strongly as ought to > be obvious from the fact that they're the same. Still, interesting thought. > > There ought to be some 'signal' in this data, just a question of how much > vs noise. A purchase means much more than a page view of course; it'

Shopping cart

2013-02-14 Thread Pat Ferrel
There are several methods for recommending things given a shopping cart contents. At the risk of using the same tool for every problem I was thinking about a recommender's use here. I'd do something like train on shopping cart purchases so row = cartID, column = itemID. Given cart contents I co
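A minimal sketch of the cart-as-user idea with the Taste API, treating cartID as the user ID and asking for items most similar to the current cart contents (the file, IDs, and counts are illustrative):

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

// Sketch: rows are carts, columns are items (cartID stands in for userID).
// Given the items already in a cart, ask for the most similar items.
public class CartRecommenderDemo {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("cart-purchases.csv")); // cartID,itemID
    GenericItemBasedRecommender recommender = new GenericItemBasedRecommender(
        model, new LogLikelihoodSimilarity(model));
    long[] currentCart = {101L, 205L}; // items already in the cart
    List<RecommendedItem> next = recommender.mostSimilarItems(currentCart, 5);
    for (RecommendedItem item : next) {
      System.out.println(item.getItemID() + "\t" + item.getValue());
    }
  }
}
```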

Re: Shopping cart

2013-02-14 Thread Pat Ferrel
53 AM, Pat Ferrel wrote: > There are several methods for recommending things given a shopping cart > contents. At the risk of using the same tool for every problem I was > thinking about a recommender's use here. > > I'd do something like train on shopping cart purch

Re: Shopping cart

2013-02-14 Thread Pat Ferrel
eas you've mentioned here. Given N items in a cart, which next item most frequently occurs in a purchased cart? On Thu, Feb 14, 2013 at 6:30 PM, Pat Ferrel wrote: > I thought you might say that but we don't have the add-to-cart action. We > have to calculate cart purchases by ma

Re: Shopping cart

2013-02-14 Thread Pat Ferrel
own version of it. Yes you are computing similarity for k carted items by all N items, but is N so large? hundreds of thousands of products? this is still likely pretty fast even if the similarity is over millions of carts. Some smart precomputation and caching goes a long way too. On Thu, Feb 14

Re: Shopping cart

2013-02-14 Thread Pat Ferrel
2013, at 6:09 PM, Ted Dunning wrote: Do you see the contents of the cart? Is the cart ID opaque? Does it persist as a surrogate for a user? On Thu, Feb 14, 2013 at 10:30 AM, Pat Ferrel wrote: > I thought you might say that but we don't have the add-to-cart action. We > have t

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-17 Thread Pat Ferrel
Time splits are fine but may contain anomalies that bias the data. If you are going to compare two recommenders based on time splits, make sure the data is exactly the same for each recommender. One time split we did to create a 90-10 training to test set had a split date of 12/24! Some form of

Cross recommendation

2013-02-21 Thread Pat Ferrel
be some 'signal' in this data, just a question of how much > vs noise. A purchase means much more than a page view of course; it's not > as subject to noise. Finding a means to use that info is probably going to > help. > > > > > On Sat, Feb 9, 2

Re: Cross recommendation

2013-02-22 Thread Pat Ferrel
My plan was to NOT use Lucene to start with, though I see the benefits. This is because I want to experiment with weighting--doing IDF, no weighting, and a non-log IDF. Also I want to experiment with temporal decay of recommendability and maybe blend item-similarity-based results in certain c

Re: Cross recommendation

2013-02-23 Thread Pat Ferrel
combined item recommendation matrix which is roughly twice as much work as you need to do and it also doesn't let you adjust weightings separately. But it is probably the simplest way to get going with cross recommendation. On Fri, Feb 22, 2013 at 9:48 AM, Pat Ferrel wrote: > There

Re: Cross recommendation

2013-02-24 Thread Pat Ferrel
rm set of users to connect the items together. When you compute the cooccurrence matrix you get A_1' A_1 + A_2' A_2 which gives you recommendations from 1=>1 and from 2=>2, but no recommendations 1=>2 or 2=>1. Thus, no cross recommendations. On Sat, Feb 23, 2013 at 10

[B'A] h_v cross recommender

2013-03-19 Thread Pat Ferrel
To pick up an old thread… A = views items x users B = purchases items x users A cross recommender B'A h_v + B'B h_p = r_p The B'B h_p is the basic boolean mahout recommender trained on purchases and we'll use that implementation I assume. B'A gives cooccurrences of views and purchases multiplyi
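Restating the scoring formula in matrix notation (my reading, written with users as rows so the products conform):

```latex
% B: users x items purchase matrix, A: users x items view matrix,
% h_p, h_v: one user's purchase and view history vectors, r_p: item scores for purchase recs.
r_p \;=\; \bigl(B^{\top}B\bigr)\,h_p \;+\; \bigl(B^{\top}A\bigr)\,h_v
% B^T B is the purchase cooccurrence used by the ordinary item-based recommender;
% B^T A counts view-to-purchase cooccurrences, which is what lets views
% contribute to purchase recommendations.
```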

Re: [B'A] h_v cross recommender

2013-03-19 Thread Pat Ferrel
rong since view similarity unfiltered by purchase is not ideal) or the cooccurrences in [B'A] and since this is not symmetric it will matter whether I look at columns or rows. Either correspond to item ids but similarities will be different. Has anyone tried this sort of thing? On Mar 19, 2

cross recommender

2013-04-02 Thread Pat Ferrel
Taking an idea from Ted, I'm working on a cross recommender starting from mahout's m/r implementation of an item-based recommender. We have purchases and views for items by user. It is straightforward to create a recommender on purchases but using views as a predictor of purchases does not work

Re: cross recommender

2013-04-03 Thread Pat Ferrel
to each row of the >> input matrix. You can think of it as computing A'A and sparsifying the >> result afterwards. Furthermore it allows to plug in a similarity measure >> of your choice. >> >> If you want to have a cooccurrence matrix, you can use >> >

Re: cross recommender

2013-04-04 Thread Pat Ferrel
ed it. I will need to pass in the size of the matrices as the size of the user and item space, Correct? On Apr 3, 2013, at 9:15 AM, Pat Ferrel wrote: The non-symmetry of the [B'A] and the fact that it is calculated from two models leads me to a rather heavy handed approach at least for a

Re: cross recommender

2013-04-06 Thread Pat Ferrel
I guess I don't understand this issue. In my case both the item ids and user ids of the separate DistributedRowMatrix will match, and I know the size for the entire space from a previous step where I create id maps. I suppose you are saying that the m/r code would be super simple if a row of B'

Re: cross recommender

2013-04-06 Thread Pat Ferrel
I need to do the equivalent of xrecommender.mostSimilarItems(long[] itemIDs, int howMany). To oversimplify, in the standard item-based recommender this is equivalent to looking at the item similarities from the preference matrix (similarity of item purchases by user). In the xrecommen

Re: cross recommender

2013-04-10 Thread Pat Ferrel
like views and purchases? On Apr 8, 2013, at 2:31 PM, Ted Dunning wrote: On Sat, Apr 6, 2013 at 3:26 PM, Pat Ferrel wrote: > I guess I don't understand this issue. > > In my case both the item ids and user ids of the separate DistributedRow > Matrix will match and I know th

Re: cross recommender

2013-04-10 Thread Pat Ferrel
to use Wikipedia articles (Myrrix, GraphLab). Another idea is to use StackOverflow tags (Myrrix examples). Although they are only good for emulating implicit feedback. On Wed, Apr 10, 2013 at 6:48 PM, Ted Dunning wrote: > On Wed, Apr 10, 2013 at 10:38 AM, Pat Ferrel > wrote: > >&g

Re: cross recommender

2013-04-11 Thread Pat Ferrel
Getting this running with co-occurrence rather than using a similarity calc on user rows finally forced me to understand what is going on in the base recommender. And the answer implies further work. [B'B] is usually not calculated in the usual item based recommender. The matrix that comes out

Re: Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Pat Ferrel
Or you may want to look at recording purchases by user ID. Then use the standard recommender to train on (userID, itemID, boolean). Then query the trained recommender thus: recommender.mostSimilarItems(long itemID, int howMany). This does what you want but uses more data than just what items wer

Re: Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Pat Ferrel
Do you not have a user ID? No matter (though if you do I'd use it) you can use the item ID as a surrogate for a user ID in the recommender. And there will be no filtering if you ask for recommender.mostSimilarItems(long itemID, int howMany), which has no user ID in the call and so will not filte

Re: cross recommender

2013-04-12 Thread Pat Ferrel
That looks like the best shortcut. It is one of the few places where the rows of one and the columns of the other are seen together. Now I know why you transpose the first input :-) But, I have begun to wonder whether it is the right thing to do for a cross recommender because you are comparing

Re: cross recommender

2013-04-15 Thread Pat Ferrel
esource. > > Robin > > > On 4/10/13 8:37 PM, "Pat Ferrel" wrote: > >> I have retail data but can't publish results from it. If I could get a >> public sample I'd share how the technique worked out. >> >> Not sure how to simulate

Re: cross recommender

2013-04-16 Thread Pat Ferrel
om/api-profiles/products-api http://www.kaggle.com/c/acm-sf-chapter-hackathon-big/data On Mon, Apr 15, 2013 at 2:03 PM, Pat Ferrel wrote: > MAJOR may be too tame a word. > > Furthermore there are several enhancements the community could make to > support retail data and retail recommen

Re: cross recommender

2013-04-16 Thread Pat Ferrel
k to > view. > > > On Tue, Apr 16, 2013 at 4:53 PM, Pat Ferrel wrote: > >> For the cross-recommender we need some replacement for a primary >> action--purchases and a secondary action--views, clicks, impressions, >> something. >> >> To use this da

Re: cross recommender

2013-04-16 Thread Pat Ferrel
u can infer the search from the data, just not all search results. On Apr 16, 2013, at 1:24 PM, Pat Ferrel wrote: I think Ted is talking about a different application of this idea: http://www.slideshare.net/tdunning/search-as-recommendation The IDs in my case must be in the same space, at very

Re: Clustering product views and sales

2013-05-07 Thread Pat Ferrel
You always will have a "cold start" problem for a subset of users--the new ones to a site. Popularity doesn't always work either. Sometimes you have a flat purchase frequency distribution, as I've seen. In these cases a metadata or content based recommender is nice to fill in. If you have no met

More Cross-recommender thoughts

2013-05-17 Thread Pat Ferrel
I'm doing an experiment creating a recommender from a Pinterest crawl I have going. I have at least three actions that relate to recommendations. Goal: recommend people you (a Pinterest user) might want to follow. Actions mined by crawling: follows (user, user), followed-by (user, user), repinned

Re: Which database should I use with Mahout

2013-05-19 Thread Pat Ferrel
Using a Hadoop version of a Mahout recommender will create some number of recs for all users as its output. Sean is talking about Myrrix I think which uses factorization to get much smaller models and so can calculate the recs at runtime for fairly large user sets. However if you are using Maho

Re: Which database should I use with Mahout

2013-05-19 Thread Pat Ferrel
On May 19, 2013 6:27 PM, "Pat Ferrel" wrote: > Using a Hadoop version of a Mahout recommender will create some number of > recs for all users as its output. Sean is talking about Myrrix I think > which uses factorization to get much smaller models and so can calculate > the

Re: Which database should I use with Mahout

2013-05-19 Thread Pat Ferrel
no user data in the matrix. Or are you talking about using the user history as the query? in which case you have to remember somewhere all users' history and look it up for the query, no? On May 19, 2013, at 8:09 PM, Ted Dunning wrote: On Sun, May 19, 2013 at 8:04 PM, Pat Ferrel wrote: &
