Re: Decision Forest - Partial implementation

2012-12-05 Thread deneche abdelhakim
You mean you want to classify a large dataset? The partial implementation is useful when the training dataset is too large to fit in memory. If it does fit, then you had better train the forest using the in-memory implementation. If you want to classify a large number of rows then you can add the par

Decision Forest - Partial implementation

2012-12-05 Thread Marty Kube
Hi, I'm working on improving classification throughput for a decision forest. I was wondering about the use case for the Partial Implementation. The quick start guide suggests that the Partial Implementation is designed for building a forest on large datasets. My problem is classification after training

Re: Mahout Amazon EMR usage cost

2012-12-05 Thread Koobas
On Wed, Dec 5, 2012 at 7:03 PM, Ted Dunning wrote: > On Wed, Dec 5, 2012 at 5:29 PM, Koobas wrote: > > > ... > > Now yet another naive question. > > Ted is probably going to go ballistic ;) > > > > I hope not. > > > > Assuming that simple overlap methods suck, > > is there still a metric that wo

Re: Mahout Amazon EMR usage cost

2012-12-05 Thread Ted Dunning
On Wed, Dec 5, 2012 at 5:29 PM, Koobas wrote: > ... > Now yet another naive question. > Ted is probably going to go ballistic ;) > I hope not. > Assuming that simple overlap methods suck, > is there still a metric that works better than others > (i.e. Tanimoto vs. Jaccard vs something else)? >
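For concreteness on the metrics being compared: on binary (present/absent) preference data, the Tanimoto coefficient reduces to the Jaccard index, |A∩B| / |A∪B|. A minimal sketch in plain Python (illustrative only, not Mahout's TanimotoCoefficientSimilarity API):

```python
def jaccard(a, b):
    """Jaccard index between two item sets; for binary data this
    equals the Tanimoto coefficient."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Two users who share 2 of 4 distinct items:
print(jaccard({"x", "y", "z"}, {"y", "z", "w"}))  # → 0.5
```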

Re: Clustering points in a unit hypercube

2012-12-05 Thread Ted Dunning
Still not that odd if several clusters are getting squashed. This can happen if the threshold increases too high or if the searcher is unable to resolve the cube properly. By its nature, the cube is hard to reduce to a smaller dimension. On Thu, Dec 6, 2012 at 12:36 AM, Dan Filimon wrote: > But

Re: Clustering points in a unit hypercube

2012-12-05 Thread Dan Filimon
But the weight referred to is the distance between a centroid and the mean of a distribution (a cube vertex). This should still be very small (also BallKMeans gets it right). On Thu, Dec 6, 2012 at 1:32 AM, Ted Dunning wrote: > IN order to succeed here, SKM will need to have maxClusters set to 2

Re: Clustering points in a unit hypercube

2012-12-05 Thread Ted Dunning
Ahh... this may also be a problem. You should get better results with a Brute searcher here, but a ProjectionSearcher with lots of projections may work well. On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon wrote: > So, yes, it's probably a bug of some kind since I end up with anywhere > between 400

Re: Clustering points in a unit hypercube

2012-12-05 Thread Ted Dunning
In order to succeed here, SKM will need to have maxClusters set to 20,000 or so. The maximum distance between clusters on a 10d hypercube is sqrt(10) = 3.1 or so. If three clusters get smashed together, then you have a threshold of 1.4 or so. On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon wrote:
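The arithmetic here is easy to check: opposite corners of a d-dimensional unit hypercube are sqrt(d) apart, and vertices differing in only two coordinates are sqrt(2) ≈ 1.4 apart, which is where the quoted thresholds come from. A quick sketch:

```python
import math

def vertex_distance(u, v):
    """Euclidean distance between two hypercube vertices."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d = 10
origin = (0.0,) * d
opposite = (1.0,) * d                       # all d coordinates flipped
two_flip = (1.0, 1.0) + (0.0,) * (d - 2)    # only two coordinates flipped

print(vertex_distance(origin, opposite))  # sqrt(10) ≈ 3.16
print(vertex_distance(origin, two_flip))  # sqrt(2)  ≈ 1.41
```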

Re: Clustering points in a unit hypercube

2012-12-05 Thread Dan Filimon
I wanted there to be 2^d clusters. I was wrong and didn't check: the radius is in fact 0.01. What's happening is that for 10 dimensions, I was expecting ~1024 clusters (or at least small distances) but StreamingKMeans fails on both counts. BallKMeans does in fact get the clusters. So, yes,
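The test setup being described (one tight Gaussian blob at each of the 2^d unit-hypercube vertices, radius 0.01) can be sketched like this; the function name and parameters are illustrative, not Mahout's test code:

```python
import itertools
import random

def hypercube_test_data(d=10, radius=0.01, points_per_vertex=5, seed=42):
    """One small Gaussian blob at each of the 2**d unit-hypercube vertices."""
    rng = random.Random(seed)
    points = []
    for vertex in itertools.product((0.0, 1.0), repeat=d):
        for _ in range(points_per_vertex):
            points.append([c + rng.gauss(0.0, radius) for c in vertex])
    return points

data = hypercube_test_data()
print(len(data))  # 2**10 vertices * 5 points each → 5120
```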

Re: Seeking advice on a classification problem (needle-in-haystack situation)

2012-12-05 Thread Ted Dunning
A two class classifier is much easier to get right than a many class classifier. The cascaded classifier is likely to avoid your problem. Downsampling the don't-cares will also likely help. When don't-cares dominate the data set, the classifier can decrease overall error rates by failing safe.
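Downsampling the don't-cares simply means randomly discarding most majority-class examples before training, so the classifier can no longer reduce overall error by always predicting don't-care. A minimal sketch (the labels and dict layout are made up for illustration):

```python
import random

def downsample_majority(examples, majority_label="DONT_CARE", keep=0.1, seed=0):
    """Keep every minority example; keep each majority example with prob `keep`."""
    rng = random.Random(seed)
    return [ex for ex in examples
            if ex["label"] != majority_label or rng.random() < keep]

data = [{"label": "DONT_CARE"}] * 900 + [{"label": "USEFUL"}] * 100
balanced = downsample_majority(data)
print(sum(ex["label"] == "USEFUL" for ex in balanced))  # → 100 (all kept)
```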

Re: Clustering points in a unit hypercube

2012-12-05 Thread Ted Dunning
How many clusters are you talking about? If you pick a modest number then streaming k-means should work well if it has several times more surrogate points than there are clusters. Also, typically a hyper-cube test works with very small cluster radius. Try 0.1 or 0.01. Otherwise, your clusters o

Re: Clustering points in a unit hypercube

2012-12-05 Thread Dan Filimon
Okay, please disregard the previous e-mail. That hypothesis is toast; clustering works just fine with ball k-means. So, the problem lies in streaming k-means somewhere. On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon wrote: > Hi, > > One of the most basic tests for streaming k-means (and k-means in

splitDataset

2012-12-05 Thread Pat Ferrel
does anyone know if mahout/examples/bin/factorize-movielens-1M.sh is still working? The CLI version of splitDataset is crashing in my build (latest trunk), even when just running "mahout splitDataset" to get the params. Wouldn't be the first time I mucked up a build, though.

Re: Seeking advice on a classification problem (needle-in-haystack situation)

2012-12-05 Thread Raman Srinivasan
Thanks for the responses. The cascading approach sounds quite interesting. My problem though is that many of the useful items ended up in the don't-care bucket, not that they were misclassified among the useful categories. So, even if I were to use a cascading approach I am afraid that many useful

Re: Seeking advice on a classification problem (needle-in-haystack situation)

2012-12-05 Thread Mohit Singh
I would try to do this: first, just check how the classifier is doing on the classes you care about. If, using those examples, you are getting good performance, then run this as a level-2 classifier. If the performance is not good, then I would first debug the issue there. First level: care, no

Re: Mahout Amazon EMR usage cost

2012-12-05 Thread Koobas
I am very happy to see that I started a lively thread. I am a newcomer to the field, so this is all very useful. Now yet another naive question. Ted is probably going to go ballistic ;) Assuming that simple overlap methods suck, is there still a metric that works better than others (i.e. Tanimoto vs. Jaccard vs something else)?

Re: Seeking advice on a classification problem (needle-in-haystack situation)

2012-12-05 Thread Ted Dunning
Try the cascaded model. Train the downstream model on data without the don't-care docs or train it on documents that actually get through the upstream model. On Wed, Dec 5, 2012 at 4:50 PM, Raman Srinivasan wrote: > I can exclude the "don't care" cases from the training set. However, the > real
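The cascade Ted describes is two models chained: an upstream binary filter (care vs. don't-care) and a downstream fine-grained classifier trained only on documents that get past the filter. The wiring itself is trivial; the classifiers below are stand-in callables for illustration, not Mahout classes:

```python
def cascade(upstream, downstream, doc):
    """Route doc through a binary filter, then a fine-grained classifier."""
    if upstream(doc) == "DONT_CARE":
        return "DONT_CARE"
    return downstream(doc)

# Stand-in models: filter by description length, then classify by keyword.
upstream = lambda doc: "CARE" if len(doc) > 5 else "DONT_CARE"
downstream = lambda doc: "PUMP" if "pump" in doc else "VALVE"

print(cascade(upstream, downstream, "intake pump assy"))  # → PUMP
print(cascade(upstream, downstream, "n/a"))               # → DONT_CARE
```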

Re: Seeking advice on a classification problem (needle-in-haystack situation)

2012-12-05 Thread Raman Srinivasan
I can exclude the "don't care" cases from the training set. However, the real data that I need to classify will contain mostly these useless descriptions which I would like the model to throw out (i.e., classify as "DON'T CARE"). If I only train with examples that are useful then how would the mode

Re: Seeking advice on a classification problem (needle-in-haystack situation)

2012-12-05 Thread Mohit Singh
May I ask why you are giving the don't-care examples to the algorithm? Can't you weed them out? Is adaptive LR the same as weighted LR, which is used when you have unbalanced training examples? On Wednesday, December 5, 2012, Raman Srinivasan wrote: > I am trying to classify a set of short text de

Re: Regarding Mahout Item Recommendation engine - numberformatexception on noninteger column (eg: ISBN (alphanumeric value))

2012-12-05 Thread Sean Owen
(This is more or less exactly what FileIDMigrator does.) On Wed, Dec 5, 2012 at 5:56 AM, Utkarsh Gupta wrote: > Before giving your data to Mahout you can create a mapping for the original IDs > and create a new int/long ID > I also faced the same problem and I used this code to solve it > You can co
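The suggestion above — map each alphanumeric ID (such as an ISBN) to a synthetic long before handing data to Mahout — can be done by hashing, which is the spirit of Mahout's ID migrators: derive a stable 64-bit key from the string, and keep a map back to the original so results can be translated. A comparable sketch in plain Python (an illustration, not the FileIDMigrator implementation):

```python
import hashlib

def string_to_long_id(s):
    """Derive a stable signed 64-bit ID from an arbitrary string ID
    by taking the first 8 bytes of an MD5 digest."""
    digest = hashlib.md5(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)

isbn = "0-14-044911-3"  # hypothetical alphanumeric ISBN
print(string_to_long_id(isbn) == string_to_long_id(isbn))  # → True (stable)
```

In practice you would also store a `long → original string` reverse map, since the recommender's output has to be translated back to ISBNs.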

Re: Mahout Amazon EMR usage cost

2012-12-05 Thread Paulo Villegas
I don't disagree at all with what you're saying. I never said (or intended to say) that explanations would have to be a thorough dump of the engine's internal computation; this would not make sense to the user and would just overwhelm him. Picking up a couple of representative items would be mo