You mean you want to classify a large dataset?
The partial implementation is useful when the training dataset is too large
to fit in memory. If it does fit, then you had better train the forest using
the in-memory implementation.
If you want to classify a large number of rows, then you can add the
par
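Independent of the Mahout specifics, classifying many rows in parallel usually comes down to splitting them into chunks and fanning the work out over a pool. A hypothetical sketch; `classify_row` is a placeholder decision rule, not a Mahout API:

```python
from concurrent.futures import ThreadPoolExecutor

def classify_row(row):
    # Placeholder rule; a real forest would vote over its trees.
    return 1 if sum(row) > 0 else 0

def classify_rows(rows, workers=4):
    # Classify rows concurrently; order of results matches order of input.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(classify_row, rows))
```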
Hi,
I'm working on improving classification throughput for a decision forest.
I was wondering about the use case for Partial Implementation.
The quick start guide suggests that the Partial Implementation is designed
for building a forest on large datasets.
My problem is classification after training
On Wed, Dec 5, 2012 at 7:03 PM, Ted Dunning wrote:
> On Wed, Dec 5, 2012 at 5:29 PM, Koobas wrote:
>
> > ...
> > Now yet another naive question.
> > Ted is probably going to go ballistic ;)
> >
>
> I hope not.
>
>
> > Assuming that simple overlap methods suck,
> > is there still a metric that wo
On Wed, Dec 5, 2012 at 5:29 PM, Koobas wrote:
> ...
> Now yet another naive question.
> Ted is probably going to go ballistic ;)
>
I hope not.
> Assuming that simple overlap methods suck,
> is there still a metric that works better than others
> (i.e. Tanimoto vs. Jaccard vs something else)?
>
Still not that odd if several clusters are getting squashed. This can
happen if the threshold gets too high or if the searcher is unable to
resolve the cube properly. By its nature, the cube is hard to reduce to a
smaller dimension.
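On the metric question quoted above: for binary (0/1) data, the Tanimoto coefficient reduces to the Jaccard index, so the two names often refer to the same measure. A minimal sketch of both:

```python
def jaccard(a, b):
    # Jaccard index of two sets: |A ∩ B| / |A ∪ B|.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def tanimoto(x, y):
    # Tanimoto coefficient of real vectors: x·y / (|x|^2 + |y|^2 - x·y).
    dot = sum(p * q for p, q in zip(x, y))
    return dot / (sum(p * p for p in x) + sum(q * q for q in y) - dot)
```

For 0/1 indicator vectors, `tanimoto` gives exactly the Jaccard index of the corresponding sets.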
On Thu, Dec 6, 2012 at 12:36 AM, Dan Filimon wrote:
> But
But the weight referred to is the distance between a centroid and the
mean of a distribution (a cube vertex).
This should still be very small (also BallKMeans gets it right).
On Thu, Dec 6, 2012 at 1:32 AM, Ted Dunning wrote:
> In order to succeed here, SKM will need to have maxClusters set to 2
Ahh... this may also be a problem.
You should get better results with a Brute searcher here, but a
ProjectionSearcher with lots of projections may work well.
On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon wrote:
> So, yes, it's probably a bug of some kind since I end up with anywhere
> between 400
In order to succeed here, SKM will need to have maxClusters set to 20,000
or so.
The maximum distance between clusters on a 10-d hypercube is sqrt(10) ≈ 3.16.
If three clusters get smashed together, then you have a threshold
of 1.4 or so.
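The sqrt(10) figure is just the distance between opposite corners of the unit hypercube, which can be checked directly:

```python
from math import dist, sqrt

# The largest pairwise distance between vertices of the unit 10-d hypercube
# is between opposite corners, e.g. all-zeros and all-ones: sqrt(10) ≈ 3.16.
d = 10
corner_a = [0.0] * d
corner_b = [1.0] * d
assert abs(dist(corner_a, corner_b) - sqrt(10)) < 1e-12
```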
On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon wrote:
I wanted there to be 2^d clusters. I was wrong and didn't check: the
radius is in fact 0.01.
What's happening is that for 10 dimensions, I was expecting ~1024
clusters (or at least small distances), but StreamingKMeans fails
on both counts.
BallKMeans does in fact get the clusters.
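The test setup described here, one tight Gaussian cluster around each of the 2^d hypercube vertices, can be sketched as synthetic data generation. The function name and defaults below are illustrative, not taken from the actual test:

```python
import random
from itertools import product

def hypercube_clusters(d=10, radius=0.01, points_per_cluster=5, seed=42):
    # One small Gaussian cluster centred on each of the 2^d vertices
    # of the unit d-dimensional hypercube.
    rng = random.Random(seed)
    data = []
    for vertex in product((0.0, 1.0), repeat=d):
        for _ in range(points_per_cluster):
            data.append([v + rng.gauss(0.0, radius) for v in vertex])
    return data
```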
So, yes,
A two class classifier is much easier to get right than a many class
classifier.
The cascaded classifier is likely to avoid your problem.
Downsampling the don't-cares will also likely help. When don't-cares
dominate the data set, the classifier can decrease overall error rates by
failing safe.
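Downsampling the don't-cares can be as simple as keeping only a fraction of that class before training. A hedged sketch; the label string and the keep ratio are assumptions to tune for your data:

```python
import random

def downsample(examples, label="DONT_CARE", keep_ratio=0.1, seed=0):
    # Keep every example from the classes of interest, but only a random
    # keep_ratio fraction of the dominant don't-care class.
    rng = random.Random(seed)
    return [(x, y) for x, y in examples
            if y != label or rng.random() < keep_ratio]
```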
How many clusters are you talking about?
If you pick a modest number then streaming k-means should work well if it
has several times more surrogate points than there are clusters.
Also, a hyper-cube test typically works with a very small cluster radius.
Try 0.1 or 0.01. Otherwise, your clusters o
Okay, please disregard the previous e-mail.
That hypothesis is toast; clustering works just fine with ball k-means.
So, the problem lies in streaming k-means somewhere.
On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon
wrote:
> Hi,
>
> One of the most basic tests for streaming k-means (and k-means in
Does anyone know if mahout/examples/bin/factorize-movielens-1M.sh is still
working? The CLI version of splitDataset is crashing in my build (latest trunk),
even when run as "mahout splitDataset" just to get the params. Wouldn't be the
first time I mucked up a build, though.
Thanks for the responses. The cascading approach sounds quite interesting.
My problem though is that many of the useful items ended up in the
don't-care bucket, not that they were misclassified among the useful
categories. So, even if I were to use a cascading approach, I am afraid that
many useful
I would try to do this:
First, just check how the classifier is doing on the classes you care
about, using those examples. If you are getting good performance, then run
this as a level-2 classifier. If the performance is not good, then I would
first debug the issue there.
First level... care, no
I am very happy to see that I started a lively thread.
I am a newcomer to the field, so this is all very useful.
Now yet another naive question.
Ted is probably going to go ballistic ;)
Assuming that simple overlap methods suck,
is there still a metric that works better than others
(i.e. Tanimoto
Try the cascaded model. Train the downstream model on data without the
don't-care docs or train it on documents that actually get through the
upstream model.
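The cascade amounts to plain composition: a binary care/don't-care gate in front of the fine-grained classifier. Both stage functions below are toy stand-ins, not trained models:

```python
def cascade(stage1, stage2, doc):
    # Stage 1 decides care vs. don't-care; only documents that pass
    # reach the downstream (stage 2) classifier.
    if stage1(doc) == "DONT_CARE":
        return "DONT_CARE"
    return stage2(doc)

# Toy stand-in classifiers for illustration only:
stage1 = lambda doc: "DONT_CARE" if "spam" in doc else "CARE"
stage2 = lambda doc: "sports" if "game" in doc else "news"
```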
On Wed, Dec 5, 2012 at 4:50 PM, Raman Srinivasan wrote:
> I can exclude the "don't care" cases from the training set. However, the
> real
I can exclude the "don't care" cases from the training set. However, the
real data that I need to classify will contain mostly these useless
descriptions which I would like the model to throw out (i.e., classify as
"DON'T CARE"). If I only train with examples that are useful, then how would
the model
May I ask why you are giving the don't-care examples to the algorithm?
Can't you weed them out?
Is adaptive LR the same as weighted LR, which is used when you have
unbalanced training examples?
On Wednesday, December 5, 2012, Raman Srinivasan
wrote:
> I am trying to classify a set of short text de
(This is more or less exactly what FileIDMigrator does.)
On Wed, Dec 5, 2012 at 5:56 AM, Utkarsh Gupta wrote:
> Before giving your data to mahout you can create a mapping for original IDs
> and create a new int/long ID
> I also faced the same problem and I used this code to solve it.
> You can co
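The general idea behind an ID migrator is to derive a stable long from each string ID by hashing. Mahout's migrators use MD5 for this, but the exact byte handling in this sketch is an assumption:

```python
import hashlib

def to_long_id(string_id):
    # Hash the string ID and take the first 8 bytes of the digest as a
    # signed 64-bit integer, giving a deterministic string -> long mapping.
    digest = hashlib.md5(string_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)
```

The mapping is one-way; to go back from long to string you keep a reverse dictionary (which is what storing the mapping in a file buys you).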
I don't disagree at all with what you're saying. I never said (or
intended to say) that explanations would have to be a thorough dump of
the engine's internal computation; this would not make sense to the user
and would just overwhelm him. Picking up a couple of representative
items would be mo