Hi Simon,

That looks like an error from the seq2sparse job you're using to
vectorize the corpus.
Getting an error at the vectorization stage is surprising to me, but
others more experienced than I am should probably comment. :)

The line numbers don't match what I have in my version of Mahout (a
forked version of trunk).

If I'm not mistaken, there should be an "inner" exception thrown by a
mapper or reducer that tells us more. Can you please look through the
error logs and see if there's anything else?
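The top-level IllegalStateException usually hides the real cause in the per-task logs. A rough sketch of how I'd dig it out on a default Hadoop 1.x layout (the log directory path is an assumption; adjust it for your install):

```shell
# Search the per-task attempt logs for the root cause that the
# "Job failed!" wrapper swallows. "Caused by" lines usually point
# at the actual mapper/reducer failure.
grep -r -A 5 "Caused by" "${HADOOP_LOG_DIR:-$HADOOP_HOME/logs}/userlogs/"
```

The JobTracker web UI (port 50030 by default) shows the same per-attempt stack traces if you'd rather click through.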

As a side note, I'm clustering the 20 newsgroups data set (~20K
documents, ~20 MB in total) and it's working fine.
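For reference, this is roughly the pipeline I run on 20 newsgroups; the paths, cluster count, and distance measure are placeholders, not a recipe for your data:

```shell
# Convert a directory of text files into a SequenceFile
mahout seqdirectory -i 20news-all -o 20news-seq -c UTF-8

# Vectorize (this is the seq2sparse step that's failing for you)
mahout seq2sparse -i 20news-seq -o 20news-vectors -nv

# Cluster the TF-IDF vectors with k-means
mahout kmeans -i 20news-vectors/tfidf-vectors \
  -c 20news-centroids -o 20news-kmeans -k 20 \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -x 10 -cl
```

If this sequence works for you on a small sample of the Enron files but fails at ~175K of them, that would at least narrow it down to a scale issue rather than a setup issue.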

Thanks!
Dan

On Sat, Mar 9, 2013 at 5:44 PM,  <[email protected]> wrote:
> Hi there,
>
> I am doing a fairly silly experiment to measure Hadoop performance. As part 
> of this I have extracted emails from the Enron database and I am clustering 
> them using a proprietary method for clustering short messages (i.e. tweets, 
> emails, SMSs) and benchmarking clusters in various configurations.
>
> As part of this I have been benchmarking a single processing machine (my new 
> laptop). This is an HP EliteBook with 32 GB RAM, SSDs, nice processors etc. 
> etc. The point is that when explaining to people that we need Hadoop I can 
> show them that a laptop is really, really useless and likely to remain so (I 
> know this is obvious; come and work in a corporate and find out what else you 
> have to do to earn a living! Then tell me that I am silly!)
>
> Anyhooo... I have seen reasonable behaviour from the algorithms I have 
> built (i.e. for very small data MapReduce puts an overhead on the processing, 
> but once you get reasonably large the parallelism wins), but when I try 
> Mahout's k-means I get an odd behaviour.
>
> When I get to ~175K individual files / ~175 MB of input data I get an exception:
>
> Exception in thread "main" java.lang.IllegalStateException: Job failed!
>         at 
> org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
>         at 
> org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
>         at 
> org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>
> Is this because I am entirely inept and have missed something, or is this 
> because of a limitation on Mahout sequence files, due to them not being aimed 
> at loads of short messages that really can't be clustered anyway because 
> they have no information in them? Hell.
>
> Simon
>
>
>
> ----
> Dr. Simon Thompson
