Benoit, could you also paste us the output of `hadoop fs -ls /path/to/your/docterm_matrix/part-*`? CVB's map-side parallelism benefits from an even distribution of doc-term vectors across your input part files.
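
If it helps, here is a rough Python sketch for eyeballing how evenly the part files are sized from a `hadoop fs -ls`-style listing. It assumes the usual eight-column layout (permissions, replication, owner, group, size, date, time, path), and it uses file size in bytes only as a proxy for the number of vectors per file. The function names `part_file_sizes` and `skew` are made up for this illustration and are not part of Mahout or Hadoop.

```python
def part_file_sizes(ls_output):
    """Return {path: size_in_bytes} for part-* files found in a listing.

    Assumes each file line has at least eight whitespace-separated
    columns, with the size in the fifth column and the path in the last.
    """
    sizes = {}
    for line in ls_output.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skips headers such as "Found N items" and blank lines
        path = fields[-1]
        if not path.rsplit("/", 1)[-1].startswith("part-"):
            continue  # ignore _SUCCESS, _logs and other non-part entries
        try:
            sizes[path] = int(fields[4])
        except ValueError:
            continue  # size column was not a number; layout assumption failed
    return sizes


def skew(sizes):
    """Ratio of the largest part file to the mean size (1.0 = perfectly even)."""
    if not sizes:
        return 0.0
    values = list(sizes.values())
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    return max(values) / mean
```

Feeding the listing text to `part_file_sizes` and then calling `skew` on the result gives a single number: a value well above 1.0 would suggest that a few part files hold most of the data, so a few map tasks would be doing most of the work.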

On Mon, Mar 4, 2013 at 8:34 AM, Jake Mannix <[email protected]> wrote:

> Can you send us your command line args? Is that for 1 iteration ? That
> would be very very slow
>
> On Monday, March 4, 2013, Benoit Mathieu wrote:
>
> > Hi mahout users,
> >
> > I'd like to run the mahout Latent Dirichlet Allocation algorithm (mahout
> > cvb) on my own data. I have about 1M "documents" and a vocabulary of 30k
> > "terms". Documents are very sparse, each of them contains only 100 terms.
> > I'd like to extract "topics" from that.
> >
> > I have generated mahout vectors from my data using a simple java program,
> > and using RandomAccessSparseVector.
> >
> > I successfully launched the "mahout cvb with" job with num_topics=200, but
> > the job seems very slow: 70 running map tasks took 10mn to process about
> > 25000 documents on my cluster.
> >
> > So my questions are:
> > - Does this job require specific Vector class for good performance ?
> > - Is LDA algorithm suitable to process 1M docs with a dictionary of 30k
> > terms ?
> >
> > Thanks for any insights.
> >
> > ++
> > benoit
> >
>
> --
>
>   -jake
