Re: LDA on a large dataset

2015-07-20 Thread Feynman Liang
LDAOptimizer.scala:421 collects a numTopics by vocabSize matrix of summary statistics to the driver. I suspect that this is what's causing the failure. One thing you could try is decreasing the vocabulary size. One possibility would be to use a HashingTF if you don't mind dimension reduction via h
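
To make the vocabulary-size suggestion concrete, here is a minimal, Spark-free sketch of the hashing trick and of how the numTopics by vocabSize matrix shrinks with it. Everything in it is an assumption made for illustration: the helper names, the use of CRC32 as the hash (Spark's HashingTF uses its own hash function), the 8 bytes per matrix entry, and the example figures of 100 topics and a 10-million-term vocabulary, none of which come from the thread.

```python
import zlib


def hashed_term_counts(tokens, num_features):
    """Count tokens after mapping each one to one of num_features buckets.

    Hypothetical helper: CRC32 stands in for whatever hash a real
    implementation uses. Distinct tokens may collide in one bucket.
    """
    counts = {}
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0) + 1
    return counts


def topic_matrix_bytes(num_topics, vocab_size, bytes_per_value=8):
    """Rough size of a dense num_topics x vocab_size matrix.

    Assumes one 8-byte double per entry and ignores object overhead.
    """
    return num_topics * vocab_size * bytes_per_value


# Illustrative numbers only, not taken from the thread.
num_topics = 100
full_vocab = 10_000_000
hashed_vocab = 1 << 18  # 262,144 buckets

doc = "spark lda topic model spark".split()
counts = hashed_term_counts(doc, hashed_vocab)

full_bytes = topic_matrix_bytes(num_topics, full_vocab)
hashed_bytes = topic_matrix_bytes(num_topics, hashed_vocab)

print("buckets used by the sample document:", len(counts))
print("full vocabulary matrix, bytes:", full_bytes)
print("hashed vocabulary matrix, bytes:", hashed_bytes)
```

The trade-off is the one the message hints at: fewer buckets mean a smaller matrix on the driver, but hash collisions merge unrelated terms and buckets can no longer be mapped back to words when inspecting topics.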

LDA on a large dataset

2015-07-20 Thread Peter Zvirinsky
Hello, I'm trying to run LDA on a relatively large dataset (100-200 GB), but with no luck so far. At first I made sure that the executors have enough memory with respect to the vocabulary size and number of topics. After that I ran LDA with the default EMLDAOptimizer, but learning failed after a