LDAOptimizer.scala:421 collects a numTopics-by-vocabSize matrix of summary
statistics to the driver. I suspect that this is what's causing the failure.
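
To get a rough feel for how big that matrix can get, here is a back-of-the-envelope sketch in Python. It assumes the matrix is dense and stores 8-byte doubles; that is an assumption on my part, and the function names and example sizes are made up for illustration.

```python
def dense_matrix_bytes(num_topics: int, vocab_size: int, bytes_per_entry: int = 8) -> int:
    """Bytes needed to hold a dense num_topics x vocab_size matrix.

    bytes_per_entry defaults to 8, assuming double-precision entries.
    """
    return num_topics * vocab_size * bytes_per_entry


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical (num_topics, vocab_size) pairs, not taken from the thread.
    for num_topics, vocab_size in [(100, 1_000_000), (1_000, 10_000_000)]:
        size = dense_matrix_bytes(num_topics, vocab_size)
        print(f"{num_topics} topics x {vocab_size} terms: {to_gib(size):.2f} GiB")
```

If the estimate is close to or above the driver's heap, that would be consistent with the collect being the problem.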
One thing you could try is decreasing the vocabulary size. One
possibility would be to use HashingTF if you don't mind dimension
reduction via h
Hello,
I'm trying to run LDA on a relatively large dataset (100-200 GB), but
with no luck so far.
First, I made sure that the executors have enough memory for the
vocabulary size and number of topics.
After that I ran LDA with the default EMLDAOptimizer, but learning failed after
a