Hello,

I'm trying to run LDA on a relatively large dataset (100-200 GB), but with no luck so far.

First, I made sure that the executors have enough memory with respect to the vocabulary size and the number of topics. I then ran LDA with the default EMLDAOptimizer, but learning failed after a few iterations because the application master ran out of disk space: the learning job used up all of the space available in the application master's usercache (approx. 100 GB). I noticed that this implementation uses some sort of checkpointing, so I made sure it was not used, but that didn't help.

Afterwards I tried the OnlineLDAOptimizer, but it started failing at "reduce at LDAOptimizer.scala:421" with the error message: "Total size of serialized results of X tasks (Y GB) is bigger than spark.driver.maxResultSize (Y GB)". I kept increasing spark.driver.maxResultSize to tens of GB, but that didn't help either; it only delayed the error. I also tried setting the batch size to very small values so that it would surely fit into memory, but that didn't help at all.

Does anyone have experience with learning LDA on a dataset of this size? Any ideas about what might be wrong?

I'm using Spark 1.4.0 in yarn-client mode. I managed to learn a word2vec model on the same dataset with no problems at all.

Thanks,
Peter
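
P.S. For reference, this is roughly how the online run is set up (a minimal sketch; corpus and numTopics stand in for my actual inputs, and the mini-batch fraction shown is just an example of the very small values I tried):

import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// corpus: RDD[(Long, Vector)] of (document ID, term-count vector) -- placeholder name
// numTopics: the number of topics, sized against the executor memory as described above
val lda = new LDA()
  .setK(numTopics)
  .setMaxIterations(100)
  .setOptimizer(new OnlineLDAOptimizer()
    .setMiniBatchFraction(0.001))  // very small mini-batches, as mentioned above

// spark.driver.maxResultSize was raised on submit, e.g.:
//   --conf spark.driver.maxResultSize=20g
val model = lda.run(corpus)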
At first I made sure that the executors have enough memory with respect to the vocabulary size and number of topics. After that I ran LDA with default EMLDAOptimizer, but learning failed after a few iteration, because the application master ran out of disk. The learning job used all space available in the usercache of the application master (cca. 100G). I noticed that this implementation uses some sort of checkopointing so I made sure it is not used, but it didn't help. Afterwards, I tried the OnlineLDAOptimizer, but it started failing at "reduce at LDAOptimizer.scala:421" with error message: "Total size of serialized results of X tasks (Y GB) is bigger than spark.driver.maxResultSize (Y GB)". I kept increasing the spark.driver.maxResultSize to tens of GB but it didn't help, just delayed this error. I tried to adjust the batch size to very small values so that I was sure it must fit into memory, but this didn't help at all. Has anyone experience with learning LDA on such a dataset? Maybe some ideas what might be wrong? I'm using spark 1.4.0 in yarn-client mode. I managed to learn a word2vec model on the same dataset with no problems at all. Thanks, Peter