Hello,

I'm trying to run LDA on a relatively large dataset (roughly 100-200 GB),
but with no luck so far.

First, I made sure that the executors have enough memory given the
vocabulary size and the number of topics.
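
For reference, the job is configured roughly like this (the sizes and
executor counts below are placeholders, not my exact values):

    import org.apache.spark.{SparkConf, SparkContext}

    // placeholder resource settings, sized from vocabulary size x number of topics
    val conf = new SparkConf()
      .setAppName("lda-training")
      .set("spark.executor.memory", "16g")      // placeholder
      .set("spark.executor.instances", "50")    // placeholder
    val sc = new SparkContext(conf)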

I then ran LDA with the default EMLDAOptimizer, but training failed after
a few iterations because the application master ran out of disk space: the
job used up all of the space available in the application master's
usercache (approx. 100 GB). I noticed that this implementation uses some
sort of checkpointing, so I made sure it is not used, but that didn't help.
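
For concreteness, the EM run looks roughly like this (the topic count, the
iteration count and the way I turn off checkpointing are sketched from
memory, not copied verbatim):

    import org.apache.spark.mllib.clustering.{DistributedLDAModel, EMLDAOptimizer, LDA}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // corpus: RDD of (document id, term-count vector) built from the raw data
    def trainEmLda(corpus: RDD[(Long, Vector)], k: Int): DistributedLDAModel = {
      new LDA()
        .setK(k)                            // number of topics
        .setMaxIterations(50)               // placeholder value
        .setOptimizer(new EMLDAOptimizer())
        .setCheckpointInterval(-1)          // trying to keep checkpointing out of the
                                            // picture; I also never call sc.setCheckpointDir
        .run(corpus)
        .asInstanceOf[DistributedLDAModel]  // EM produces a DistributedLDAModel
    }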

Afterwards I tried the OnlineLDAOptimizer, but it started failing at
"reduce at LDAOptimizer.scala:421" with the error message: "Total size of
serialized results of X tasks (Y GB) is bigger than
spark.driver.maxResultSize (Y GB)". I kept increasing
spark.driver.maxResultSize to tens of GB, but that only delayed the error.
I also tried reducing the batch size to very small values that should
comfortably fit into memory, but that didn't help at all.
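
The online run is essentially the same setup with a different optimizer,
roughly like this (the mini-batch fraction below is just an example of the
"very small" values I tried):

    import org.apache.spark.mllib.clustering.{LDA, LDAModel, OnlineLDAOptimizer}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    def trainOnlineLda(corpus: RDD[(Long, Vector)], k: Int): LDAModel = {
      val optimizer = new OnlineLDAOptimizer()
        .setMiniBatchFraction(0.001)        // example of a very small batch size
      new LDA()
        .setK(k)
        .setMaxIterations(100)              // placeholder value
        .setOptimizer(optimizer)
        .run(corpus)                        // online optimizer yields a LocalLDAModel
    }

spark.driver.maxResultSize is set on the SparkConf (equivalently via --conf
on spark-submit) before the context is created, and I raised it between
attempts.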

Does anyone have experience with training LDA on a dataset of this size?
Any ideas about what might be going wrong?

I'm using Spark 1.4.0 in yarn-client mode. I managed to train a word2vec
model on the same dataset with no problems at all.

Thanks,
Peter
