The C implementation of Word2Vec updates the model from multiple threads without locking, which is hard to replicate in a distributed setting. In the MLlib implementation, each worker holds the entire model in memory and outputs the part of the model that gets updated; the driver then collects and aggregates those updates. So not only the driver but also every worker needs enough memory to hold the full model. You can try reducing the vector size and setting a higher min frequency to make the model smaller. If you have good ideas about how to improve the current implementation, please create a JIRA. -Xiangrui
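For reference, a rough sketch of shrinking the model along those lines. This assumes a Spark build whose mllib Word2Vec exposes setMinCount alongside setVectorSize; the specific sizes and counts below are illustrative, not recommendations:

    import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
    import org.apache.spark.rdd.RDD

    // The model is held as two Array[Float]s of vocabSize * vectorSize each
    // (syn0Global and syn1Global), materialized on the driver and on every
    // worker, so the rough per-JVM footprint is:
    //   2 * vocabSize * vectorSize * 4 bytes
    // e.g. vocabSize = 1,000,000 at vectorSize = 100 is ~800 MB per JVM.

    def trainSmallerModel(corpus: RDD[Seq[String]]): Word2VecModel =
      new Word2Vec()
        .setVectorSize(50) // smaller vectors shrink both arrays linearly
        .setMinCount(25)   // dropping rare words shrinks vocabSize
        .fit(corpus)

Both knobs trade model quality for memory, so it is worth checking the resulting vocabulary size and vector quality before committing to them.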
On Thu, Feb 5, 2015 at 1:49 PM, Alex Minnaar <aminn...@verticalscope.com> wrote:
> I was wondering if there was any chance of getting a more distributed
> word2vec implementation. I seem to be running out of memory from big local
> data structures such as
>
> val syn1Global = new Array[Float](vocabSize * vectorSize)
>
> Is there any chance of getting a version where these are all put in RDDs?
>
> Thanks