The C implementation of Word2Vec updates the model from multiple threads
without locking, which is hard to reproduce in a distributed setting. In
the MLlib implementation, each worker holds the entire model in memory
and outputs the part of the model that gets updated. The driver still
needs to collect and aggregate the model updates, so not only the driver
but also every worker must have enough memory to hold the full model. You
can try reducing the vector size and setting a higher minimum word
frequency to make the model smaller (see the sketch below). If you have
good ideas about how to improve the current implementation, please
create a JIRA. -Xiangrui
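
For a rough sense of scale: each float array costs vocabSize * vectorSize
* 4 bytes, and Word2Vec keeps two of them (syn0Global and syn1Global), so
a 1M-word vocabulary with 300-dimensional vectors needs about
2 * 1,000,000 * 300 * 4 bytes ≈ 2.4 GB per JVM. A minimal sketch of
shrinking the model with the mllib setters (the RDD name `corpus` is just
a placeholder for your tokenized input):

import org.apache.spark.mllib.feature.Word2Vec

// corpus: RDD[Seq[String]] of tokenized sentences (placeholder name)
val model = new Word2Vec()
  .setVectorSize(100) // smaller vectors -> smaller syn0/syn1 arrays
  .setMinCount(10)    // drop rare words to shrink the vocabulary
  .fit(corpus)

Halving the vector size or raising minCount shrinks both arrays linearly
in vocabSize * vectorSize, on the driver and on every worker.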

On Thu, Feb 5, 2015 at 1:49 PM, Alex Minnaar <aminn...@verticalscope.com> wrote:
> I was wondering if there was any chance of getting a more distributed
> word2vec implementation.  I seem to be running out of memory from big local
> data structures such as
>
> val syn1Global = new Array[Float](vocabSize * vectorSize)
>
>
> Is there any chance of getting a version where these are all put in RDDs?
>
>
> Thanks,
