We are working on import/export for MLlib models. The umbrella JIRA is
https://issues.apache.org/jira/browse/SPARK-4587. In 1.3, we are going
to have save/load for linear models, naive Bayes, ALS, and tree
models. I created a JIRA for Word2Vec and set the target version to
1.4. If anyone is interested in working on it, please ping me on the
JIRA. -Xiangrui
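
For anyone curious what this will look like, the expected usage is roughly the
following (a sketch only; the exact signatures may still change before 1.3 is
released, and the data/model paths are just placeholders):

    import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD}
    import org.apache.spark.mllib.util.MLUtils

    // Train a small model, persist it, and load it back later
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    val model = LogisticRegressionWithSGD.train(data, numIterations = 10)
    model.save(sc, "hdfs:///tmp/lr-model")
    val sameModel = LogisticRegressionModel.load(sc, "hdfs:///tmp/lr-model")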

On Thu, Feb 5, 2015 at 9:11 AM, Carsten Schnober
<schno...@ukp.informatik.tu-darmstadt.de> wrote:
> As a Spark newbie, I've come across this thread. I'm playing with Word2Vec in
> our Hadoop cluster and here's my issue with classic Java serialization of
> the model: I don't have SSH access to the cluster master node.
> Here's my code for computing the model:
>
>     import java.io.{FileOutputStream, ObjectOutputStream}
>     import org.apache.spark.mllib.feature.Word2Vec
>
>     // Train a Word2Vec model on a whitespace-tokenised text file
>     val input = sc.textFile("README.md").map(line => line.split(" ").toSeq)
>     val word2vec = new Word2Vec()
>     val model = word2vec.fit(input)
>
>     // Serialise the trained model to a local file on the driver
>     val oos = new ObjectOutputStream(new FileOutputStream(modelFile))
>     oos.writeObject(model)
>     oos.close()
>
> I can do that locally and get the file as desired. But that is of little use
> to me if the file is stored on the master, which I cannot access.
>
> I've alternatively serialized the vectors to HDFS using this code:
>
>     // Extract the word -> vector map and write it to HDFS as an object file
>     val vectors = model.getVectors                  // Map[String, Array[Float]]
>     val output = sc.parallelize(vectors.toSeq)      // RDD[(String, Array[Float])]
>     output.saveAsObjectFile(modelFile)
>
> Indeed, this writes the serialized vectors to HDFS, where I can access them as
> a regular user. However, I have not figured out how to construct a new
> Word2VecModel object from those files.
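>
> In the meantime, I can at least read the vectors back from the object file
> and do similarity lookups by hand. Here is a rough sketch (it assumes the
> whole vocabulary fits into driver memory and hand-rolls the cosine
> similarity that Word2VecModel.findSynonyms would normally provide):
>
>     // Load the serialized (word, vector) pairs back from HDFS into the driver
>     val loaded = sc.objectFile[(String, Array[Float])](modelFile).collectAsMap()
>
>     // Cosine similarity between two vectors
>     def cosine(a: Array[Float], b: Array[Float]): Double = {
>       val dot = a.zip(b).map { case (x, y) => x * y }.sum
>       val normA = math.sqrt(a.map(x => x * x).sum)
>       val normB = math.sqrt(b.map(x => x * x).sum)
>       dot / (normA * normB)
>     }
>
>     // Five most similar words to "Spark", if it made it into the vocabulary
>     loaded.get("Spark").foreach { vec =>
>       loaded.toSeq
>         .filter { case (word, _) => word != "Spark" }
>         .map { case (word, v) => (word, cosine(vec, v)) }
>         .sortBy(-_._2)
>         .take(5)
>         .foreach(println)
>     }
>
> But that is a workaround rather than a proper Word2VecModel.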
>
> Any clues?
> Thanks!
> Carsten
>
>
>
> MLnick wrote
>> Currently, as far as I can see, the word2vec model is collected onto the
>> master, so the model itself is not distributed.
>>
>> I guess the question is: why do you need a distributed model? Is the vocab
>> size so large that it's necessary? For model serving in general, unless the
>> model is truly massive (i.e. cannot fit into memory on a modern high-end box
>> with 64 or 128 GB of RAM), a single instance is far faster and simpler;
>> using a cluster of machines is more for load balancing / fault tolerance.
>>
>> What is your use case for model serving?
>>
>> —
>> Sent from Mailbox
>>
>> On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh <duy.huynh.uiv@...> wrote:
>>
>>> You're right, serialization works.
>>> What is your suggestion for saving a "distributed" model, where part of the
>>> model lives in one cluster and other parts live in other clusters? At
>>> runtime these sub-models run independently in their own clusters (load,
>>> train, save), and at some point they merge into the master model, which
>>> also loads, trains, and saves at the master level.
>>> Much appreciated.
>>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks <evan.sparks@...> wrote:
>>>> There's some work going on to support PMML -
>>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it hasn't yet been
>>>> merged into master.
>>>>
>>>> What are you used to doing in other environments? In R I'm used to running
>>>> save(), and the same in MATLAB. In Python, either pickling things or
>>>> dumping to JSON seems pretty common (even the scikit-learn docs recommend
>>>> pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
>>>> These all seem basically equivalent to Java serialization to me.
>>>>
>>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>>> something) make sense to add?
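>>>>
>>>> Something along these lines, maybe? (Just a sketch; mllib.util.modelpersistence
>>>> does not exist today, and these helper names are purely hypothetical wrappers
>>>> around plain Java serialization.)
>>>>
>>>>     import java.io._
>>>>
>>>>     // Hypothetical helpers wrapping Java serialization for Serializable models
>>>>     object ModelPersistence {
>>>>       def save[M <: Serializable](model: M, path: String): Unit = {
>>>>         val oos = new ObjectOutputStream(new FileOutputStream(path))
>>>>         try oos.writeObject(model) finally oos.close()
>>>>       }
>>>>
>>>>       def load[M](path: String): M = {
>>>>         val ois = new ObjectInputStream(new FileInputStream(path))
>>>>         try ois.readObject().asInstanceOf[M] finally ois.close()
>>>>       }
>>>>     }
>>>>
>>>>     // e.g. ModelPersistence.save(model, "/tmp/word2vec.bin")
>>>>     // val restored = ModelPersistence.load[Word2VecModel]("/tmp/word2vec.bin")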
>>>>
>>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh <duy.huynh.uiv@...> wrote:
>>>>
>>>>> That works. Is there a better way in Spark? This seems like the most
>>>>> common requirement for any machine learning work: being able to save your
>>>>> model after training it and load it later.
>>>>>
>>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks <evan.sparks@...> wrote:
>>>>>
>>>>>> Plain old Java serialization is one straightforward approach if you're
>>>>>> in Java/Scala.
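>>>>>>
>>>>>> Roughly like this (a sketch; Word2VecModel is Serializable, `model` is
>>>>>> assumed to be the trained model, and the file path is just an example):
>>>>>>
>>>>>>     import java.io._
>>>>>>     import org.apache.spark.mllib.feature.Word2VecModel
>>>>>>
>>>>>>     // Write the trained model out ...
>>>>>>     val out = new ObjectOutputStream(new FileOutputStream("word2vec.bin"))
>>>>>>     out.writeObject(model)
>>>>>>     out.close()
>>>>>>
>>>>>>     // ... and read it back later
>>>>>>     val in = new ObjectInputStream(new FileInputStream("word2vec.bin"))
>>>>>>     val restored = in.readObject().asInstanceOf[Word2VecModel]
>>>>>>     in.close()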
>>>>>>
>>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll <duy.huynh.uiv@...> wrote:
>>>>>>
>>>>>>> What is the best way to save an MLlib model that you just trained and
>>>>>>> reload it in the future? Specifically, I'm using the MLlib Word2Vec
>>>>>>> model...
>>>>>>> Thanks.
>>>>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
