We are working on import/export for MLlib models. The umbrella JIRA is
https://issues.apache.org/jira/browse/SPARK-4587. In 1.3, we are going to
have save/load for linear models, naive Bayes, ALS, and tree models. I
created a JIRA for Word2Vec and set the target version to 1.4. If anyone is
interested in working on it, please ping me on the JIRA.

-Xiangrui
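Until that built-in save/load lands, the plain Java serialization suggested
further down in this thread can be wrapped in a small helper. This is a
minimal sketch, not a real MLlib API: the name `ModelPersistence` is
illustrative (borrowed from the `mllib.util.modelpersistence` idea below),
and it writes to the local filesystem, not HDFS.

```scala
import java.io._

// Illustrative helper (not a real MLlib API): round-trips any
// Java-serializable object through plain Java serialization.
object ModelPersistence {
  def save[T](model: T, path: String): Unit = {
    val oos = new ObjectOutputStream(new FileOutputStream(path))
    try oos.writeObject(model) finally oos.close()
  }

  def load[T](path: String): T = {
    val ois = new ObjectInputStream(new FileInputStream(path))
    try ois.readObject().asInstanceOf[T] finally ois.close()
  }
}

// Round-trip a word-vector map shaped like Word2VecModel.getVectors output.
val vectors = Map("spark" -> Array(0.1f, 0.2f), "word2vec" -> Array(0.3f, 0.4f))
ModelPersistence.save(vectors, "/tmp/vectors.ser")
val restored = ModelPersistence.load[Map[String, Array[Float]]]("/tmp/vectors.ser")
```

Note this only helps when the writing and reading JVMs can see the same
filesystem path, which is exactly the limitation Carsten runs into below.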
On Thu, Feb 5, 2015 at 9:11 AM, Carsten Schnober
<schno...@ukp.informatik.tu-darmstadt.de> wrote:
> As a Spark newbie, I've come across this thread. I'm playing with Word2Vec
> on our Hadoop cluster, and here's my issue with classic Java serialization
> of the model: I don't have SSH access to the cluster master node.
>
> Here's my code for computing the model:
>
>     val input = sc.textFile("README.md").map(line => line.split(" ").toSeq)
>     val word2vec = new Word2Vec()
>     val model = word2vec.fit(input)
>     val oos = new ObjectOutputStream(new FileOutputStream(modelFile))
>     oos.writeObject(model)
>     oos.close()
>
> I can do that locally and get the file as desired. But that is of little
> use to me if the file is stored on the master.
>
> I've alternatively serialized the vectors to HDFS using this code:
>
>     val vectors = model.getVectors
>     val output = sc.parallelize(vectors.toSeq)
>     output.saveAsObjectFile(modelFile)
>
> Indeed, this results in a serialization on HDFS, so I can access it as a
> user. However, I have not figured out how to create a new Word2VecModel
> object from those files.
>
> Any clues?
> Thanks!
> Carsten
>
>
> MLnick wrote
>> Currently I see the word2vec model is collected onto the master, so the
>> model itself is not distributed.
>>
>> I guess the question is why do you need a distributed model? Is the vocab
>> size so large that it's necessary? For model serving in general, unless
>> the model is truly massive (i.e. cannot fit into memory on a modern
>> high-end box with 64 or 128 GB RAM), a single instance is way faster and
>> simpler (using a cluster of machines is more for load balancing / fault
>> tolerance).
>>
>> What is your use case for model serving?
>>
>> —
>> Sent from Mailbox
>>
>> On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh <duy.huynh.uiv@...> wrote:
>>
>>> you're right, serialization works.
>>> what is your suggestion on saving a "distributed" model? so part of the
>>> model is in one cluster, and some other parts of the model are in other
>>> clusters. during runtime, these sub-models run independently in their
>>> own clusters (load, train, save). and at some point during run time
>>> these sub-models merge into the master model, which also loads, trains,
>>> and saves at the master level.
>>> much appreciated.
>>>
>>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks <evan.sparks@...> wrote:
>>>
>>>> There's some work going on to support PMML -
>>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
>>>> been merged into master.
>>>>
>>>> What are you used to doing in other environments? In R I'm used to
>>>> running save(), same with matlab. In python either pickling things or
>>>> dumping to json seems pretty common. (even the scikit-learn docs
>>>> recommend pickling -
>>>> http://scikit-learn.org/stable/modules/model_persistence.html). These
>>>> all seem basically equivalent to java serialization to me..
>>>>
>>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>>> something) make sense to add?
>>>>
>>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh <duy.huynh.uiv@...> wrote:
>>>>
>>>>> that works. is there a better way in spark? this seems like the most
>>>>> common feature for any machine learning work - to be able to save
>>>>> your model after training it and load it later.
>>>>>
>>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks <evan.sparks@...>
>>>>> wrote:
>>>>>
>>>>>> Plain old java serialization is one straightforward approach if
>>>>>> you're in java/scala.
>>>>>>
>>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll <duy.huynh.uiv@...> wrote:
>>>>>>
>>>>>>> what is the best way to save an mllib model that you just trained
>>>>>>> and reload it in the future? specifically, i'm using the mllib
>>>>>>> word2vec model... thanks.
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329p21517.html

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
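Carsten's remaining question — getting a usable model back from the saved
vectors — can be worked around without the Word2VecModel class at all,
since synonym lookup is just cosine similarity over the vector map. A
minimal sketch, assuming the vectors have already been read back into a
local Map[String, Array[Float]] (the toy map below stands in for the
deserialized model.getVectors contents, e.g. collected from sc.objectFile),
and assuming the Word2VecModel constructor is not public in the Spark
version at hand (worth checking against your version):

```scala
// Cosine similarity between two word vectors.
def cosine(a: Array[Float], b: Array[Float]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val na = math.sqrt(a.map(x => x * x).sum)
  val nb = math.sqrt(b.map(x => x * x).sum)
  if (na == 0.0 || nb == 0.0) 0.0 else dot / (na * nb)
}

// Rough stand-in for Word2VecModel.findSynonyms, reimplemented over the
// raw map: rank all other words by cosine similarity to `word`.
def findSynonyms(vectors: Map[String, Array[Float]],
                 word: String, num: Int): Seq[(String, Double)] = {
  val v = vectors(word)
  vectors.toSeq
    .collect { case (w, u) if w != word => (w, cosine(v, u)) }
    .sortBy(-_._2)
    .take(num)
}

// Toy vectors standing in for the deserialized getVectors map.
val vectors = Map(
  "spark"  -> Array(1.0f, 0.0f),
  "hadoop" -> Array(0.8f, 0.6f),
  "cat"    -> Array(0.0f, 1.0f))
```

With these toy vectors, findSynonyms(vectors, "spark", 1) ranks "hadoop"
first. This is a serving-side shortcut only; it does not recover the
trained model for further fitting.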