Re: How To Save TF-IDF Model In PySpark

Jerry Lam Fri, 15 Jan 2016 16:51:51 -0800

Can you save it to parquet with the vector in one field?
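
A rough sketch of that idea, under a few assumptions: the fitted model is essentially one vector of IDF weights, and in PySpark that vector should be available as `model.idf().toArray().tolist()`. The sketch uses plain Python with JSON standing in for Parquet so it runs without a cluster; with a DataFrame, the same vector could go in a single column. The helper names (`fit_idf`, `save_weights`, `load_weights`, `apply_idf`) are made up for illustration, and the weight formula is the one the MLlib docs give for IDF, log((m + 1) / (df + 1)).

```python
import json
import math
import os
import tempfile


def fit_idf(docs):
    """Compute IDF weights as log((m + 1) / (df + 1)) per term (assumed MLlib formula)."""
    m = len(docs)
    n = len(docs[0])
    # Document frequency: number of documents in which each term occurs
    doc_freq = [sum(1 for d in docs if d[j] > 0) for j in range(n)]
    return [math.log((m + 1.0) / (df + 1.0)) for df in doc_freq]


def save_weights(weights, path):
    """Persist the weight vector as a single JSON array."""
    with open(path, "w") as f:
        json.dump(weights, f)


def load_weights(path):
    """Read the weight vector back as a list of floats."""
    with open(path) as f:
        return json.load(f)


def apply_idf(weights, tf):
    """Scale a term-frequency vector element-wise by the IDF weights."""
    return [w * x for w, x in zip(weights, tf)]


# Same term frequencies as in the quoted repro, written as dense lists
docs = [
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 1.0, 0.0, 0.0],
]

# Stand-in for the fitted model; in PySpark: model.idf().toArray().tolist()
weights = fit_idf(docs)

path = os.path.join(tempfile.mkdtemp(), "idf_weights.json")
save_weights(weights, path)

# Later, possibly in another session: reload and reapply
restored = load_weights(path)
tfidf = [apply_idf(restored, tf) for tf in docs]
```

Reloading gives back the weights only, not an `IDFModel` object, so the transform has to be reapplied by hand as in `apply_idf`.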

Sent from my iPhone


> On 15 Jan, 2016, at 7:33 pm, Andy Davidson <a...@santacruzintegration.com> 
> wrote:
> 
> Are you using 1.6.0 or an older version?
> 
> I think I remember something in 1.5.1 saying save was not implemented in 
> Python.
> 
> 
> The current doc does not say anything about save()
> http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
> 
> http://spark.apache.org/docs/latest/ml-guide.html#saving-and-loading-pipelines
> "Often times it is worth it to save a model or a pipeline to disk for later 
> use. In Spark 1.6, a model import/export functionality was added to the 
> Pipeline API. Most basic transformers are supported as well as some of the 
> more basic ML models. Please refer to the algorithm’s API documentation to 
> see if saving and loading is supported."
> 
> andy
> 
> 
> 
> 
> From: Asim Jalis <asimja...@gmail.com>
> Date: Friday, January 15, 2016 at 4:02 PM
> To: "user @spark" <user@spark.apache.org>
> Subject: How To Save TF-IDF Model In PySpark
> 
> Hi,
> 
> I am trying to save a TF-IDF model in PySpark. Looks like this is not
> supported. 
> 
> Using `model.save()` causes:
> 
> AttributeError: 'IDFModel' object has no attribute 'save'
> 
> Using `pickle` causes:
> 
> TypeError: can't pickle lock objects
> 
> Does anyone have suggestions?
> 
> Thanks!
> 
> Asim
> 
> Here is the full repro. Start pyspark shell and then run this code in
> it.
> 
> ```
> # Imports
> from pyspark import SparkContext
> from pyspark.mllib.feature import HashingTF
> 
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.feature import IDF
> 
> # Create some data
> n = 4
> freqs = [
>     Vectors.sparse(n, (1, 3), (1.0, 2.0)), 
>     Vectors.dense([0.0, 1.0, 2.0, 3.0]), 
>     Vectors.sparse(n, [1], [1.0])]
> data = sc.parallelize(freqs)
> idf = IDF()
> model = idf.fit(data)
> tfidf = model.transform(data)
> 
> # View
> for r in tfidf.collect(): print(r)
> 
> # Try to save it
> model.save("foo.model")
> 
> # Try to save it with Pickle
> import pickle
> pickle.dump(model, open("model.p", "wb"))
> pickle.dumps(model)
> ```
