My pipeline (i.e. a 2.0 Pipeline) is mostly made of the built-in
transformers and estimators that come with Spark. One transformer, however,
is custom (i.e. I subclassed Transformer), and all it does is use a UDF to
append a VectorUDT column to a DataFrame.

To speak in more concrete terms, my custom transformer takes two columns
that contain people’s names, and appends a column of features describing
how similar those names are.
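
In code, the shape of it is roughly the sketch below. The class name and the
feature logic are placeholders rather than my real implementation, but the
structure (subclass Transformer, build the new column with a UDF that returns
a VectorUDT) is the same:

    # Placeholder sketch: the class name and similarity features are made up.
    from pyspark.ml import Transformer
    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql.functions import udf

    class NameSimilarityTransformer(Transformer):
        """Appends a vector column describing how similar two name columns are."""

        def __init__(self, inputCols=None, outputCol="name_features"):
            super(NameSimilarityTransformer, self).__init__()
            # Proper Param plumbing is omitted here for brevity.
            self.inputCols = inputCols
            self.outputCol = outputCol

        def _transform(self, dataset):
            def similarity_features(a, b):
                a, b = (a or ""), (b or "")
                # Stand-in features; the real ones are more involved.
                return Vectors.dense([
                    float(a == b),
                    float(a.lower() == b.lower()),
                    float(abs(len(a) - len(b))),
                ])

            features = udf(similarity_features, VectorUDT())
            first, second = self.inputCols
            return dataset.withColumn(
                self.outputCol, features(dataset[first], dataset[second]))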

So I’m not sure where that leaves me with respect to persisting this
Pipeline, which includes my custom Transformer. It sounds like you’re saying
I need to do the work of defining how to save and load this Transformer
myself. Is that correct?

Is there an example I can reference showing how I might do that? Looking at
this implementation
<https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433>
of _to_java() from a built-in Estimator, for example, doesn’t give me any
clues about how I’d write one for my custom Transformer.
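
As far as I can tell, that pattern only applies when there is a JVM
counterpart to hand back. If a Scala implementation of my transformer existed
(it doesn’t), I gather _to_java() would look roughly like this, where the
Scala class name is made up:

    from pyspark.ml.wrapper import JavaParams

    def _to_java(self):
        # Hypothetical: assumes a matching Scala Transformer exists at this
        # made-up class name; mine is implemented purely in Python.
        _java_obj = JavaParams._new_java_obj(
            "com.example.ml.NameSimilarityTransformer", self.uid)
        # ...then copy this transformer's params onto the Java object, e.g.:
        _java_obj.setOutputCol(self.outputCol)
        return _java_obj

Since my transformer has no Java counterpart, there’s nothing for me to
construct and return there.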

Nick

On Fri, Aug 19, 2016 at 3:16 PM Holden Karau <hol...@pigscanfly.ca> wrote:

> I don't think we've given a lot of thought to model persistence for custom
> Python models yet. If the Python model is wrapping a JVM model, using
> JavaMLWritable along with '_to_java' should work, provided your Java model
> is already saveable. On the other hand, if your model isn't wrapping a Java
> model, you shouldn't feel the need to shoehorn yourself into this approach.
> In either case, much of the persistence work is up to you; it's just a
> matter of whether you do it in the JVM or in Python.
>
> On Friday, August 19, 2016, Nicholas Chammas <nicholas.cham...@gmail.com>
> wrote:
>
>> I understand persistence for PySpark ML pipelines is already present in
>> 2.0, and further improvements are being made for 2.1 (e.g. SPARK-13786
>> <https://issues.apache.org/jira/browse/SPARK-13786>).
>>
>> I’m having trouble, though, persisting a pipeline that includes a custom
>> Transformer (see SPARK-17025
>> <https://issues.apache.org/jira/browse/SPARK-17025>). It appears that
>> there is a magic _to_java() method that I need to implement.
>>
>> Is the intention that developers implementing custom Transformers would
>> also specify how they should be persisted, or are there ideas about how to
>> make this automatic? I searched JIRA but I’m not sure whether I missed an
>> issue that already addresses this problem.
>>
>> Nick
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>
