Hello guys,

Have you considered PFA? http://dmg.org/pfa/docs/document_structure/
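A PFA document is plain JSON with three required fields: input, output,
and action. A minimal engine in the spirit of the tutorial examples on
that page (a sketch, untested), which adds 100 to its input:

    {"input": "double",
     "output": "double",
     "action": [
       {"+": ["input", 100]}
     ]}

In PFA expressions a string like "input" is a symbol reference rather
than a literal, so the action reads the input value and returns it
plus 100.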
As Sean noticed, "there are already 1.5 supported formats", and PMML is
quite rigid. There are at least two implementations of PFA:

*Scala* Hadrian: https://github.com/opendatagroup/hadrian
*Python* Titus: https://github.com/opendatagroup/hadrian

Tim

On Sun, Nov 19, 2017 at 2:01 PM, Sean Owen <so...@cloudera.com> wrote:

> To paraphrase, you are mostly suggesting a new API for reading/writing
> models, not a new serialization? And the API should be more like the
> other DataFrame writer APIs, and more extensible?
>
> That's better than introducing any new format for sure, as there are
> already 1.5 supported formats -- the native one and partial PMML
> support. It would also be great to somehow unify those.
>
> The only concern I guess is that it introduces a third API on top of
> the existing 2, and so needs the others to go away in due course for
> this to make sense, but yeah, it makes sense in Spark 3.
>
> On Sat, Nov 18, 2017 at 4:55 PM Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> Hi folks,
>>
>> I've been giving a bit of thought to trying to improve ML exporting
>> in Spark to support a wider variety of formats. If you implement
>> pipeline stages, or you've added your own export logic, I'd
>> especially love your input.
>>
>> A quick little draft of what I've been thinking about (after jumping
>> back into my ancient PR #9207) is as follows:
>>
>> # Background
>>
>> The current Spark ML writer only supports a Spark "internal" format.
>> This is less than ideal, since Spark MLlib supports PMML, and more
>> formats exist. The goal of this design document is to allow more
>> general support for saving Spark ML pipeline stages and models.
>>
>> Additionally, Spark ML has a growing ecosystem of additional pipeline
>> stages outside of core Spark, so any design should be usable by
>> 3rd-party pipeline stages.
>>
>> # Design sketch
>>
>> Spark's DataFrameWriter interface provides a starting point for this
>> design. When writing, the user will be able to specify a path,
>> general options passed to the format, and, importantly, the format
>> itself.
>>
>> Format discovery will be accomplished in a similar manner to Spark
>> Datasources (Java's ServiceLoader); however, since individual model
>> providers may wish to implement their own version of a
>> Spark-supported format, the writer will be looked up by
>> "formatname+pipelinestageclassname".
>>
>> This has the downside of making the code not as easy to trace through
>> as the current structure, but it opens up the possibility of letting
>> folks provide model export in additional formats not supported inside
>> the model itself.
>>
>> # Migration path
>>
>> External pipeline stages may already implement the current MLWriter.
>> To allow these to continue to work, a GeneralMLWriter will be created
>> as a parent class of the current MLWriter, which will handle
>> delegation for other formats as described above.
>>
>> For existing stages, the MLWriter's save function will be changed to
>> check that its input format is the default and delegate to the
>> current saveImpl.
>>
>> We would then deprecate MLWriter in the next version and remove it in
>> Spark 3.
>>
>> Does this sound reasonable to folks? It would allow us to add PMML
>> support in Spark ML pipelines and open it up for other folks to fill
>> in the gaps or add other custom formats.
>>
>> Cheers,
>>
>> Holden :)
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
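For what it's worth, a rough sketch of how the
"formatname+pipelinestageclassname" lookup could sit on top of Java's
ServiceLoader (the trait and method names here are illustrative only,
not a settled API):

    import java.util.ServiceLoader
    import scala.collection.JavaConverters._

    // Hypothetical SPI a format provider would implement and register
    // under META-INF/services, one provider per (format, stage) pair.
    trait MLFormatRegister {
      def format: String     // e.g. "pmml"
      def stageName: String  // fully qualified pipeline stage class name
      def write(path: String, stage: AnyRef,
                options: Map[String, String]): Unit
    }

    object GeneralMLWriter {
      // Find the provider whose key matches format + stage class name.
      def lookup(format: String, stageClass: String): MLFormatRegister =
        ServiceLoader.load(classOf[MLFormatRegister]).asScala
          .find(p => p.format == format && p.stageName == stageClass)
          .getOrElse(throw new IllegalArgumentException(
            s"No writer found for format '$format' and stage '$stageClass'"))
    }

Third-party stages could then just ship a provider on the classpath,
and the default Spark format would be one more registered provider that
GeneralMLWriter delegates to.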