To paraphrase: you're mostly suggesting a new API for reading/writing
models, not a new serialization format? And the API should be more like
the other DataFrame writer APIs, and more extensible?

That's certainly better than introducing any new format, as there are
already 1.5 supported formats -- the native one and partial PMML support.
It would also be great to somehow unify those.

The only concern, I guess, is that it introduces a third API on top of
the existing two, so the others would need to go away in due course for
this to make sense -- but that seems reasonable for Spark 3.

On Sat, Nov 18, 2017 at 4:55 PM Holden Karau <hol...@pigscanfly.ca> wrote:

> Hi folks,
>
> I've been giving a bit of thought to trying to improve ML exporting in
> Spark to support a wider variety of formats. If you implement pipeline
> stages, or you've added your own export logic, I'd especially love your
> input.
>
> A quick little draft of what I've been thinking about (after jumping back
> into my ancient PR #9207) is as follows:
>
> # Background
>
> The current Spark ML writer supports only Spark's "internal" format. This
> is less than ideal, since Spark MLlib supports PMML and more formats exist.
> The goal of this design document is to allow more general support for
> saving Spark ML pipeline stages and models.
>
> Additionally, Spark ML has a growing ecosystem of pipeline stages outside
> of core Spark, so any design should be usable by third-party pipeline
> stages.
>
> # Design sketch
>
> Spark's DataFrameWriter interface provides a starting point for this
> design. When writing, the user will be able to specify a path, the format,
> and general options passed on to that format.
>
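> A rough sketch of the user-facing call, shaped like DataFrameWriter (the
> format method, the "compression" option, and the path here are
> illustrative assumptions, not a settled API):
>
>     // Hypothetical: export a fitted PipelineModel in a chosen format.
>     val model = pipeline.fit(training)
>     model.write
>       .format("pmml")                   // choose the serialization
>       .option("compression", "gzip")    // free-form, format-specific option
>       .save("/models/my-pipeline")
>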
> Format discovery will be accomplished in a manner similar to Spark
> DataSources (via Java's ServiceLoader). However, since individual model
> providers may wish to implement their own version of a Spark-supported
> format, the writer will be looked up by "formatname+pipelinestageclassname".
>
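> As a minimal sketch of that lookup (the trait and object names are made
> up for illustration; only ServiceLoader itself is a given):
>
>     import java.util.ServiceLoader
>     import scala.collection.JavaConverters._
>     import org.apache.spark.ml.PipelineStage
>
>     // Hypothetical provider interface, registered in META-INF/services
>     // and discovered via ServiceLoader, like Spark DataSources.
>     trait MLFormatRegister {
>       def format(): String     // e.g. "pmml"
>       def stageName(): String  // fully-qualified pipeline stage class name
>       def write(path: String, stage: PipelineStage): Unit
>     }
>
>     object MLFormatDiscovery {
>       // Find the provider matching "formatname+pipelinestageclassname".
>       def lookupWriter(format: String, stageName: String): MLFormatRegister =
>         ServiceLoader.load(classOf[MLFormatRegister]).asScala
>           .find(p => p.format() == format && p.stageName() == stageName)
>           .getOrElse(throw new IllegalArgumentException(
>             s"No writer registered for $format+$stageName"))
>     }
>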
> This has the downside that the code is not necessarily as easy to trace
> through as the current structure, but it opens up the possibility of
> letting folks provide model export in additional formats not supported
> inside the model itself.
>
> # Migration path
>
> External pipeline stages may already implement the current MLWriter. To
> allow these to continue to work, a GeneralMLWriter will be created as a
> parent class of the current MLWriter, and it will handle delegation to
> other formats as described above.
>
> For existing stages, MLWriter's save function will be changed to check
> whether its input format is the default and, if so, delegate to the
> current saveImpl.
>
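> As a sketch of that delegation, building on the lookup above (again
> illustrative: the constructor shape and the overridable save are
> assumptions, not the current MLWriter contract):
>
>     import org.apache.spark.ml.PipelineStage
>     import org.apache.spark.ml.util.MLWriter
>
>     // Hypothetical: the default format keeps the existing saveImpl path;
>     // anything else is routed through the ServiceLoader lookup.
>     abstract class GeneralMLWriter(stage: PipelineStage) extends MLWriter {
>       private var source: String = "internal"
>
>       def format(fmt: String): this.type = { source = fmt; this }
>
>       override def save(path: String): Unit = {
>         if (source == "internal") {
>           super.save(path)  // unchanged behavior: Spark's native format
>         } else {
>           MLFormatDiscovery.lookupWriter(source, stage.getClass.getName)
>             .write(path, stage)
>         }
>       }
>     }
>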
> We would then deprecate MLWriter in the next version and remove it in
> Spark 3.
>
> Does this sound reasonable to folks? It would allow us to add PMML support
> in Spark ML pipelines and open it up for other folks to fill in the gaps or
> add other custom formats.
>
> Cheers,
>
> Holden :)
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>
