Hello guys,

Have you considered PFA? http://dmg.org/pfa/docs/document_structure/
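A PFA document is plain JSON with three required fields: input, output,
and action. A minimal engine in the spirit of the tutorial examples on
that page (a sketch, untested), which adds 100 to its input:

    {"input": "double",
     "output": "double",
     "action": [
       {"+": ["input", 100]}
     ]}

In PFA expressions a string like "input" is a symbol reference rather
than a literal, so the action reads the input value and returns it
plus 100.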
As Sean noticed, "there are already 1.5 supported formats", and PMML is
quite rigid. There are at least two implementations of PFA:

*Scala* Hadrian: https://github.com/opendatagroup/hadrian
*Python* Titus: https://github.com/opendatagroup/hadrian

Tim

On Sun, Nov 19, 2017 at 2:01 PM, Sean Owen <so...@cloudera.com> wrote:

> To paraphrase, you are mostly suggesting a new API for reading/writing
> models, not a new serialization? And the API should be more like the
> other DataFrame writer APIs, and more extensible?
>
> That's better than introducing any new format for sure, as there are
> already 1.5 supported formats -- the native one and partial PMML
> support. It would also be great to somehow unify those.
>
> The only concern I guess is that it introduces a third API on top of
> the existing 2, and so needs the others to go away in due course for
> this to make sense, but yeah, it makes sense in Spark 3.
>
> On Sat, Nov 18, 2017 at 4:55 PM Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> Hi folks,
>>
>> I've been giving a bit of thought to trying to improve ML exporting
>> in Spark to support a wider variety of formats. If you implement
>> pipeline stages, or you've added your own export logic, I'd
>> especially love your input.
>>
>> A quick little draft of what I've been thinking about (after jumping
>> back into my ancient PR #9207) is as follows:
>>
>> # Background
>>
>> The current Spark ML writer only supports a Spark "internal" format.
>> This is less than ideal, since Spark MLlib supports PMML, and more
>> formats exist. The goal of this design document is to allow more
>> general support for saving Spark ML pipeline stages and models.
>>
>> Additionally, Spark ML has a growing ecosystem of additional pipeline
>> stages outside of core Spark, so any design should be usable by
>> 3rd-party pipeline stages.
>>
>> # Design sketch
>>
>> Spark's DataFrameWriter interface provides a starting point for this
>> design. When writing, the user will be able to specify a path,
>> general options passed to the format, and, importantly, the format
>> itself.
>>
>> Format discovery will be accomplished in a similar manner to Spark
>> Datasources (Java's ServiceLoader); however, since individual model
>> providers may wish to implement their own version of a
>> Spark-supported format, the writer will be looked up by
>> "formatname+pipelinestageclassname".
>>
>> This has the downside of making the code not as easy to trace through
>> as the current structure, but it opens up the possibility of letting
>> folks provide model export in additional formats not supported inside
>> the model itself.
>>
>> # Migration path
>>
>> External pipeline stages may already implement the current MLWriter.
>> To allow these to continue to work, a GeneralMLWriter will be created
>> as a parent class of the current MLWriter, which will handle
>> delegation for other formats as described above.
>>
>> For existing stages, the MLWriter's save function will be changed to
>> check that its input format is the default and delegate to the
>> current saveImpl.
>>
>> We would then deprecate MLWriter in the next version and remove it in
>> Spark 3.
>>
>> Does this sound reasonable to folks? It would allow us to add PMML
>> support in Spark ML pipelines and open it up for other folks to fill
>> in the gaps or add other custom formats.
>>
>> Cheers,
>>
>> Holden :)
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
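For what it's worth, a rough sketch of how the
"formatname+pipelinestageclassname" lookup could sit on top of Java's
ServiceLoader (the trait and method names here are illustrative only,
not a settled API):

    import java.util.ServiceLoader
    import scala.collection.JavaConverters._

    // Hypothetical SPI a format provider would implement and register
    // under META-INF/services, one provider per (format, stage) pair.
    trait MLFormatRegister {
      def format: String     // e.g. "pmml"
      def stageName: String  // fully qualified pipeline stage class name
      def write(path: String, stage: AnyRef,
                options: Map[String, String]): Unit
    }

    object GeneralMLWriter {
      // Find the provider whose key matches format + stage class name.
      def lookup(format: String, stageClass: String): MLFormatRegister =
        ServiceLoader.load(classOf[MLFormatRegister]).asScala
          .find(p => p.format == format && p.stageName == stageClass)
          .getOrElse(throw new IllegalArgumentException(
            s"No writer found for format '$format' and stage '$stageClass'"))
    }

Third-party stages could then just ship a provider on the classpath,
and the default Spark format would be one more registered provider that
GeneralMLWriter delegates to.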