I'm thinking of mechanisms like: https://github.com/apache/spark/blob/c5daccb1dafca528ccb4be65d63c943bf9a7b0f2/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala#L99
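Roughly, that pattern stores an explicit format version in the saved metadata and dispatches on it in load(), independent of which Spark version wrote the model. Here is a minimal sketch of the shape - the class, paths, and field names are invented for illustration, and the metadata reading is simplified compared to mllib's internal Loader:

import java.io.IOException

import org.apache.spark.SparkContext
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical model class, for illustration only.
case class MyModel(weights: Array[Double])

object MyModel {
  private val thisClassName = "org.example.MyModel"

  // Dispatch on the explicit format version written into the metadata,
  // not on the Spark version that produced the files.
  def load(sc: SparkContext, path: String): MyModel = {
    implicit val formats: Formats = DefaultFormats
    val metadata = parse(sc.textFile(s"$path/metadata").first())
    val className = (metadata \ "class").extract[String]
    val version = (metadata \ "version").extract[String]
    (className, version) match {
      case (`thisClassName`, "1.0") => loadV1_0(sc, path)
      // A "2.0" case would be added here when the saved layout changes.
      case _ => throw new IOException(
        s"MyModel.load did not recognize model with (class: $className, version: $version)")
    }
  }

  // Reader for the 1.0 layout: one weight per line under data/.
  private def loadV1_0(sc: SparkContext, path: String): MyModel =
    MyModel(sc.textFile(s"$path/data").map(_.toDouble).collect())
}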
On Wed, Jan 16, 2019 at 9:46 AM Ilya Matiach <il...@microsoft.com> wrote:
>
> Hi Sean and Jatin,
> Could you point to some examples of load() methods that use the Spark
> version vs the model version (or the columns available)? I see only cases
> where we use the Spark version (e.g.
> https://github.com/apache/spark/blob/c04ad17ccf14a07ffdb2bf637124492a341075f2/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L1239).
>
> Logically, I think it would be better to use a separate versioning
> mechanism for models than to use the Spark version; IMHO it is more
> reliable that way. Especially since we sometimes merge fixes back into
> patch versions of Spark, it seems less reliable to depend on a specific
> Spark version in the code. In addition, models don't change as frequently
> as the Spark version, and an explicit versioning mechanism makes it
> clearer how often the saved model structure has changed over time. Having
> said that, I think you could implement it either way without issues if the
> code is written carefully - but logically, if I had to choose, I would
> prefer a separate versioning mechanism for models.
> Thank you, Ilya
>
> -----Original Message-----
> From: Sean Owen <sro...@gmail.com>
> Sent: Wednesday, January 16, 2019 10:12 AM
> To: dev <dev@spark.apache.org>
> Cc: Jatin Puri <purija...@gmail.com>
> Subject: How to implement model versions in MLlib?
>
> I know some implementations of model save/load in MLlib use an explicit
> version 1.0, 2.0, 3.0 mechanism. I've also seen that some just decide
> based on the version of Spark that wrote the model.
>
> Is one or the other preferred?
>
> See https://github.com/apache/spark/pull/23549#discussion_r248318392 for
> example. In cases like this, is it simpler still to just select all the
> values written in the model and decide what to do based on the presence or
> absence of columns? That seems a little more robust. It wouldn't be so
> much an option if the contents or meaning of the columns had changed.
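For comparison, the column-sniffing alternative I floated would look something like the sketch below. The column names are invented just to show the shape; the point is that the loader branches on what was actually written rather than on a version tag:

import org.apache.spark.sql.{DataFrame, SparkSession}

object SchemaSniffingLoader {
  // Decide which format was saved by inspecting the columns present in
  // the data, instead of consulting an explicit version in the metadata.
  def loadModelData(spark: SparkSession, path: String): DataFrame = {
    val df = spark.read.parquet(s"$path/data")
    if (df.columns.contains("coefficientMatrix")) {
      // Newer layout: coefficients saved as a matrix, intercepts as a vector.
      df.select("coefficientMatrix", "interceptVector")
    } else {
      // Older layout: a single coefficient vector and a scalar intercept.
      df.select("coefficients", "intercept")
    }
  }
}

As noted above, this only works when columns are added or removed between formats; it can't distinguish formats where an existing column's contents or meaning changed.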