I'm thinking of mechanisms like:
https://github.com/apache/spark/blob/c5daccb1dafca528ccb4be65d63c943bf9a7b0f2/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala#L99
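
The idea there is an explicit format version recorded in the saved
metadata, with load() dispatching on it. A minimal sketch of that shape
(MyModel, loadMetadata and the per-version loaders below are
illustrative stand-ins, not the actual FPGrowth code):

  import java.io.IOException
  import org.apache.spark.SparkContext

  object MyModelLoader {
    def load(sc: SparkContext, path: String): MyModel = {
      // Assume loadMetadata reads the JSON metadata file written at
      // save time, which records (className, formatVersion, params).
      val (className, formatVersion, _) = loadMetadata(sc, path)
      formatVersion match {
        case "1.0" => SaveLoadV1_0.load(sc, path)
        case "2.0" => SaveLoadV2_0.load(sc, path)
        case other => throw new IOException(
          s"Cannot load $className from $path: unknown format version $other")
      }
    }
  }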

On Wed, Jan 16, 2019 at 9:46 AM Ilya Matiach <il...@microsoft.com> wrote:
>
> Hi Sean and Jatin,
> Could you point to some examples of load() methods that use the Spark version
> vs. the model version (or the columns available)?
> I see only cases where we use the Spark version (e.g.
> https://github.com/apache/spark/blob/c04ad17ccf14a07ffdb2bf637124492a341075f2/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L1239)
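>
> For reference, that reader branches on the Spark version recorded in
> the saved metadata, roughly like this (a paraphrased sketch; the two
> helper functions are hypothetical stand-ins for the actual row parsing):
>
>   import org.apache.spark.util.VersionUtils
>
>   val (major, minor) = VersionUtils.majorMinorVersion(metadata.sparkVersion)
>   val model = if (major < 2 || (major == 2 && minor == 0)) {
>     // Written by Spark 2.0 or earlier: old on-disk layout
>     loadPre21Layout(dataPath)
>   } else {
>     // Written by Spark 2.1+: current layout
>     loadCurrentLayout(dataPath)
>   }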
>
> Logically, I think it would be better to use a separate versioning mechanism
> for models than to rely on the Spark version; IMHO it is more reliable that
> way. Especially since we sometimes backport fixes into patch versions of
> Spark, depending on a specific Spark version in the code seems less reliable.
> In addition, models don't change as frequently as the Spark version, and
> having an explicit versioning mechanism makes it clearer how often the saved
> model structure has changed over time.  Having said that, I think you could
> implement it either way without issues if the code is written carefully - but
> logically, if I had to choose, I would prefer a separate versioning
> mechanism for models.
> Thank you, Ilya
>
>
>
> -----Original Message-----
> From: Sean Owen <sro...@gmail.com>
> Sent: Wednesday, January 16, 2019 10:12 AM
> To: dev <dev@spark.apache.org>
> Cc: Jatin Puri <purija...@gmail.com>
> Subject: How to implement model versions in MLlib?
>
> I know some implementations of model save/load in MLlib use an explicit
> version mechanism ("1.0", "2.0", "3.0"). I've also seen that some just decide
> based on the version of Spark that wrote the model.
>
> Is one or the other preferred?
>
> See 
> https://github.com/apache/spark/pull/23549#discussion_r248318392
> for example. In cases like this, is it simpler still to just select all the
> values written in the model and decide what to do based on the presence or
> absence of columns? That seems a little more robust. It wouldn't be as viable
> an option, though, if the contents or meaning of the columns had changed.
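>
> Concretely, something like this sketch (the column names, MyModel and
> defaultValue are made up for illustration):
>
>   import org.apache.spark.ml.linalg.Vector
>   import org.apache.spark.sql.Row
>
>   val data = sparkSession.read.parquet(dataPath)
>   val model = if (data.schema.fieldNames.contains("newField")) {
>     // Written by a newer version that added "newField"
>     val Row(coefficients: Vector, newField: Double) =
>       data.select("coefficients", "newField").head()
>     new MyModel(coefficients, newField)
>   } else {
>     // Older model: "newField" is absent, so fall back to a default
>     val Row(coefficients: Vector) = data.select("coefficients").head()
>     new MyModel(coefficients, defaultValue)
>   }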
>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
