Hi, does anyone know whether Spark's model serialization format can support feature evolution with backward compatibility? Specifically, when new data that includes both the old features and newly added features is fed into an old model trained on the old feature set, the old model should be able to ignore the newly added features and use only the old features to generate predictions.


However, it seems Spark's model serialization format only includes the weights, with no information about how the weights align with feature ids / names. As a result, adding new features causes prediction to throw the following exception.



from pyspark.ml.classification import LogisticRegressionModel

sameModel = LogisticRegressionModel.load("sample-model-old")  # old model trained with old features

predictions = sameModel.transform(newData)  # new data with both old and new features


Caused by: java.lang.IllegalArgumentException: requirement failed: BLAS.dot(x: Vector, y: Vector) was given Vectors with non-matching sizes: x.size = 100, y.size = 49
  at scala.Predef$.require(Predef.scala:233)
  at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104)
  at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$27.apply(LogisticRegression.scala:753)
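
For what it's worth, the mismatch is easy to confirm from the loaded model itself: the serialized coefficients carry only a size, not feature identities (a quick diagnostic, reusing sameModel from the snippet above):

print(len(sameModel.coefficients))  # 49: the dimension the old model expects
# The coefficient vector only knows its size; nothing in the saved model
# records which feature name occupies which slot.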



It seems the serialized model format should include the feature id / name for each weight. That way, the model could align features and weights properly, and any newly added feature would simply get weight zero. The old model still wouldn't take advantage of the new features until it is retrained, but at least prediction wouldn't fail. A workaround along those lines is sketched below.
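
In the meantime, the alignment can be done by hand before scoring, by projecting the new vectors back down to the old feature slots. A minimal sketch using VectorSlicer, assuming (purely for illustration) that the old 49 features occupy the first 49 positions of the new 100-dimensional vector and that the vector column is named "features":

from pyspark.ml.feature import VectorSlicer
from pyspark.ml.classification import LogisticRegressionModel

# Assumption: the old model's 49 features sit in slots 0..48 of the new vector.
slicer = VectorSlicer(inputCol="features", outputCol="oldFeatures",
                      indices=list(range(49)))

oldModel = LogisticRegressionModel.load("sample-model-old")

# Slice out the old feature slots, then rename the column so the old model
# sees the same features column name it was trained with.
sliced = (slicer.transform(newData)
          .drop("features")
          .withColumnRenamed("oldFeatures", "features"))

predictions = oldModel.transform(sliced)

Of course this still relies on knowing the feature ordering out-of-band, which is exactly the information the serialized format doesn't carry. If the feature column has ML attribute metadata (e.g. from VectorAssembler), VectorSlicer can also select by feature names instead of indices, which is closer to the name-based alignment described above.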


This functionality is important when we need to evolve models in production. We often discover new features that improve a model's accuracy, and we need a reliable mechanism for upgrading the data pipeline and the models together.


Thanks.


Ming
