We have a feature engineering transformer defined as a custom class with a UDF, as follows:

import java.util.UUID.randomUUID

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable}
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{DataTypes, DoubleType, IntegerType, StructType}

class FeatureModder extends Transformer with DefaultParamsWritable with DefaultParamsReadable[FeatureModder] {
    val uid: String = "FeatureModder" + randomUUID

    final val inputCol: Param[String] = new Param[String](this, "inputCol", "input column")
    final def setInputCol(value: String) = set(inputCol, value)

    final val outputCol: Param[String] = new Param[String](this, "outputCol", "output column")
    final def setOutputCol(value: String) = set(outputCol, value)

    final val size: Param[String] = new Param[String](this, "size", "length of output vector")
    final def setSize(n: Int) = set(size, n.toString)

    // compute inputCol modulo `size` and store the result in outputCol
    override def transform(data: Dataset[_]): DataFrame = {
        val modUDF = udf { n: Int => n % $(size).toInt }
        data.withColumn($(outputCol), modUDF(col($(inputCol)).cast(IntegerType)))
    }

    override def transformSchema(schema: StructType): StructType = {
        val actualType = schema($(inputCol)).dataType
        require(actualType.equals(IntegerType) || actualType.equals(DoubleType), "Input column must be of numeric type")
        DataTypes.createStructType(schema.fields :+ DataTypes.createStructField($(outputCol), IntegerType, false))
    }

    override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}

This transformer was included in an ML pipeline, which was fitted into a model and persisted to a disk file. When we try to load the pipeline model in a separate notebook (we use Zeppelin), an exception is thrown complaining that the class was not found.
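
For reference, the save and load steps look roughly like this (the stage parameters, data frame, and path below are illustrative rather than our exact code):

import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}

// notebook 1: fit the pipeline and persist the fitted model
val pipeline = new Pipeline().setStages(Array[PipelineStage](
    new FeatureModder().setInputCol("id").setOutputCol("idMod").setSize(10)
))
val model = pipeline.fit(trainingDF)   // trainingDF stands in for our training DataFrame
model.write.overwrite().save("/tmp/feature-pipeline")

// notebook 2 (a separate Zeppelin notebook): reload the persisted model
val reloaded = PipelineModel.load("/tmp/feature-pipeline")

The PipelineModel.load call is the one that fails: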

java.lang.ClassNotFoundException: $line103090609224.$read$FeatureModder
    at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:72)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    at java.base/java.lang.Class.forName0(Native Method)
    at java.base/java.lang.Class.forName(Class.java:398)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:207)
    at org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstanceReader(ReadWrite.scala:630)
    at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:276)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at scala.collection.TraversableLike.map(TraversableLike.scala:238)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
    at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
    at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
    at scala.util.Try$.apply(Try.scala:213)
    at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
    at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
    at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
    at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
    at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
    at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
    at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
    at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
    at scala.util.Try$.apply(Try.scala:213)
    at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
    at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
    at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349)
    at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:355)
    at org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:355)
    at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:337)
    ... 40 elided

Could someone help explain why? My guess is that the class definition is not on the classpath. The question is: how do we include the class definition or class metadata as part of the pipeline model serialization, or otherwise make it available to the notebook? (We did already include the class definition in the notebook that loads the pipeline model.)

Thanks a lot in advance for your help!

ND
