Hi all,
I've been working with RandomForestClassificationModel from Spark MLlib 2.0.2.

I encountered two frustrating issues and would really appreciate some
advice:

1)  RandomForestClassificationModel is effectively not serializable (I
assume it references something that can't be serialized, since the class
itself extends Serializable), so I ended up with the well-known exception:
org.apache.spark.SparkException: Task not serializable.
My original intention was to pass the model as a parameter, because the
model we use is chosen dynamically, depending on the record we are
predicting on.
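
To illustrate the assumption above, here is a minimal sketch using plain JDK
serialization only. It is not Spark code: ModelLike and Handle are
hypothetical stand-ins, and which field (if any) of the real model is
responsible is not established here. It only shows how a class that declares
Serializable can still fail to serialize because of something it references:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializableCheck {
    // Hypothetical stand-in for an internal object that is NOT Serializable.
    static class Handle {
        final String name = "not-serializable";
    }

    // Hypothetical stand-in for a model class: it declares Serializable,
    // but one of its non-transient fields does not.
    static class ModelLike implements Serializable {
        private static final long serialVersionUID = 1L;
        final double[] weights = {0.1, 0.2};
        final Handle handle = new Handle();
    }

    // Returns null if obj serializes cleanly; otherwise returns the message
    // of the NotSerializableException, which names the offending class.
    static String firstNonSerializable(Object obj) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(obj);
            return null;
        } catch (NotSerializableException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // Reports the Handle class as the blocker, not ModelLike itself.
        System.out.println(firstNonSerializable(new ModelLike()));
        // A plain double[] serializes fine, so this prints null.
        System.out.println(firstNonSerializable(new double[] {1.0}));
    }
}
```

Running the same kind of check against the real model (or against whatever
the map function captures) might help narrow down which referenced object
triggers the exception.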

Has anyone else encountered this? Is this currently being addressed? I
would expect objects from Spark's own libraries to be usable seamlessly in
applications without these kinds of exceptions.

2) The RandomForestClassificationModel.load method appears to hang
indefinitely when executed from inside a map function (which I assume is
passed to the executor). So, I basically cannot load a model from a worker.
We have multiple "profiles" that use differently trained models, which are
accessed from within a map function to run predictions on different sets of
data.
The most pertinent frame in the hanging thread's stack trace is:
org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:391)
Looking at the code on GitHub, it appears that this calls sc.textFile. I
could not find anything stating that this particular function would not
work from within a map function.

Are there any suggestions as to how I can get this model to work in a real
production job (either by making it serializable so it can be passed around,
or by loading it from a worker)?

I've extensively POCed this model (saving, loading, transforming, training,
etc.); however, this is the first time I'm attempting to use it from within
a real application.

Sumona
