Hello, I'm using a RandomForest pipeline (ml package). Everything is working fine (training models, prediction, etc.), but I'd like to tune it for the case where I predict on a small dataset.

My issue is with prediction latency when I apply

    model.transform(dataset)   // model is a PipelineModel

where the model consists of the following stages:

    StringIndexerModel labelIndexer = new StringIndexer()...
    RandomForestClassifier classifier = new RandomForestClassifier()...
    IndexToString labelConverter = new IndexToString()...
    Pipeline pipeline = new Pipeline().setStages(
        new PipelineStage[]{labelIndexer, classifier, labelConverter});

It obviously takes some time to predict, but when my dataset consists of just one record, I'd expect it to be really fast.
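For concreteness, the serving side looks roughly like this (the paths, app name, and one-row input source are illustrative, and I'm assuming the Spark 2.x Java API here):

    import org.apache.spark.ml.PipelineModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ServingLoop {
        public static void main(String[] args) {
            // One long-lived session on the driver; the fitted pipeline is loaded once.
            SparkSession spark = SparkSession.builder().appName("rf-serving").getOrCreate();
            PipelineModel model = PipelineModel.load("/path/to/fitted-pipeline");

            // A tiny dataset with a single input row (illustrative source).
            Dataset<Row> single = spark.read().parquet("/path/to/one-row-input");

            // The same model and the same one-row dataset, scored repeatedly.
            for (int i = 0; i < 10; i++) {
                long start = System.currentTimeMillis();
                model.transform(single).collect();  // collect() forces the job to run
                System.out.println("prediction " + i + " took "
                        + (System.currentTimeMillis() - start) + " ms");
            }
        }
    }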
My observation is that even though the dataset is tiny, Spark broadcasts something over and over again. That's fine when I load my (serialized) model from disk and use it just once for prediction, but when I use the same model in a loop on the very same dataset, everything should already be on the worker nodes, so I'd expect prediction to be fast. In practice it takes 20 seconds to predict the dataset the first time (with one input row), and all subsequent predictions on the same dataset with the same model take roughly 10 seconds. My goal is a 0.5 - 1 second response.

My intention was to keep the learned model on a driver that stays online with its SparkContext and use it for all subsequent predictions, but these 10-second predictions basically kill the whole idea.

Is it possible to distribute the model over the cluster upfront so that prediction is really fast? Are there any specific params to apply to the PipelineModel so that it stays resident on the worker nodes? Anything to keep and reuse the broadcasted data?

Thanks in advance.

--
Be well!
Jean Morozov
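P.S. To find out which stage dominates the per-call overhead, I'm planning to pull the fitted stages out of the PipelineModel and time them one at a time. A rough sketch (again assuming the Spark 2.x Java API; StageTimer/timeStages are just illustrative names):

    import org.apache.spark.ml.PipelineModel;
    import org.apache.spark.ml.Transformer;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class StageTimer {
        // Apply the fitted stages one at a time to see which one dominates the latency.
        public static void timeStages(PipelineModel model, Dataset<Row> input) {
            Dataset<Row> current = input;
            for (Transformer stage : model.stages()) {
                long start = System.nanoTime();
                // Cache and materialize each stage's output so the next timing
                // doesn't re-run the upstream stages.
                current = stage.transform(current).cache();
                current.count();
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println(stage.uid() + " took " + elapsedMs + " ms");
            }
        }
    }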