Hello, I'm using a RandomForest pipeline (ml package). Everything is working fine (training models, prediction, etc.), but I'd like to tune it for the case where I predict on a small dataset.

My issue is with prediction latency when I apply

    model.transform(dataset)   // model is a PipelineModel

where the model consists of the following stages:

    StringIndexerModel labelIndexer = new StringIndexer()...
    RandomForestClassifier classifier = new RandomForestClassifier()...
    IndexToString labelConverter = new IndexToString()...
    Pipeline pipeline = new Pipeline().setStages(
        new PipelineStage[]{labelIndexer, classifier, labelConverter});

It obviously takes some time to predict, but when my dataset consists of just one record, I'd expect it to be really fast.
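For concreteness, the serving side looks roughly like this (the paths, app name, and one-row input source are illustrative, and I'm assuming the Spark 2.x Java API here):

    import org.apache.spark.ml.PipelineModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ServingLoop {
        public static void main(String[] args) {
            // One long-lived session on the driver; the fitted pipeline is loaded once.
            SparkSession spark = SparkSession.builder().appName("rf-serving").getOrCreate();
            PipelineModel model = PipelineModel.load("/path/to/fitted-pipeline");

            // A tiny dataset with a single input row (illustrative source).
            Dataset<Row> single = spark.read().parquet("/path/to/one-row-input");

            // The same model and the same one-row dataset, scored repeatedly.
            for (int i = 0; i < 10; i++) {
                long start = System.currentTimeMillis();
                model.transform(single).collect();  // collect() forces the job to run
                System.out.println("prediction " + i + " took "
                        + (System.currentTimeMillis() - start) + " ms");
            }
        }
    }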
My observation is that even though the dataset is tiny, Spark broadcasts something over and over again. That's fine when I load my (serialized) model from disk and use it just once for prediction, but when I use the same model in a loop on the very same dataset, everything should already be on the worker nodes, so I'd expect prediction to be fast. In practice it takes 20 seconds to predict the dataset the first time (with one input row), and all subsequent predictions on the same dataset with the same model take roughly 10 seconds. My goal is a 0.5 - 1 second response.

My intention was to keep the learned model on a driver that stays online with its SparkContext and use it for all subsequent predictions, but these 10-second predictions basically kill the whole idea.

Is it possible to distribute the model over the cluster upfront so that prediction is really fast? Are there any specific params to apply to the PipelineModel so that it stays resident on the worker nodes? Anything to keep and reuse the broadcasted data?

Thanks in advance.

--
Be well!
Jean Morozov
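P.S. To find out which stage dominates the per-call overhead, I'm planning to pull the fitted stages out of the PipelineModel and time them one at a time. A rough sketch (again assuming the Spark 2.x Java API; StageTimer/timeStages are just illustrative names):

    import org.apache.spark.ml.PipelineModel;
    import org.apache.spark.ml.Transformer;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class StageTimer {
        // Apply the fitted stages one at a time to see which one dominates the latency.
        public static void timeStages(PipelineModel model, Dataset<Row> input) {
            Dataset<Row> current = input;
            for (Transformer stage : model.stages()) {
                long start = System.nanoTime();
                // Cache and materialize each stage's output so the next timing
                // doesn't re-run the upstream stages.
                current = stage.transform(current).cache();
                current.count();
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println(stage.uid() + " took " + elapsedMs + " ms");
            }
        }
    }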