Hi Sean and Aseem, thanks both. A simple change that sped things up greatly was to run our SQL query (effectively for one record) directly and convert the result to a DataFrame ourselves, rather than using Spark to load it. Sounds stupid, but this took us from >5 seconds to ~1 second on a very small instance.
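In case it's useful to anyone, here's roughly what that looks like for us now - heavily simplified, and the connection URL, table and feature columns are just placeholders (this assumes Spark 2.x's SparkSession):

// Rough sketch only -- the connection URL, table and feature columns are made up.
// Fetch the single record with plain JDBC and build a one-row DataFrame ourselves,
// instead of having Spark load it (e.g. via spark.read.jdbc) on every request.
import java.sql.DriverManager
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

def loadOneRecord(spark: SparkSession, id: Long): DataFrame = {
  val schema = StructType(Seq(
    StructField("feature1", DoubleType, nullable = false),
    StructField("feature2", DoubleType, nullable = false)))

  val conn = DriverManager.getConnection("jdbc:postgresql://db-host/our_db")  // placeholder URL
  try {
    val stmt = conn.prepareStatement(
      "SELECT feature1, feature2 FROM records WHERE id = ?")
    stmt.setLong(1, id)
    val rs = stmt.executeQuery()
    val rows = ArrayBuffer.empty[Row]
    while (rs.next()) {
      rows += Row(rs.getDouble("feature1"), rs.getDouble("feature2"))
    }
    // createDataFrame(java.util.List[Row], schema) gives us the one-row DataFrame
    // that we then pass to model.transform
    spark.createDataFrame(rows.asJava, schema)
  } finally {
    conn.close()
  }
}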
Aseem: can you explain your solution a bit more? I'm not sure I understand it. At the moment we load our models from S3 (RandomForestClassificationModel.load(..)) and store them in an object property so that they persist across requests - this is in Scala. Is this essentially what you mean? (I've pasted a rough sketch of what we do at the bottom of this mail.)

On 12 October 2016 at 10:52, Aseem Bansal <asmbans...@gmail.com> wrote:

> Hi
>
> Faced a similar issue. Our solution was to load the model, cache it after
> converting it to a model from mllib, and then use that instead of the ml model.
>
> On Tue, Oct 11, 2016 at 10:22 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> I don't believe it will ever scale to spin up a whole distributed job to
>> serve one request. You can possibly look at the bits in mllib-local. You
>> might do well to export as something like PMML, either with Spark's export
>> or JPMML, and then load it into a web container and score it, without Spark
>> (possibly also with JPMML, OpenScoring).
>>
>> On Tue, Oct 11, 2016, 17:53 Nicolas Long <nicolasl...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> so I have a model which has been stored in S3. And I have a Scala webapp
>>> which for certain requests loads the model and transforms submitted data
>>> against it.
>>>
>>> I'm not sure how to run this quickly on a single instance though. At the
>>> moment Spark is being bundled up with the web app in an uberjar (sbt
>>> assembly).
>>>
>>> But the process is quite slow. I'm aiming for responses < 1 sec so that
>>> the webapp can respond quickly to requests. When I look at the Spark UI I see:
>>>
>>> Summary Metrics for 1 Completed Tasks
>>>
>>> Metric                      Min     25th percentile  Median  75th percentile  Max
>>> Duration                    94 ms   94 ms            94 ms   94 ms            94 ms
>>> Scheduler Delay             0 ms    0 ms             0 ms    0 ms             0 ms
>>> Task Deserialization Time   3 s     3 s              3 s     3 s              3 s
>>> GC Time                     2 s     2 s              2 s     2 s              2 s
>>> Result Serialization Time   0 ms    0 ms             0 ms    0 ms             0 ms
>>> Getting Result Time         0 ms    0 ms             0 ms    0 ms             0 ms
>>> Peak Execution Memory       0.0 B   0.0 B            0.0 B   0.0 B            0.0 B
>>>
>>> I don't really understand why deserialization and GC should take so long
>>> when the models are already loaded. Is this evidence I am doing something
>>> wrong? And where can I get a better understanding of how Spark works under
>>> the hood here, and how best to do a standalone/bundled jar deployment?
>>>
>>> Thanks!
>>>
>>> Nic
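PS: Aseem, for context, this is roughly how we hold on to the model at the moment (simplified - the S3 path and object name are just placeholders):

// Rough sketch of our current setup -- the path and object name are illustrative.
import org.apache.spark.ml.classification.RandomForestClassificationModel

object ModelHolder {
  // lazy val: the model is loaded from S3 once, on first use,
  // and reused by every subsequent request in the same JVM.
  lazy val model: RandomForestClassificationModel =
    RandomForestClassificationModel.load("s3a://our-bucket/models/rf-model")
}

Each request then just calls ModelHolder.model.transform(...) on the one-row DataFrame.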