I am glad to see DB's comments; they make me feel I am not the only one facing 
these issues. If we were able to use MLlib to load a model in web applications 
(outside the Spark cluster), that would solve the issue. I understand Spark is 
mainly for processing big data in a distributed mode. But there is little point 
in training a model with MLlib if we cannot use it in the applications that 
need to access it.

Thanks
Viju

From: DB Tsai [mailto:dbt...@dbtsai.com]
Sent: Thursday, November 12, 2015 11:04 AM
To: Sean Owen
Cc: Felix Cheung; Nirmal Fernando; Andy Davidson; Adrian Tanase; user @spark; 
Xiangrui Meng; hol...@pigscanfly.ca
Subject: Re: thought experiment: use spark ML to real time prediction

I think the use case can be quite different from PMML's.

Having a Spark-platform-independent ML jar would empower users to do the 
following:

1) PMML doesn't cover all the models we have in MLlib. Also, for an ML 
pipeline trained by Spark, most of the time PMML is not expressive enough to 
capture all the transformations we have in Spark ML. As a result, if we were 
able to serialize the entire Spark ML pipeline after training, and then load it 
back in an application without any Spark platform for production scoring, this 
would be very useful for production deployment of Spark ML models. The only 
issue is that if a transformer involves a shuffle, we need to figure out a way 
to handle it. When I chatted with Xiangrui about this, he suggested that we may 
tag whether a transformer is shuffle ready. Currently, at Netflix, we are not 
able to use ML pipelines because of these issues, and we have to write our own 
scorers for production, which is quite a lot of duplicated work.
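As a rough illustration of what scoring such a serialized pipeline outside Spark could look like, here is a minimal sketch in plain Java. None of these class or method names exist in Spark; the pipeline is modeled as a chain of shuffle-free, row-at-a-time transformers followed by a linear predictor whose coefficients are assumed to have been exported after training:

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: a Spark-free scoring pipeline as a chain of
// per-row feature transformers plus a final linear prediction.
class PipelineScorer {
    private final List<UnaryOperator<double[]>> transformers;
    private final double[] weights;   // coefficients exported after training
    private final double intercept;

    PipelineScorer(List<UnaryOperator<double[]>> transformers,
                   double[] weights, double intercept) {
        this.transformers = transformers;
        this.weights = weights;
        this.intercept = intercept;
    }

    double predict(double[] features) {
        double[] x = features;
        for (UnaryOperator<double[]> t : transformers) {
            x = t.apply(x);           // each stage is shuffle-free, row-at-a-time
        }
        double score = intercept;
        for (int i = 0; i < x.length; i++) {
            score += weights[i] * x[i];
        }
        return score;
    }
}
```

A standard-scaler stage, for example, would just be a `UnaryOperator` that subtracts the saved means and divides by the saved standard deviations; the hard cases are exactly the transformers that need a shuffle, which cannot be expressed as a per-row function like this.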

2) If users could use Spark's linear algebra code, such as its vector and 
matrix types, in their applications, that would be very useful. It would help 
share code between the Spark training pipeline and the production deployment. 
Also, a lot of the good stuff in Spark's MLlib doesn't depend on the Spark 
platform, and people could use it in their applications without pulling in lots 
of dependencies. In fact, in my own project I have had to copy and paste code 
from MLlib into my project to use those goodies in apps.
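To make the idea concrete, here is a tiny, hypothetical sketch of the kind of dependency-free basic type such a standalone linear-algebra module could provide (the class and field names are made up, not MLlib's):

```java
// Minimal sketch of a dependency-free sparse vector, the kind of basic
// type a standalone linear-algebra jar could offer without any Spark deps.
class SparseVec {
    final int size;
    final int[] indices;   // sorted, strictly increasing positions of non-zeros
    final double[] values; // values at those positions

    SparseVec(int size, int[] indices, double[] values) {
        this.size = size;
        this.indices = indices;
        this.values = values;
    }

    // Dot product against a dense vector, touching only the non-zeros.
    double dot(double[] dense) {
        double sum = 0.0;
        for (int k = 0; k < indices.length; k++) {
            sum += values[k] * dense[indices[k]];
        }
        return sum;
    }
}
```

Nothing here needs a SparkContext, which is exactly the point: code like this could live in a jar that both the training pipeline and a lightweight serving application depend on.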

3) Currently, MLlib depends on GraphX, which means there is no way to use 
MLlib's vector or matrix types inside GraphX. At Netflix, we implemented 
parallel personalized PageRank, which requires a sparse vector as part of its 
public API. We had to use Breeze here, since GraphX has no access to MLlib's 
basic types. Before we contribute it back to the open-source community, we need 
to address this.

Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Thu, Nov 12, 2015 at 3:42 AM, Sean Owen <so...@cloudera.com> wrote:
This is all starting to sound a lot like what's already implemented in 
Java-based PMML parsing/scoring libraries like JPMML and OpenScoring. I'm not 
sure it helps much to reimplement this in Spark.

On Thu, Nov 12, 2015 at 8:05 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
+1 on that. It would be useful to use the model outside of Spark.

_____________________________
From: DB Tsai <dbt...@dbtsai.com>
Sent: Wednesday, November 11, 2015 11:57 PM
Subject: Re: thought experiment: use spark ML to real time prediction
To: Nirmal Fernando <nir...@wso2.com>
Cc: Andy Davidson <a...@santacruzintegration.com>, Adrian Tanase 
<atan...@adobe.com>, user @spark <user@spark.apache.org>


Do you think it would be useful to separate those models and the model 
loader/writer code into a separate spark-ml-common jar, without any Spark 
platform dependencies, so that users can load models trained by Spark ML in 
their applications and run predictions?


Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Wed, Nov 11, 2015 at 3:14 AM, Nirmal Fernando <nir...@wso2.com> wrote:
As of now, we basically serialize the ML model and then deserialize it for 
real-time prediction.
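That serialize-then-deserialize approach can be sketched with plain Java serialization; the `LinearModel` class below is a made-up stand-in, not an MLlib type, and checked exceptions are wrapped to keep the sketch short:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch of serializing a trained model and deserializing it at scoring time.
class ModelRoundTrip {
    static class LinearModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double[] weights;
        final double intercept;
        LinearModel(double[] w, double b) { weights = w; intercept = b; }
        double predict(double[] x) {
            double s = intercept;
            for (int i = 0; i < x.length; i++) s += weights[i] * x[i];
            return s;
        }
    }

    static byte[] serialize(LinearModel m) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(m);
            out.flush();
            return bytes.toByteArray();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static LinearModel deserialize(byte[] data) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (LinearModel) in.readObject();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The catch with this approach is that the model class must implement Serializable and the exact same class must be on the classpath at scoring time, which is why the thread's proposal of a small Spark-free model jar matters.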

On Wed, Nov 11, 2015 at 4:39 PM, Adrian Tanase <atan...@adobe.com> wrote:
I don’t think this answers your question, but here’s how you would evaluate the 
model in real time in a streaming app:
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html

Maybe you can find a way to extract portions of MLlib and run them outside of 
Spark – loading the precomputed model and calling .predict on it…

-adrian

From: Andy Davidson
Date: Tuesday, November 10, 2015 at 11:31 PM
To: "user @spark"
Subject: thought experiment: use spark ML to real time prediction

Let’s say I have used Spark ML to train a linear model. I know I can save and 
load the model to disk, but I am not sure how I can use the model in a 
real-time environment. For example, I do not think I can easily return a 
“prediction” to the client using Spark Streaming. Also, for some applications 
the extra latency created by the batch process might not be acceptable.

If I were not using Spark, I would re-implement the model I trained in my batch 
environment in a language like Java, and implement a REST service that uses the 
model to create a prediction and return it to the client. Many models make 
predictions using linear algebra, and implementing prediction is relatively 
easy if you have a good vectorized LA package. Is there a way to use a model I 
trained with Spark ML outside of Spark?
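A REST scorer along those lines can be sketched with just the JDK’s built-in HTTP server, assuming the coefficients were exported from the trained model (the weights here are made-up placeholders, and the query format is invented for the example):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: a tiny REST scorer for a linear model, with no
// Spark dependency. Weights would come from the model trained in batch.
class ScoringService {
    static final double[] WEIGHTS = {0.5, -1.2};  // placeholder coefficients
    static final double INTERCEPT = 0.3;

    static double predict(double[] x) {
        double s = INTERCEPT;
        for (int i = 0; i < x.length; i++) s += WEIGHTS[i] * x[i];
        return s;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/predict", exchange -> {
            // Expect a comma-separated feature vector as the query string,
            // e.g. GET /predict?1.0,2.0
            String[] parts = exchange.getRequestURI().getQuery().split(",");
            double[] x = new double[parts.length];
            for (int i = 0; i < parts.length; i++) {
                x[i] = Double.parseDouble(parts[i]);
            }
            byte[] body = Double.toString(predict(x))
                                .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }
}
```

The scoring path here is exactly the dot product plus intercept described above; the open question in the thread is how to get the weights (and any feature transformations) out of Spark without re-implementing them by hand like this.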

As a motivating example: even if it is possible to return data to the client 
using Spark Streaming, I think the mini-batch latency would not be acceptable 
for a high-frequency stock trading system.

Kind regards

Andy

P.S. The examples I have seen so far use Spark Streaming to “preprocess” 
predictions. For example, a recommender system might use what current users are 
watching to calculate “trending recommendations”. These are stored on disk and 
served up to users when they use the “movie guide”. If a recommendation is a 
couple of minutes old, it does not affect the end user’s experience.



--

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/






