I am glad to see DB’s comments; they make me feel I am not the only one facing these issues. If we were able to use MLlib to load the model in web applications (outside the Spark cluster), that would solve the issue. I understand Spark is mainly for processing big data in a distributed mode, but there is little point in training a model using MLlib if we cannot use it in the applications that need to access the model.
Thanks,
Viju

From: DB Tsai [mailto:dbt...@dbtsai.com]
Sent: Thursday, November 12, 2015 11:04 AM
To: Sean Owen
Cc: Felix Cheung; Nirmal Fernando; Andy Davidson; Adrian Tanase; user @spark; Xiangrui Meng; hol...@pigscanfly.ca
Subject: Re: thought experiment: use spark ML to real time prediction

I think the use case can be quite different from PMML's. Having a Spark-platform-independent ML jar would empower users to do the following:

1) PMML doesn't cover all the models we have in MLlib, and for an ML pipeline trained by Spark, PMML is most of the time not expressive enough to describe all the transformations we have in Spark ML. As a result, if we were able to serialize the entire Spark ML pipeline after training and then load it back in an application, without any Spark platform, for production scoring, this would be very useful for production deployment of Spark ML models. The only issue is transformers that involve a shuffle; we need to figure out a way to handle those. When I chatted with Xiangrui about this, he suggested that we might tag whether a transformer is shuffle-ready. Currently, at Netflix, we are not able to use ML pipelines because of these issues, and we have to write our own scorers for production, which is duplicated work.

2) If users could use Spark's linear algebra code, such as vectors and matrices, in their applications, that would be very useful too: it would let them share code between the Spark training pipeline and production deployment. Also, a lot of the good stuff in Spark's MLlib doesn't depend on the Spark platform, and people could use it in their applications without pulling in lots of dependencies. In fact, in my project I have had to copy and paste code from MLlib into my project to use those goodies in apps.

3) Currently, MLlib depends on GraphX, which means there is no way to use MLlib's vectors or matrices inside GraphX. At Netflix, we implemented parallel personalized PageRank, which requires a sparse vector as part of its public API.
We have to use Breeze there, since we have no access to MLlib's basic types in GraphX. Before we contribute it back to the open source community, we need to address this.

Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Thu, Nov 12, 2015 at 3:42 AM, Sean Owen <so...@cloudera.com> wrote:

This is all starting to sound a lot like what's already implemented in Java-based PMML parsing/scoring libraries like JPMML and OpenScoring. I'm not clear it helps a lot to reimplement this in Spark.

On Thu, Nov 12, 2015 at 8:05 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:

+1 on that. It would be useful to use the model outside of Spark.

_____________________________
From: DB Tsai <dbt...@dbtsai.com>
Sent: Wednesday, November 11, 2015 11:57 PM
Subject: Re: thought experiment: use spark ML to real time prediction
To: Nirmal Fernando <nir...@wso2.com>
Cc: Andy Davidson <a...@santacruzintegration.com>, Adrian Tanase <atan...@adobe.com>, user @spark <user@spark.apache.org>

Do you think it would be useful to separate those models and the model loader/writer code into a separate spark-ml-common jar, with no Spark platform dependencies, so users can load models trained by Spark ML in their applications and run predictions?

Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Wed, Nov 11, 2015 at 3:14 AM, Nirmal Fernando <nir...@wso2.com> wrote:

As of now, we basically serialize the ML model and then deserialize it for prediction in real time.
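The serialize-then-deserialize approach above can be sketched without any Spark dependency once a fitted model is reduced to its parameters. The snippet below is a minimal illustration in Python with made-up coefficients, not Spark's actual persistence format: a hypothetical logistic-regression-style model is exported as JSON at training time and reloaded for scoring inside a lightweight application.

```python
import json
import math

# Hypothetical coefficients, standing in for a model fitted by Spark ML.
trained = {"weights": [0.8, -1.2, 0.5], "intercept": 0.1}

# "Export" at the end of the training job: plain JSON, readable from any runtime.
blob = json.dumps(trained)

# "Import" inside the serving application -- no Spark on the classpath.
model = json.loads(blob)

def predict(features):
    """Score one example: sigmoid of (weights . features + intercept)."""
    margin = sum(w * x for w, x in zip(model["weights"], features)) + model["intercept"]
    return 1.0 / (1.0 + math.exp(-margin))

print(round(predict([1.0, 0.5, 2.0]), 4))  # 0.7858
```

Anything richer than a single linear model (a full pipeline with transformers) is exactly where this manual approach breaks down, which is the gap the thread is pointing at.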
On Wed, Nov 11, 2015 at 4:39 PM, Adrian Tanase <atan...@adobe.com> wrote:

I don’t think this answers your question, but here’s how you would evaluate the model in real time in a streaming app:
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html

Maybe you can find a way to extract portions of MLlib and run them outside of Spark, loading the precomputed model and calling .predict on it…

-adrian

From: Andy Davidson
Date: Tuesday, November 10, 2015 at 11:31 PM
To: "user @spark"
Subject: thought experiment: use spark ML to real time prediction

Let’s say I have used Spark ML to train a linear model. I know I can save and load the model to disk, but I am not sure how I can use the model in a real-time environment. For example, I do not think I can easily return a “prediction” to the client using Spark Streaming. Also, for some applications the extra latency created by the batch process might not be acceptable.

If I were not using Spark, I would re-implement the model I trained in my batch environment in a language like Java and implement a REST service that uses the model to create a prediction and return it to the client. Many models make predictions using linear algebra; implementing prediction is relatively easy if you have a good vectorized LA package.

Is there a way to use a model I trained using Spark ML outside of Spark? As a motivating example: even if it is possible to return data to the client using Spark Streaming, I think the mini-batch latency would not be acceptable for a high-frequency stock trading system.

Kind regards,

Andy

P.S. The examples I have seen so far use Spark Streaming to “preprocess” predictions. For example, a recommender system might use what current users are watching to calculate “trending recommendations”. These are stored on disk and served up to users when they use the “movie guide”. If a recommendation was a couple of minutes
old, it would not affect the end user’s experience.

--
Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
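Andy’s observation that “implementing predictions is relatively easy if you have a good vectorized LA package” comes down to a dot product for linear models, which also touches DB’s point about sparse vectors. As an illustrative stand-in (this is not MLlib’s actual SparseVector API), a minimal sparse vector and the scoring step it enables might look like:

```python
class SparseVector:
    """Minimal MLlib-style sparse vector: parallel arrays of indices and values."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

    def dot(self, dense):
        # Only the stored (non-zero) entries contribute to the product.
        return sum(v * dense[i] for i, v in zip(self.indices, self.values))

# A feature vector with 5 slots, of which only positions 0 and 3 are non-zero.
features = SparseVector(5, [0, 3], [2.0, -1.0])
weights = [0.5, 0.1, 0.1, 2.0, 0.3]
print(features.dot(weights))  # 2.0*0.5 + (-1.0)*2.0 = -1.0
```

A scorer built this way has no framework dependency at all, which is why a standalone spark-ml-common jar exposing Spark’s own vector types, as proposed in the thread, would let applications run the same code that training used.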