Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Nick Pentreath
Currently there is no direct way in Spark to serve models without bringing in all of Spark as a dependency. For Spark ML, there is actually no way to do it independently of DataFrames either (which for single-instance prediction makes things sub-optimal). That is covered here: https://issues.apach

Who controls 'databricks-jenkins'?

2016-08-11 Thread Sean Owen
Not a big deal but 'he' is commenting on a lot of ancient PRs for some reason, like https://github.com/apache/spark/pull/51 and it generates mails to the list. I assume this is a misconfiguration somewhere.

Re: Sorting within partitions is not maintained in parquet?

2016-08-11 Thread Hyukjin Kwon
I just took a quick look at this. It does not seem to be a Parquet-specific problem; it affects any data source implementing FileFormat. In 1.6, partitions apparently are made per file, but in 2.0 a partition can hold multiple files. So, in your case there are multiple files but fewer partitions, meaning each p

Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Nicholas Chammas
Thanks Michael for the reference, and thanks Nick for the comprehensive overview of existing JIRA discussions about this. I've added myself as a watcher on the various tasks. On Thu, Aug 11, 2016 at 3:02 AM Nick Pentreath wrote: > Currently there is no direct way in Spark to serve models without

Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Chris Fregly
this is exactly what my http://pipeline.io project is addressing. check it out and send me feedback or create issues at that github location. > On Aug 11, 2016, at 7:42 AM, Nicholas Chammas > wrote: > > Thanks Michael for the reference, and thanks Nick for the comprehensive > overview of exi

Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Chris Fregly
And here's a recent slide deck on pipeline.io that summarizes what we're working on (all open source): https://www.slideshare.net/mobile/cfregly/advanced-spark-and-tensorflow-meetup-08042016-one-click-spark-ml-pipeline-deploy-to-production MLeap is heading the wrong direction and reinventi

Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Nicholas Chammas
Thanks for the additional reference Chris. Sounds like there are a few independent projects addressing this story. On Thu, Aug 11, 2016 at 12:42 PM Chris Fregly wrote: > And here's a recent slide deck on the pipeline.io that summarizes what > we're working on (all open source): > > > https://www

Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Michael Allman
Hi Chris, I was just checking out your project. I mentioned we use MLeap to serve predictions from a trained Spark ML RandomForest model. How would I do that with pipeline.io? It isn't clear to me. Thanks! Michael > On Aug 11, 2016, at 9:42 AM, Chris Fregly wrote: > >

Re: Sorting within partitions is not maintained in parquet?

2016-08-11 Thread Michael Armbrust
This is an optimization to avoid overloading the scheduler with many small tasks. It bin-packs data into tasks based on the file size. You can disable it by setting spark.sql.files.openCostInBytes very high (higher than spark.sql.files.maxPartitionBytes). On Thu, Aug 11, 2016 at 4:27 AM, Hyukjin
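
Editor's illustration of the bin-packing described above. This is a minimal sketch in plain Python, not Spark's actual code: the function name `pack_files` and the example file sizes are hypothetical, and the model assumes whole files are packed largest-first, with each file charged its size plus the open cost. The default values of 128 MB and 4 MB mirror the documented defaults of spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes. Under those assumptions it shows why an open cost larger than the maximum partition size leaves one file per partition.

```python
MB = 1024 * 1024


def pack_files(file_sizes, max_partition_bytes=128 * MB, open_cost_in_bytes=4 * MB):
    """Simplified model of packing files into read partitions.

    Each file added to a partition is charged its size plus a fixed
    open cost. A partition is closed when the next file would push the
    running total past max_partition_bytes. (Hypothetical helper; real
    Spark also splits large files and considers available parallelism.)
    """
    partitions = []
    current = []
    current_size = 0
    # Pack the largest files first.
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > max_partition_bytes:
            partitions.append(current)
            current = []
            current_size = 0
        current.append(size)
        current_size += size + open_cost_in_bytes
    if current:
        partitions.append(current)
    return partitions


files = [10 * MB] * 4  # four small files, sizes made up for illustration

# Default open cost: all four files fit into a single partition.
print(len(pack_files(files)))

# Open cost above the max partition size: every file gets its own partition.
print(len(pack_files(files, open_cost_in_bytes=256 * MB)))
```

In a real job the equivalent change would be made through Spark configuration, for example by passing the two settings with `--conf` to spark-submit or setting them on the session's runtime config before the read; the values to use depend on the data and are not given in the thread.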