Along the lines of #1: Spark Packages seemed to get off to a good start about two years ago, but today no more than a handful are in general use (e.g. the Databricks CSV package). Browsing the available packages, the majority are incomplete, empty, unmaintained, or unclear.
Any ideas on how to resurrect Spark Packages in a way that there will be sufficient adoption for them to be meaningful?

2017-01-23 17:03 GMT-08:00 Joseph Bradley <jos...@databricks.com>:

> This thread is split off from the "Feedback on MLlib roadmap process
> proposal" thread for discussing the high-level mission and goals for
> MLlib. I hope this thread will collect feedback and ideas, not necessarily
> lead to huge decisions.
>
> Copying from the previous thread:
>
> *Seth:*
> """
> I would love to hear some discussion on the higher-level goal of Spark
> MLlib (if this derails the original discussion, please let me know and we
> can discuss it in another thread). The roadmap does contain specific items
> that help to convey some of this (ML parity with MLlib, model persistence,
> etc.), but I'm interested in what the "mission" of Spark MLlib is. We
> often see PRs for brand-new algorithms which are sometimes rejected and
> sometimes not. Do we aim to keep implementing more and more algorithms? Or
> is our focus really, now that we have a reasonable library of algorithms,
> to simply make the existing ones faster/better/more robust? Should we aim
> to make interfaces that are easily extended so developers can implement
> their own custom code (e.g. custom optimization libraries), or do we want
> to restrict things to out-of-the-box algorithms? Should we focus on more
> flexible, general abstractions like distributed linear algebra?
>
> I was not involved in the project in the early days of MLlib when this
> discussion may have happened, but I think it would be useful to either
> revisit it or restate it here for some of the newer developers.
> """
>
> *Mingjie:*
> """
> +1 general abstractions like distributed linear algebra.
> """
>
> I'll add my thoughts, starting with our past *trajectory*:
> * Initially, MLlib was mainly trying to build a set of core algorithms.
> * Two years ago, the big effort was adding Pipelines.
> * In the last year, big efforts have been around completing Pipelines and
> making the library more robust.
>
> I agree with Seth that a few *immediate goals* are very clear:
> * feature parity for the DataFrame-based API
> * completing and improving testing for model persistence
> * Python and R parity
>
> *In the future*, it's harder to say, but if I had to pick my top 2 items,
> I'd list:
>
> *(1) Making MLlib more extensible*
> It will not be feasible to support a huge number of algorithms, so
> allowing users to customize their ML-on-Spark workflows will be critical.
> This is IMO the most important thing we could do for MLlib.
> Part of this could be building a healthy community of Spark Packages, and
> we will need to make it easier for users to write their own algorithms and
> packages to facilitate this. Part of this could be allowing users to
> customize existing algorithms with custom loss functions, etc.
>
> *(2) Consistent improvements to core algorithms*
> A less exciting but still very important item will be constantly improving
> the core set of algorithms in MLlib. This could mean speed, scaling,
> robustness, and usability for the few algorithms which cover 90% of use
> cases.
>
> There are plenty of other possibilities, and it will be great to hear the
> community's thoughts!
>
> Thanks,
> Joseph
>
> --
> Joseph Bradley
> Software Engineer - Machine Learning
> Databricks, Inc.
> http://databricks.com
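
To make the extensibility point in (1) a bit more concrete, here is a minimal sketch of what a user-defined Pipeline stage looks like today when built purely on the public Pipelines API. This is only an illustration; the class, UID, and column names below are made up, not anything in MLlib or an existing package.

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.{DataFrame, Dataset}
    import org.apache.spark.sql.functions.lower
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Hypothetical example: a Transformer that lowercases a "text" column
    // into a new "textLower" column, composable with any other Pipeline stage.
    class LowercaseTransformer(override val uid: String) extends Transformer {

      def this() = this(Identifiable.randomUID("lowercase"))

      // The actual transformation, expressed over the input Dataset.
      override def transform(dataset: Dataset[_]): DataFrame =
        dataset.withColumn("textLower", lower(dataset("text")))

      // Declare the output schema so the stage validates inside a Pipeline.
      override def transformSchema(schema: StructType): StructType =
        StructType(schema.fields :+
          StructField("textLower", StringType, nullable = true))

      override def copy(extra: ParamMap): LowercaseTransformer = defaultCopy(extra)
    }

Writing a Transformer like this is already reasonable; the friction tends to show up around params, persistence, and the pieces that are still private[ml], which is where better extension points (and healthier Spark Packages) would help.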