In reading through this and thinking about usability: is there any interest in 
building a performance measurement framework around some (or maybe all) of the 
ML/MLlib algorithms? I envision this as something that could be run for each 
release build for our end users. It may also be useful for internal ML devs to 
see what impact each change to their code has on performance. Please pardon me 
if this already exists; I am new to the codebase and to contributing to Spark.
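
To make that concrete, here is a rough sketch of the kind of harness I have in 
mind (everything here is illustrative: the object name, the dataset path, and 
the choice of LogisticRegression are placeholders; a real framework would sweep 
many algorithms and persist timings across release builds):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

// Hypothetical per-release timing harness; a real framework would cover
// many algorithms, standardize inputs, and record results per build.
object MLlibBench {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mllib-bench").getOrCreate()
    // Placeholder dataset that ships with the Spark source tree.
    val data = spark.read.format("libsvm")
      .load("data/mllib/sample_libsvm_data.txt")

    val lr = new LogisticRegression().setMaxIter(100)
    val start = System.nanoTime()
    lr.fit(data)
    val seconds = (System.nanoTime() - start) / 1e9
    println(f"LogisticRegression.fit: $seconds%.2f s")
    spark.stop()
  }
}

Diffing these numbers between release candidates could flag performance 
regressions before they reach end users.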


________________________________
From: Asher Krim <ak...@hubspot.com>
Sent: Tuesday, January 24, 2017 12:17 PM
To: Miao Wang
Cc: java...@gmail.com; dev@spark.apache.org; Sean Owen
Subject: Re: MLlib mission and goals

On the topic of usability, I think more effort should be put into large-scale 
testing. We've encountered issues with building large models that are not 
apparent in small models, and these issues have made productizing ML/MLLIB much 
more difficult than we first anticipated. Considering that one of the biggest 
selling points for Spark is ease of scaling to large datasets, I think fleshing 
out SPARK-15573 and testing large models should be a priority.
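
To illustrate the failure mode, here is a hedged sketch (for spark-shell; all 
sizes are made up for illustration) of the kind of large-model test I mean: a 
synthetic corpus with a vocabulary in the millions yields a Word2Vec model on 
the order of gigabytes, exercising driver memory and serialization paths that 
toy datasets never touch.

import scala.util.Random
import org.apache.spark.ml.feature.Word2Vec
import spark.implicits._

// Synthetic corpus: 1M documents of 100 tokens drawn from a 5M-token
// vocabulary. Sizes are illustrative, not a benchmark spec.
val vocabSize = 5000000
val docs = spark.range(0, 1000000L)
  .map(_ => Tuple1(Seq.fill(100)("tok" + Random.nextInt(vocabSize))))
  .toDF("text")

// ~5M vocab x 100 dims x 4 bytes is roughly 2 GB of vectors, which
// stresses code paths that never show up with small models.
val model = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("vec")
  .setVectorSize(100)
  .fit(docs)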

On Tue, Jan 24, 2017 at 2:23 PM, Miao Wang <wangm...@us.ibm.com> wrote:
I have been working on ML/MLLIB/R since last year. Here are some of my thoughts 
from a beginner's perspective:

The current ML/MLLIB core algorithms can serve as good implementation examples, 
which makes adding new algorithms easier. Even a beginner like me can pick them 
up quickly and learn how to add new algorithms. So adding new algorithms should 
not be a barrier for developers who really need specific algorithms, and it 
should not be the first priority in ML/MLLIB's long-term goals. We should only 
add highly demanded algorithms, and I hope there will be detailed JIRA/email 
discussions to decide whether we want to accept a new algorithm.

I strongly agree that we should improve ML/MLLIB usability, stability, and 
performance in the core algorithms and in foundations such as the linear 
algebra library. This will keep Spark ML/MLLIB competitive among machine 
learning frameworks. For example, Microsoft just open-sourced a fast, 
distributed, high-performance gradient boosting (GBDT, GBRT, GBM, or MART) 
framework based on decision tree algorithms, whose performance and accuracy 
are much better than XGBoost's. We need to follow up and improve Spark's GBT 
algorithms in the near future.

Another related area is SparkR. API parity between SparkR and ML/MLLIB is 
important, and we should also pay attention to R users' habits and experience 
when maintaining it.

Miao

----- Original message -----
From: Stephen Boesch <java...@gmail.com>
To: Sean Owen <so...@cloudera.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Re: MLlib mission and goals
Date: Tue, Jan 24, 2017 4:42 AM

re: spark-packages.org and "Would these really be better in the core 
project?"   That was not at all the intent of my input; rather, it was to ask 
"how and where do we structure/place deployment-quality code that is *not* 
part of the distribution?"   Spark Packages has no curation whatsoever: no 
minimum standards of code quality or deployment structure, let alone 
qualitative measures of usefulness.

While Spark Packages would never rival CRAN and friends, there is not even a 
mechanism in place to get started.  From the CRAN site:

   Even at the current growth rate of several packages a day, all submissions 
are still rigorously quality-controlled using strong testing features available 
in the R system.

Maybe give something that has a subset of these processes a try?  Perhaps with 
different folks than the ones already over-subscribed in MLlib?

2017-01-24 2:37 GMT-08:00 Sean Owen <so...@cloudera.com>:
My $0.02, which shouldn't be weighted too much.

I believe the mission, as of Spark ML, has been to provide the framework, and 
then implementations of 'the basics' only. It should have the tools that cover 
~80% of use cases, out of the box, in a pretty well-supported and tested way.

It's not a goal to support an arbitrarily large collection of algorithms, 
because each one adds marginally less value and, IMHO, proportionally more 
baggage: the contributors tend to skew academic, produce worse code, and don't 
stick around to maintain it.

The project is already generally quite overloaded; I don't know if there's 
bandwidth to even cover the current scope. While 'the basics' is a subjective 
label, de facto, I think we'd have to define it as essentially "what we already 
have in place" for the foreseeable future.

That the bits on spark-packages.org aren't so hot is not a problem but a 
symptom. Would these really be better in the core project?

And, or: I entirely agree with Joseph's take.

On Tue, Jan 24, 2017 at 1:03 AM Joseph Bradley <jos...@databricks.com> wrote:
This thread is split off from the "Feedback on MLlib roadmap process proposal" 
thread for discussing the high-level mission and goals for MLlib.  I hope this 
thread will collect feedback and ideas, not necessarily lead to huge decisions.

Copying from the previous thread:

Seth:
"""
I would love to hear some discussion on the higher level goal of Spark MLlib 
(if this derails the original discussion, please let me know and we can discuss 
in another thread). The roadmap does contain specific items that help to convey 
some of this (ML parity with MLlib, model persistence, etc...), but I'm 
interested in what the "mission" of Spark MLlib is. We often see PRs for brand 
new algorithms which are sometimes rejected and sometimes not. Do we aim to 
keep implementing more and more algorithms? Or is our focus really, now that we 
have a reasonable library of algorithms, to simply make the existing ones 
faster/better/more robust? Should we aim to make interfaces that are easily 
extended for developers to easily implement their own custom code (e.g. custom 
optimization libraries), or do we want to restrict things to out-of-the-box 
algorithms? Should we focus on more flexible, general abstractions like 
distributed linear algebra?

I was not involved in the project in the early days of MLlib when this 
discussion may have happened, but I think it would be useful to either revisit 
it or restate it here for some of the newer developers.
"""

Mingjie:
"""
+1 general abstractions like distributed linear algebra.
"""


I'll add my thoughts, starting with our past trajectory:
* Initially, MLlib was mainly trying to build a set of core algorithms.
* Two years ago, the big effort was adding Pipelines.
* In the last year, big efforts have been around completing Pipelines and 
making the library more robust.

I agree with Seth that a few immediate goals are very clear:
* feature parity for DataFrame-based API
* completing and improving testing for model persistence
* Python, R parity

In the future, it's harder to say, but if I had to pick my top 2 items, I'd 
list:

(1) Making MLlib more extensible
It will not be feasible to support a huge number of algorithms, so allowing 
users to customize their ML-on-Spark workflows will be critical.  This is IMO 
the most important thing we could do for MLlib.
Part of this could be building a healthy community of Spark Packages, and we 
will need to make it easier for users to write their own algorithms and 
packages to facilitate this.  Part of this could be allowing users to customize 
existing algorithms with custom loss functions, etc.
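
To ground the extensibility point: the existing developer API already lets a 
user-defined Pipeline stage fill in just a couple of extension points. Here is 
a minimal sketch using UnaryTransformer (the class name is made up for 
illustration):

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

// Hypothetical user-defined stage: a whitespace tokenizer built on the
// UnaryTransformer developer API. It plugs into a Pipeline like any
// built-in stage.
class SimpleTokenizer(override val uid: String)
    extends UnaryTransformer[String, Seq[String], SimpleTokenizer] {

  def this() = this(Identifiable.randomUID("simpleTok"))

  // Per-row transformation from the input column to the output column.
  override protected def createTransformFunc: String => Seq[String] =
    _.toLowerCase.split("\\s+").toSeq

  // Output column type, used for Pipeline schema validation.
  override protected def outputDataType: DataType =
    ArrayType(StringType, containsNull = false)
}

// Usage: new SimpleTokenizer().setInputCol("text").setOutputCol("words")

Making extension points like this better documented and easier to discover 
would lower the bar for algorithms to live outside the core.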

(2) Consistent improvements to core algorithms
A less exciting but still very important item will be constantly improving the 
core set of algorithms in MLlib. This could mean speed, scaling, robustness, 
and usability for the few algorithms which cover 90% of use cases.

There are plenty of other possibilities, and it will be great to hear the 
community's thoughts!

Thanks,
Joseph



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com/



---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



--
Asher Krim
Senior Software Engineer