RE: Feedback on MLlib roadmap process proposal

Ilya Matiach Tue, 24 Jan 2017 07:59:37 -0800

Just a few questions with regard to the MLlib process:


  1.  Is there a list of committers who are (or can be) shepherds, along with 
the code they own?  I’ve seen this page: http://spark.apache.org/committers.html 
but I’m not sure if it is up to date, and it doesn’t mention what code the 
committers own.  It would be useful to know who owns ML or MLlib.  From my 
limited personal experience this seems to be Joseph K. Bradley, Yanbo Liang and 
Sean Owen.
  2.  Based on both user votes and watchers, the top issue currently is 
“SPARK-5575: Artificial neural networks for MLlib deep learning”.  However, it 
looks like it has been open for almost 2 years and not much progress is 
being made.  There seem to be other top issues that aren’t getting addressed 
either on these pages mentioned in the roadmap: MLlib, sorted by: Votes 
<https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC>
 or Watchers 
<https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC>
 .  Is my perception incorrect, or is there a very good reason for not 
addressing the top issues voted for by the community?  If there is a good 
reason, is there a way to filter such JIRAs out from the sorted lists, to know 
which JIRAs really should be taken/worked on?
  3.  Also, this might be a newbie question, but for new contributors to Spark, 
is there a process to convince a committer to be assigned to a JIRA that we are 
working on?  It would be useful if there were a clear threshold at which a 
committer can decline a JIRA ahead of time, so contributors won’t waste time 
working on issues that aren’t important to Spark and can focus on making 
progress on the issues that the Spark committers would like us to fix.

Thank you, Ilya

From: Joseph Bradley [mailto:[email protected]]
Sent: Monday, January 23, 2017 8:04 PM
To: Felix Cheung <[email protected]>
Cc: Mingjie Tang <[email protected]>; Seth Hendrickson 
<[email protected]>; [email protected]
Subject: Re: Feedback on MLlib roadmap process proposal

Hi Seth,

The proposal is geared towards exactly the issue you're describing: providing 
more visibility into the capacity and intentions of committers.  If there are 
things you'd add to it or change to improve further, it would be great to hear 
ideas!  The past roadmap JIRA has some more background discussion which is 
worth looking at too.

Let's break off the MLlib mission discussion into another thread.  I'll start 
one now.

Thanks,
Joseph

On Thu, Jan 19, 2017 at 1:51 PM, Felix Cheung 
<[email protected]> wrote:
Hi Seth

Re: "The most important thing we can do, given that MLlib currently has a very 
limited committer review bandwidth, is to make clear issues that, if worked on, 
will definitely get reviewed."

We are adopting a Shepherd model, as described in Joseph's JIRA: once 
assigned, the Shepherd sees the issue through with the contributor to make 
sure it lands in the target release.

I'm sure Joseph can explain it better than I can ;)

_____________________________
From: Mingjie Tang <[email protected]>
Sent: Thursday, January 19, 2017 10:30 AM
Subject: Re: Feedback on MLlib roadmap process proposal
To: Seth Hendrickson 
<[email protected]>
Cc: Joseph Bradley <[email protected]>, 
<[email protected]>


+1 for general abstractions like distributed linear algebra.

On Thu, Jan 19, 2017 at 8:54 AM, Seth Hendrickson 
<[email protected]> wrote:
I think the proposal laid out in SPARK-18813 is well done, and I do think it is 
going to improve the process going forward. I also really like the idea of 
getting the community to vote on JIRAs to give some of them priority - provided 
that we listen to those votes, of course. The biggest problem I see is that we 
do have several active contributors and those who want to help implement these 
changes, but PRs are reviewed rather sporadically and I imagine it is very 
difficult for contributors to understand why some get reviewed and some do not. 
The most important thing we can do, given that MLlib currently has a very 
limited committer review bandwidth, is to make clear issues that, if worked on, 
will definitely get reviewed. A hard thing to do in open source, no doubt, but 
even if we have to limit the scope of such issues to a very small subset, it's 
a gain for all I think.

On a related note, I would love to hear some discussion on the higher level 
goal of Spark MLlib (if this derails the original discussion, please let me 
know and we can discuss in another thread). The roadmap does contain specific 
items that help to convey some of this (ML parity with MLlib, model 
persistence, etc.), but I'm interested in what the "mission" of Spark MLlib 
is. We often see PRs for brand new algorithms which are sometimes rejected and 
sometimes not. Do we aim to keep implementing more and more algorithms? Or is 
our focus really, now that we have a reasonable library of algorithms, to 
simply make the existing ones faster/better/more robust? Should we aim for 
interfaces that developers can easily extend to implement their own custom 
code (e.g. custom optimization libraries), or do we want to restrict things to 
out-of-the-box algorithms? Should we focus on more flexible, general 
abstractions like distributed linear algebra?

I was not involved in the project in the early days of MLlib when this 
discussion may have happened, but I think it would be useful to either revisit 
it or restate it here for some of the newer developers.

On Tue, Jan 17, 2017 at 3:38 PM, Joseph Bradley 
<[email protected]> wrote:
Hi all,

This is a general call for thoughts about the process for the MLlib roadmap 
proposed in SPARK-18813.  See the section called "Roadmap process."

Summary:
* This process is about committers indicating intention to shepherd and review.
* The goal is to improve visibility and communication.
* This is fairly orthogonal to the SIP discussion since this proposal is more 
about setting release targets than about proposing future plans.

Thanks!
Joseph

--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com
