FYI I've started going through a few of the top Watched JIRAs and tried to identify those that are obviously stale and can probably be closed, to try to clean things up a bit.
On Thu, 23 Feb 2017 at 21:38 Tim Hunter <timhun...@databricks.com> wrote:

> As Sean wrote very nicely above, the changes made to Spark are decided in an organic fashion based on the interests and motivations of the committers and contributors. The case of deep learning is a good example. There is a lot of interest, and the core algorithms could be implemented without too much trouble in a few thousand lines of Scala code. However, such a simple implementation would be one to two orders of magnitude slower than what you would get from the popular frameworks out there.
>
> At this point, there are probably more man-hours invested in TensorFlow (as an example) than in MLlib, so I think we need to be realistic about what we can expect to achieve inside Spark. Unlike BLAS for linear algebra, there is no agreed-upon interface for deep learning, and each of the XOnSpark flavors explores a slightly different design. It will be interesting to see what works well in practice. In the meantime, though, there are plenty of things that we could do to help developers of other libraries to have a great experience with Spark. Matei alluded to that in his Spark Summit keynote when he mentioned better integration with low-level libraries.
>
> Tim
>
> On Thu, Feb 23, 2017 at 5:32 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>
> Sorry for being late to the discussion. I think Joseph, Sean and others have covered the issues well.
>
> Overall I like the proposed cleaned up roadmap & process (thanks Joseph!). As for the actual critical roadmap items mentioned on SPARK-18813, I think it makes sense and will comment a bit further on that JIRA.
>
> I would like to encourage votes & watching for issues to give a sense of what the community wants (I guess Vote is more explicit yet passive, while actually Watching an issue is more informative as it may indicate a real use case dependent on the issue?!).
>
> I think if used well this is valuable information for contributors. Of course not everything on that list can get done. But if I look through the top votes or watch list, while not all of those are likely to go in, a great many of the issues are fairly non-contentious in terms of being good additions to the project.
>
> Things like these are good examples IMO (I just sample a few of them, not exhaustive):
> - sample weights for RF / DT
> - multi-model and/or parallel model selection
> - make sharedParams public?
> - multi-column support for various transformers
> - incremental model training
> - tree algorithm enhancements
>
> Now, whether these can be prioritised in terms of bandwidth available to reviewers and committers is a totally different thing. But as Sean mentions there is some process there for trying to find the balance of the issue being a "good thing to add", a shepherd with bandwidth & interest in the issue to review, and the maintenance burden imposed.
>
> Let's take Deep Learning / NN for example. Here's a good example of something that has a lot of votes/watchers and as Sean mentions it is something that "everyone wants someone else to implement". In this case, much of the interest may in fact be "stale" - 2 years ago it would have been very interesting to have a strong DL impl in Spark. Now, because there is a plethora of very good DL libraries out there, how many of those Votes would be "deleted"? Granted, few are well integrated with Spark, but that can change and is changing (DL4J, BigDL, the "XonSpark" flavours etc).
>
> So this is something that I dare say will not be in Spark any time in the foreseeable future or perhaps ever given the current status. Perhaps it's worth seriously thinking about just closing these kinds of issues?
>
> On Fri, 27 Jan 2017 at 05:53 Joseph Bradley <jos...@databricks.com> wrote:
>
> Sean has given a great explanation.
> A few more comments:
>
> Roadmap: I have been creating roadmap JIRAs, but the goal really is to have all committers working on MLlib help to set that roadmap, based on either their knowledge of current maintenance/internal needs of the project or the feedback given from the rest of the community.
> @Committers - I see people actively shepherding PRs for MLlib, but I don't see many major initiatives linked to the roadmap. If there are ones large enough to merit adding to the roadmap, please do.
>
> In general, there are many process improvements we could make. A few in my mind are:
> * Visibility: Let the community know what committers are focusing on. This was the primary purpose of the "MLlib roadmap proposal."
> * Community initiatives: This is currently very organic. Some of the organic process could be improved, such as encouraging Votes/Watchers (though I agree with Sean about these being one-sided metrics). Cody's SIP work is a great step towards adding more clarity and structure for major initiatives.
> * JIRA hygiene: Always a challenge, and always requires some manual prodding. But it's great to push for efforts on this.
>
> On Wed, Jan 25, 2017 at 3:59 AM, Sean Owen <so...@cloudera.com> wrote:
>
> On Wed, Jan 25, 2017 at 6:01 AM Ilya Matiach <il...@microsoft.com> wrote:
>
> My confusion was that the ML 2.2 roadmap critical features (https://issues.apache.org/jira/browse/SPARK-18813) did not line up with the top ML/MLLIB JIRAs by Votes <https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fissues%2F%3Fjql%3Dproject%2520%253D%2520SPARK%2520AND%2520status%2520in%2520(Open%252C%2520%2522In%2520Progress%2522%252C%2520Reopened)%2520AND%2520component%2520in%2520(ML%252C%2520MLlib)%2520ORDER%2520BY%2520votes%2520DESC&data=02%7C01%7Cilmat%40microsoft.com%7C180d196083534d9eee6b08d444754fae%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636208718015178106&sdata=%2FtFB0LY%2BIxLoEf%2FPr1i1%2FgvrjlpXPuYLSLbpnd89Tkg%3D&reserved=0> or Watchers <https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fissues%2F%3Fjql%3Dproject%2520%253D%2520SPARK%2520AND%2520status%2520in%2520(Open%252C%2520%2522In%2520Progress%2522%252C%2520Reopened)%2520AND%2520component%2520in%2520(ML%252C%2520MLlib)%2520ORDER%2520BY%2520Watchers%2520DESC&data=02%7C01%7Cilmat%40microsoft.com%7C180d196083534d9eee6b08d444754fae%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636208718015178106&sdata=XkPfFiB2T%2FoVnJcdr3jf12dQjes7w%2BVJMrbhgx3ELRs%3D&reserved=0>.
>
> Your explanation that they do not have to and there is a more complex process to choosing the changes that will make it into the next release makes sense to me.
>
> For Spark ML, Joseph is the de facto leader and does publish a tentative roadmap. (We could also use JIRA mechanisms for this but any scheme is better than none.) Yes, not based on Votes -- nothing here is. Votes are a noisy signal because they usually measure: what would you like done if you didn't have to do it and there were no downsides for you?
>
> My only humble recommendation would be to clean up the top JIRAs by closing the ones which have spark packages for them (eg the NN one which already has several packages as you explained), noting or somehow marking on some that they will not be resolved, and changing the component on the ones not related to ML/MLLIB (eg https://issues.apache.org/jira/browse/SPARK-12965).
>
> We do that. It occasionally generates protests, so I find myself erring on the side of ignoring. You can comment on any JIRA you think should be closed. That's helpful.
>
> That particular JIRA seems potentially legitimate. I wouldn't close it. It also won't get fixed until someone proposes a resolution. I'd strongly encourage people saying "I have this problem too" to try to fix it. I tend to ignore these otherwise, myself, in favor of reviewing ones where someone has gone to the trouble of proposing a working fix.
>
> Also, I would love to do this if I had the permissions, but it would be great to change the JIRAs that are marked as “in progress” but where the corresponding pull request was closed/cancelled, for example https://issues.apache.org/jira/browse/SPARK-4638. That JIRA is
>
> Yes, flag these. I or others can close them if appropriate. Anyone who consistently does this well, we could give JIRA permissions to.
>
> Opening a PR automatically makes it "In Progress" but there's no complementary process to un-mark it. You can ignore the Open / In Progress distinction really.
>
> This one is interesting because it does seem like a plausible feature to add. The original PR was abandoned by the author and nobody else submitted one -- despite the Votes. I hesitate to signal that no PRs would be considered, but it doesn't seem like it's in demand enough for someone to work on?
>
> I think one of my messages is that, de facto, here, like in many Apache projects, committers do not take requests.
> They pursue the work they believe needs doing, and shepherd work initiated by others (a clear bug report, a PR) to a resolution. Things get done by doing them, or by building influence by doing other things the project needs doing. It isn't a mechanical, objective process, and can't be. But it does work in a recognizable way.
>
> --
> Joseph Bradley
> Software Engineer - Machine Learning
> Databricks, Inc.
> <http://databricks.com/>