It turned out suggested edits (trackable) don't show up for non-owners, so I've just merged all the edits in place. It should be visible now.
On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote:
> Oops. Let me try to figure that out.
>
> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
>> Thanks for picking up on this.
>>
>> Maybe I fail at Google Docs, but I can't see any edits on the document
>> you linked.
>>
>> Regarding lazy consensus, if the board in general has less of an issue
>> with that, sure. As long as it is clearly announced, lasts at least
>> 72 hours, and has a clear outcome.
>>
>> The other points are hard to comment on without being able to see the
>> text in question.
>>
>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
>>> I just looked through the entire thread again tonight - there are a lot
>>> of great ideas being discussed. Thanks Cody for taking the first crack
>>> at the proposal.
>>>
>>> I want to first comment on the context. Spark is one of the most
>>> innovative and important projects in (big) data -- overall, the
>>> technical decisions made in Apache Spark are sound. But of course, a
>>> project as large and active as Spark always has room for improvement,
>>> and we as a community should strive to take it to the next level.
>>>
>>> To that end, the two biggest areas for improvement in my opinion are:
>>>
>>> 1. Visibility: There is so much happening that it is difficult to know
>>> what is really going on. For people who don't follow closely, it is
>>> difficult to know what the important initiatives are. Even for people
>>> who do follow, it is difficult to know what specific things require
>>> their attention, since the number of pull requests and JIRA tickets is
>>> high and it's difficult to extract signal from noise.
>>>
>>> 2. Soliciting user (broadly defined, including developers themselves)
>>> input more proactively: At the end of the day, the project provides
>>> value because users use it.
>>> Users can't tell us exactly what to build, but it is important to get
>>> their input.
>>>
>>> I've taken Cody's doc and edited it:
>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>> (I've made all my modifications trackable.)
>>>
>>> There are a couple of high-level changes I made:
>>>
>>> 1. I've consulted a board member and he recommended lazy consensus as
>>> opposed to voting, the reason being that in voting there can easily be
>>> a "loser" that gets outvoted.
>>>
>>> 2. I made it lighter weight, and renamed "strategy" to "optional design
>>> sketch". Echoing one of the earlier emails: "IMHO so far aside from
>>> tagging things and linking them elsewhere simply having design docs and
>>> prototypes implementations in PRs is not something that has not worked
>>> so far".
>>>
>>> 3. I made some language tweaks to focus more on visibility. For
>>> example, "The purpose of an SIP is to inform and involve", rather than
>>> just "involve". SIPs should also have at least two emails that go to
>>> dev@.
>>>
>>> While I was editing this, I thought we really needed a suggested
>>> template for the design doc too. I will get to that as well...
>>>
>>> On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:
>>>> Most things looked OK to me too, although I do plan to take a closer
>>>> look after Nov 1st, when we cut the release branch for 2.1.
>>>>
>>>> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>>>> The proposal looks OK to me. I assume, even though it's not
>>>>> explicitly called out, that voting would happen by e-mail? A template
>>>>> for the proposal document (instead of just a bullet list) would also
>>>>> be nice, but that can be done at any time.
>>>>> BTW, shameless plug: I filed SPARK-18085, which I consider a
>>>>> candidate for a SIP, given the scope of the work. The attached
>>>>> document even somewhat matches the proposed format. So if anyone
>>>>> wants to try out the process...
>>>>>
>>>>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>> Now that Spark Summit Europe is over, are any committers interested
>>>>>> in moving forward with this?
>>>>>>
>>>>>> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>>
>>>>>> Or are we going to let this discussion die on the vine?
>>>>>>
>>>>>> On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>>>>> <tomasz.gaw...@outlook.com> wrote:
>>>>>>> Maybe my mail was not clear enough.
>>>>>>>
>>>>>>> I didn't want to write "let's focus on Flink" or any other
>>>>>>> framework. The idea with the benchmarks was to show two things:
>>>>>>>
>>>>>>> - why some people are doing bad PR for Spark
>>>>>>>
>>>>>>> - how, in an easy way, we can change that and show that Spark is
>>>>>>> still on top
>>>>>>>
>>>>>>> No more, no less. Benchmarks would be helpful, but I don't think
>>>>>>> they're the most important thing in Spark :) On the Spark main page
>>>>>>> there is still the "Spark vs Hadoop" chart. It is important to show
>>>>>>> that the framework is not just the same Spark with another API, but
>>>>>>> much faster and more optimized, comparable to or even faster than
>>>>>>> other frameworks.
>>>>>>>
>>>>>>> About real-time streaming: I think it would simply be good to see
>>>>>>> it in Spark. I really like the current Spark model, but there are
>>>>>>> many voices saying "we need more" - the community should also
>>>>>>> listen to them and try to help them.
>>>>>>> With SIPs it would be easier; I just posted this example as a
>>>>>>> "thing that may be changed with a SIP".
>>>>>>>
>>>>>>> I really like the unification via Datasets, but there are a lot of
>>>>>>> algorithms inside - let's make an easy API, but with a strong
>>>>>>> background (articles, benchmarks, descriptions, etc.) that shows
>>>>>>> that Spark is still a modern framework.
>>>>>>>
>>>>>>> Maybe now my intention is clearer :) As I said, the organizational
>>>>>>> ideas were already mentioned and I agree with them; my mail was
>>>>>>> just to show some aspects from my side, that is, from the side of a
>>>>>>> developer and a person who is trying to help others with Spark (via
>>>>>>> StackOverflow or other ways).
>>>>>>>
>>>>>>> Pozdrawiam / Best regards,
>>>>>>>
>>>>>>> Tomasz
>>>>>>>
>>>>>>> ________________________________
>>>>>>> From: Cody Koeninger <c...@koeninger.org>
>>>>>>> Sent: October 17, 2016, 16:46
>>>>>>> To: Debasish Das
>>>>>>> Cc: Tomasz Gawęda; dev@spark.apache.org
>>>>>>> Subject: Re: Spark Improvement Proposals
>>>>>>>
>>>>>>> I think narrowly focusing on Flink or benchmarks is missing my
>>>>>>> point.
>>>>>>>
>>>>>>> My point is evolve or die. Spark's governance and organization is
>>>>>>> hampering its ability to evolve technologically, and it needs to
>>>>>>> change.
>>>>>>> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>>>>>> <debasish.da...@gmail.com> wrote:
>>>>>>>> Thanks Cody for bringing up a valid point... I picked up Spark in
>>>>>>>> 2014 as soon as I looked into it, since compared to writing Java
>>>>>>>> map-reduce and Cascading code, Spark made writing distributed code
>>>>>>>> fun... But now, as we go deeper with Spark and the real-time
>>>>>>>> streaming use case gets more prominent, I think it is time to
>>>>>>>> bring a messaging model in conjunction with the batch/micro-batch
>>>>>>>> API that Spark is good at... Close integration of akka-streams
>>>>>>>> with Spark's micro-batching APIs looks like a great direction to
>>>>>>>> stay in the game with Apache Flink... Spark 2.0 integrated
>>>>>>>> streaming with batch under the assumption that micro-batching is
>>>>>>>> sufficient to run SQL commands on a stream, but do we really have
>>>>>>>> time to do SQL processing on streaming data within 1-2 seconds?
>>>>>>>>
>>>>>>>> After reading the email chain, I started to look into the Flink
>>>>>>>> documentation, and if you compare it with the Spark documentation,
>>>>>>>> I think we have major work to do detailing Spark internals so that
>>>>>>>> more people from the community start to take an active role in
>>>>>>>> improving the issues, so that Spark stays strong compared to
>>>>>>>> Flink.
>>>>>>>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>>>>>>>
>>>>>>>> Spark is no longer an engine that works only for micro-batch and
>>>>>>>> batch... We (and I am sure many others) are pushing Spark as an
>>>>>>>> engine for stream and query processing... We need to make it a
>>>>>>>> state-of-the-art engine for high-speed streaming data and user
>>>>>>>> queries as well!
>>>>>>>>
>>>>>>>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>>>>>>>> <tomasz.gaw...@outlook.com> wrote:
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I'm quite late with my answer, but I think my suggestions may
>>>>>>>>> help a little bit. :) Many technical and organizational topics
>>>>>>>>> were mentioned, but I want to focus on the negative posts about
>>>>>>>>> Spark and about "haters".
>>>>>>>>>
>>>>>>>>> I really like Spark. Ease of use, speed, a very good community -
>>>>>>>>> it's all here. But every project has to "fight" on the "framework
>>>>>>>>> market" to stay no. 1. I follow many Spark and Big Data
>>>>>>>>> communities; maybe my mail will inspire someone :)
>>>>>>>>>
>>>>>>>>> You (every Spark developer; so far I haven't had enough time to
>>>>>>>>> start contributing to Spark) have done an excellent job. So why
>>>>>>>>> are some people saying that Flink (or another framework) is
>>>>>>>>> better, like it was posted on this mailing list? No, not because
>>>>>>>>> that framework is better in all cases. In my opinion, many of
>>>>>>>>> these discussions were started after Flink marketing-like posts.
>>>>>>>>> Please look at the StackOverflow "Flink vs ..." posts: almost
>>>>>>>>> every one is "won" by Flink.
>>>>>>>>> The answers sometimes say nothing about other frameworks; Flink's
>>>>>>>>> users (often PMC members) just post the same information about
>>>>>>>>> real-time streaming, delta iterations, etc. It looks smart, and
>>>>>>>>> very often it is marked as the answer, even if - in my opinion -
>>>>>>>>> the whole truth wasn't told.
>>>>>>>>>
>>>>>>>>> My suggestion: I don't have enough money or knowledge to perform
>>>>>>>>> a huge performance test. Maybe some company that supports Spark
>>>>>>>>> (Databricks, Cloudera? - just saying, you're the most visible in
>>>>>>>>> the community :) ) could run performance tests of:
>>>>>>>>>
>>>>>>>>> - the streaming engine - Spark will probably lose because of the
>>>>>>>>> micro-batch model; however, currently the difference should be
>>>>>>>>> much lower than in previous versions
>>>>>>>>>
>>>>>>>>> - Machine Learning models
>>>>>>>>>
>>>>>>>>> - batch jobs
>>>>>>>>>
>>>>>>>>> - graph jobs
>>>>>>>>>
>>>>>>>>> - SQL queries
>>>>>>>>>
>>>>>>>>> People would see that Spark is evolving and is also a modern
>>>>>>>>> framework, because after reading the posts mentioned above,
>>>>>>>>> people may think "it is outdated, the future is in framework X".
>>>>>>>>>
>>>>>>>>> Matei Zaharia posted an excellent blog post about how Spark
>>>>>>>>> Structured Streaming beats every other framework in terms of ease
>>>>>>>>> of use and reliability. Performance tests, done in various
>>>>>>>>> environments (for example: a laptop, a small 2-node cluster, a
>>>>>>>>> 10-node cluster, a 20-node cluster), could also be very good
>>>>>>>>> marketing material to say "hey, you say you're better, but Spark
>>>>>>>>> is still faster and is still getting faster!". This would be
>>>>>>>>> based on facts (just numbers), not opinions.
>>>>>>>>> It would be good for companies, for marketing purposes, and for
>>>>>>>>> every Spark developer.
>>>>>>>>>
>>>>>>>>> Second: real-time streaming. I wrote some time ago about
>>>>>>>>> real-time streaming support in Spark Structured Streaming. Some
>>>>>>>>> work would have to be done to make SSS lower-latency, but I think
>>>>>>>>> it's possible. Maybe Spark could look at Gearpump, which is also
>>>>>>>>> built on top of Akka? I don't know yet; it is a good topic for a
>>>>>>>>> SIP. However, I think that Spark should have real-time streaming
>>>>>>>>> support. Currently I see many posts/comments saying "Spark has
>>>>>>>>> too much latency". Spark Streaming is doing a very good job with
>>>>>>>>> micro-batches, but I think it is possible to also add more
>>>>>>>>> real-time processing.
>>>>>>>>>
>>>>>>>>> Other people have said much more, and I agree with the SIP
>>>>>>>>> proposal. I'm also happy that the PMC members are not saying that
>>>>>>>>> they will not listen to users, but that they really want to make
>>>>>>>>> Spark better for every user.
>>>>>>>>>
>>>>>>>>> What do you think about these two topics? I'm especially looking
>>>>>>>>> at Cody (who started this topic) and the PMC members :)
>>>>>>>>>
>>>>>>>>> Pozdrawiam / Best regards,
>>>>>>>>>
>>>>>>>>> Tomasz