I'm bumping this one more time for the new year, and then I'm giving up. Please fix your process, even if it isn't exactly the way I suggested.
On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:
> On lazy consensus as opposed to voting:
>
> First, why lazy consensus? The proposal was for consensus, which is at least three +1 votes and no vetoes. Consensus has no losing side; it requires getting to a point where there is agreement. Isn't that agreement what we want to achieve with these proposals?
>
> Second, lazy consensus only removes the requirement for three +1 votes. Why would we not want at least three committers to think something is a good idea before adopting the proposal?
>
> rb
>
> On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> There are some minor things (the "Where" section heading appears to be dropped; wherever this document is posted, it needs to actually link to a JIRA filter showing current/past SIPs), but it doesn't look like I can comment on the Google doc.
>>
>> The major substantive issue I have is that this version is significantly less clear as to the outcome of an SIP.
>>
>> The Apache example of lazy consensus at http://apache.org/foundation/voting.html#LazyConsensus involves an explicit announcement of an explicit deadline, which I think are necessary for clarity.
>>
>> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com> wrote:
>> > It turned out suggested edits (trackable) don't show up for non-owners, so I've just merged all the edits in place. It should be visible now.
>> >
>> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote:
>> >>
>> >> Oops. Let me try to figure that out.
>> >>
>> >> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
>> >>>
>> >>> Thanks for picking up on this.
>> >>>
>> >>> Maybe I fail at Google Docs, but I can't see any edits on the document you linked.
>> >>>
>> >>> Regarding lazy consensus, if the board in general has less of an issue with that, sure.
>> >>> As long as it is clearly announced, lasts at least 72 hours, and has a clear outcome.
>> >>>
>> >>> The other points are hard to comment on without being able to see the text in question.
>> >>>
>> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
>> >>> > I just looked through the entire thread again tonight - there are a lot of great ideas being discussed. Thanks Cody for taking the first crack at the proposal.
>> >>> >
>> >>> > I want to first comment on the context. Spark is one of the most innovative and important projects in (big) data -- overall, the technical decisions made in Apache Spark are sound. But of course, a project as large and active as Spark always has room for improvement, and we as a community should strive to take it to the next level.
>> >>> >
>> >>> > To that end, the two biggest areas for improvement in my opinion are:
>> >>> >
>> >>> > 1. Visibility: There is so much happening that it is difficult to know what really is going on. For people who don't follow closely, it is difficult to know what the important initiatives are. Even for people who do follow, it is difficult to know what specific things require their attention, since the number of pull requests and JIRA tickets is high and it's difficult to extract signal from noise.
>> >>> >
>> >>> > 2. Soliciting user (broadly defined, including developers themselves) input more proactively: At the end of the day the project provides value because users use it. Users can't tell us exactly what to build, but it is important to get their input.
>> >>> >
>> >>> > I've taken Cody's doc and edited it:
>> >>> >
>> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>> >>> > (I've made all my modifications trackable)
>> >>> >
>> >>> > There are a couple of high-level changes I made:
>> >>> >
>> >>> > 1. I've consulted a board member and he recommended lazy consensus as opposed to voting. The reason being that in voting there can easily be a "loser" that gets outvoted.
>> >>> >
>> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional design sketch". Echoing one of the earlier emails: "IMHO so far aside from tagging things and linking them elsewhere simply having design docs and prototype implementations in PRs is not something that has not worked so far".
>> >>> >
>> >>> > 3. I made some language tweaks to focus more on visibility. For example, "The purpose of an SIP is to inform and involve", rather than just "involve". SIPs should also have at least two emails that go to dev@.
>> >>> >
>> >>> > While I was editing this, I thought we really needed a suggested template for the design doc too. I will get to that as well ...
>> >>> >
>> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:
>> >>> >>
>> >>> >> Most things looked OK to me too, although I do plan to take a closer look after Nov 1st when we cut the release branch for 2.1.
>> >>> >>
>> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>> >>> >>>
>> >>> >>> The proposal looks OK to me. I assume, even though it's not explicitly called out, that voting would happen by e-mail?
>> >>> >>> A template for the proposal document (instead of just a bullet list) would also be nice, but that can be done at any time.
>> >>> >>>
>> >>> >>> BTW, shameless plug: I filed SPARK-18085, which I consider a candidate for a SIP, given the scope of the work. The document attached even somewhat matches the proposed format. So if anyone wants to try out the process...
>> >>> >>>
>> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
>> >>> >>> > Now that Spark Summit Europe is over, are any committers interested in moving forward with this?
>> >>> >>> >
>> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >>> >>> >
>> >>> >>> > Or are we going to let this discussion die on the vine?
>> >>> >>> >
>> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
>> >>> >>> >> Maybe my mail was not clear enough.
>> >>> >>> >>
>> >>> >>> >> I didn't want to write "let's focus on Flink" or any other framework. The idea with the benchmarks was to show two things:
>> >>> >>> >>
>> >>> >>> >> - why some people are doing bad PR for Spark
>> >>> >>> >>
>> >>> >>> >> - how - in an easy way - we can change that and show that Spark is still on top
>> >>> >>> >>
>> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think they're the most important thing in Spark :) On the Spark main page there is still the chart "Spark vs Hadoop".
>> >>> >>> >> It is important to show that the framework is not just the same Spark with another API, but much faster and more optimized, comparable to or even faster than other frameworks.
>> >>> >>> >>
>> >>> >>> >> About real-time streaming: I think it would simply be good to see it in Spark. I really like the current Spark model, but there are many voices saying "we need more" - the community should also listen to them and try to help them. With SIPs that would be easier; I've just posted this example as a "thing that may be changed with a SIP".
>> >>> >>> >>
>> >>> >>> >> I really like the unification via Datasets, but there are a lot of algorithms inside - let's make an easy API, but with strong background material (articles, benchmarks, descriptions, etc.) that shows that Spark is still a modern framework.
>> >>> >>> >>
>> >>> >>> >> Maybe now my intention will be clearer :) As I said, organizational ideas were already mentioned and I agree with them; my mail was just to show some aspects from my side, that is, from the side of a developer and a person who is trying to help others with Spark (via StackOverflow or other ways).
>> >>> >>> >>
>> >>> >>> >> Pozdrawiam / Best regards,
>> >>> >>> >>
>> >>> >>> >> Tomasz
>> >>> >>> >>
>> >>> >>> >> ________________________________
>> >>> >>> >> From: Cody Koeninger <c...@koeninger.org>
>> >>> >>> >> Sent: 17 October 2016 16:46
>> >>> >>> >> To: Debasish Das
>> >>> >>> >> CC: Tomasz Gawęda; dev@spark.apache.org
>> >>> >>> >> Subject: Re: Spark Improvement Proposals
>> >>> >>> >>
>> >>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my point.
>> >>> >>> >>
>> >>> >>> >> My point is evolve or die. Spark's governance and organization is hampering its ability to evolve technologically, and it needs to change.
>> >>> >>> >>
>> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>> >>> >>> >>> Thanks Cody for bringing up a valid point... I picked up Spark in 2014 as soon as I looked into it, since compared to writing Java map-reduce and Cascading code, Spark made writing distributed code fun... But now, as we go deeper with Spark and the real-time streaming use case gets more prominent, I think it is time to bring a messaging model in conjunction with the batch/micro-batch API that Spark is good at... akka-streams' close integration with Spark's micro-batching APIs looks like a great direction to stay in the game with Apache Flink... Spark 2.0 integrated streaming with batch under the assumption that micro-batching is sufficient to run SQL commands on a stream, but do we really have time to do SQL processing on streaming data within 1-2 seconds?
>> >>> >>> >>>
>> >>> >>> >>> After reading the email chain, I started to look into the Flink documentation, and if you compare it with the Spark documentation, I think we have major work to do detailing out Spark internals, so that more people from the community start to take an active role in improving the issues and Spark stays strong compared to Flink.
>> >>> >>> >>>
>> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>> >>> >>> >>>
>> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>> >>> >>> >>>
>> >>> >>> >>> Spark is no longer an engine that works only for micro-batch and batch... We (and I am sure many others) are pushing Spark as an engine for stream and query processing... we need to make it a state-of-the-art engine for high-speed streaming data and user queries as well!
>> >>> >>> >>>
>> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
>> >>> >>> >>>>
>> >>> >>> >>>> Hi everyone,
>> >>> >>> >>>>
>> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions may help a little bit. :) Many technical and organizational topics were mentioned, but I want to focus on the negative posts about Spark and about "haters".
>> >>> >>> >>>>
>> >>> >>> >>>> I really like Spark. Ease of use, speed, a very good community - it's all here. But every project has to fight on the "framework market" to stay number 1. I'm following many Spark and Big Data communities; maybe my mail will inspire someone :)
>> >>> >>> >>>>
>> >>> >>> >>>> You (every Spark developer; so far I haven't had enough time to join in contributing to Spark) have done an excellent job.
>> >>> >>> >>>> So why are some people saying that Flink (or another framework) is better, as was posted on this mailing list? No, not because that framework is better in all cases. In my opinion, many of these discussions were started after Flink's marketing-like posts. Please look at the StackOverflow "Flink vs ..." posts; almost every one is "won" by Flink. The answers sometimes say nothing about other frameworks; Flink's users (often PMC members) just post the same information about real-time streaming, delta iterations, etc. It looks smart, and very often it is marked as the answer, even if - in my opinion - not the whole truth was told.
>> >>> >>> >>>>
>> >>> >>> >>>> My suggestion: I don't have enough money and knowledge to perform a huge performance test. Maybe some company that supports Spark (Databricks, Cloudera?
>> >>> >>> >>>> - just saying you're the most visible in the community :) ) could perform a performance test of:
>> >>> >>> >>>>
>> >>> >>> >>>> - the streaming engine - Spark will probably lose because of the mini-batch model, although the difference should now be much lower than in previous versions
>> >>> >>> >>>>
>> >>> >>> >>>> - machine learning models
>> >>> >>> >>>>
>> >>> >>> >>>> - batch jobs
>> >>> >>> >>>>
>> >>> >>> >>>> - graph jobs
>> >>> >>> >>>>
>> >>> >>> >>>> - SQL queries
>> >>> >>> >>>>
>> >>> >>> >>>> People will see that Spark is evolving and is also a modern framework, because after reading the posts mentioned above people may think "it is outdated, the future is in framework X".
>> >>> >>> >>>>
>> >>> >>> >>>> Matei Zaharia posted an excellent blog post about how Spark Structured Streaming beats every other framework in terms of ease of use and reliability. Performance tests, done in various environments (for example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node cluster), could also be very good marketing material to say "hey, you're telling us you're better, but Spark is still faster and is still getting even faster!". This would be based on facts (just numbers), not opinions. It would be good for companies, for marketing purposes, and for every Spark developer.
>> >>> >>> >>>>
>> >>> >>> >>>> Second: real-time streaming. Some time ago I wrote about real-time streaming support in Spark Structured Streaming.
>> >>> >>> >>>> Some work should be done to make SSS more low-latency, but I think it's possible. Maybe Spark could look at Gearpump, which is also built on top of Akka? I don't know yet; it is a good topic for a SIP. However, I think that Spark should have real-time streaming support. Currently I see many posts/comments saying that "Spark has too high a latency". Spark Streaming is doing a very good job with micro-batches, but I think it is possible to also add more real-time processing.
>> >>> >>> >>>>
>> >>> >>> >>>> Other people have said much more, and I agree with the SIP proposal. I'm also happy that the PMC members are not saying that they will not listen to users; they really want to make Spark better for every user.
>> >>> >>> >>>>
>> >>> >>> >>>> What do you think about these two topics? I'm especially looking at Cody (who started this topic) and the PMC :)
>> >>> >>> >>>>
>> >>> >>> >>>> Pozdrawiam / Best regards,
>> >>> >>> >>>>
>> >>> >>> >>>> Tomasz
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
> Ryan Blue
> Software Engineer
> Netflix