Oops. Let me try to figure that out.

On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
> Thanks for picking up on this.
>
> Maybe I fail at google docs, but I can't see any edits on the document
> you linked.
>
> Regarding lazy consensus, if the board in general has less of an issue
> with that, sure. As long as it is clearly announced, lasts at least
> 72 hours, and has a clear outcome.
>
> The other points are hard to comment on without being able to see the
> text in question.
>
>
> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
> > I just looked through the entire thread again tonight - there are a
> > lot of great ideas being discussed. Thanks Cody for taking the first
> > crack at the proposal.
> >
> > I want to first comment on the context. Spark is one of the most
> > innovative and important projects in (big) data -- overall, the
> > technical decisions made in Apache Spark are sound. But of course, a
> > project as large and active as Spark always has room for improvement,
> > and we as a community should strive to take it to the next level.
> >
> > To that end, the two biggest areas for improvement, in my opinion, are:
> >
> > 1. Visibility: There is so much happening that it is difficult to know
> > what really is going on. For people who don't follow closely, it is
> > difficult to know what the important initiatives are. Even for people
> > who do follow, it is difficult to know what specific things require
> > their attention, since the number of pull requests and JIRA tickets is
> > high and it's difficult to extract signal from noise.
> >
> > 2. Solicit user (broadly defined, including developers themselves)
> > input more proactively: At the end of the day the project provides
> > value because users use it. Users can't tell us exactly what to build,
> > but it is important to get their input.
> >
> >
> > I've taken Cody's doc and edited it:
> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
> > (I've made all my modifications trackable)
> >
> > There are a couple of high-level changes I made:
> >
> > 1. I've consulted a board member and he recommended lazy consensus as
> > opposed to voting. The reason being that in voting there can easily be
> > a "loser" that gets outvoted.
> >
> > 2. I made it lighter weight, and renamed "strategy" to "optional
> > design sketch". Echoing one of the earlier emails: "IMHO so far aside
> > from tagging things and linking them elsewhere simply having design
> > docs and prototype implementations in PRs is not something that has
> > not worked so far".
> >
> > 3. I made some language tweaks to focus more on visibility. For
> > example, "The purpose of an SIP is to inform and involve", rather than
> > just "involve". SIPs should also have at least two emails that go to
> > dev@.
> >
> >
> > While I was editing this, I thought we really needed a suggested
> > template for design docs too. I will get to that too ...
> >
> >
> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:
> >>
> >> Most things looked OK to me too, although I do plan to take a closer
> >> look after Nov 1st when we cut the release branch for 2.1.
> >>
> >>
> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com>
> >> wrote:
> >>>
> >>> The proposal looks OK to me. I assume, even though it's not
> >>> explicitly called out, that voting would happen by e-mail? A template
> >>> for the proposal document (instead of just a bullet list) would also
> >>> be nice, but that can be done at any time.
> >>>
> >>> BTW, shameless plug: I filed SPARK-18085, which I consider a
> >>> candidate for a SIP, given the scope of the work. The document
> >>> attached even somewhat matches the proposed format.
> >>> So if anyone wants to try out the process...
> >>>
> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org>
> >>> wrote:
> >>> > Now that Spark Summit Europe is over, are any committers interested
> >>> > in moving forward with this?
> >>> >
> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
> >>> >
> >>> > Or are we going to let this discussion die on the vine?
> >>> >
> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
> >>> > <tomasz.gaw...@outlook.com> wrote:
> >>> >> Maybe my mail was not clear enough.
> >>> >>
> >>> >> I didn't want to write "let's focus on Flink" or any other
> >>> >> framework. The idea with the benchmarks was to show two things:
> >>> >>
> >>> >> - why some people are doing bad PR for Spark
> >>> >>
> >>> >> - how, in an easy way, we can change that and show that Spark is
> >>> >> still on top
> >>> >>
> >>> >> No more, no less. Benchmarks will be helpful, but I don't think
> >>> >> they're the most important thing in Spark :) On the Spark main
> >>> >> page there is still the "Spark vs Hadoop" chart. It is important
> >>> >> to show that the framework is not the same Spark with another API,
> >>> >> but much faster and more optimized, comparable to or even faster
> >>> >> than other frameworks.
> >>> >>
> >>> >> About real-time streaming: I think it would be good to see it in
> >>> >> Spark. I really like the current Spark model, but many voices say
> >>> >> "we need more" - the community should also listen to them and try
> >>> >> to help them. With SIPs it would be easier; I've just posted this
> >>> >> example as a "thing that may be changed with a SIP".
> >>> >>
> >>> >> I really like the unification via Datasets, but there are a lot
> >>> >> of algorithms inside - let's make an easy API, but with a strong
> >>> >> background (articles, benchmarks, descriptions, etc.) that shows
> >>> >> that Spark is still a modern framework.
> >>> >>
> >>> >> Maybe now my intention will be clearer :) As I said, the
> >>> >> organizational ideas were already mentioned and I agree with
> >>> >> them; my mail was just to show some aspects from my side - the
> >>> >> side of a developer and a person who is trying to help others
> >>> >> with Spark (via StackOverflow or other ways).
> >>> >>
> >>> >> Pozdrawiam / Best regards,
> >>> >>
> >>> >> Tomasz
> >>> >>
> >>> >> ________________________________
> >>> >> From: Cody Koeninger <c...@koeninger.org>
> >>> >> Sent: 17 October 2016 16:46
> >>> >> To: Debasish Das
> >>> >> Cc: Tomasz Gawęda; dev@spark.apache.org
> >>> >> Subject: Re: Spark Improvement Proposals
> >>> >>
> >>> >> I think narrowly focusing on Flink or benchmarks is missing my
> >>> >> point.
> >>> >>
> >>> >> My point is evolve or die. Spark's governance and organization is
> >>> >> hampering its ability to evolve technologically, and it needs to
> >>> >> change.
> >>> >>
> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
> >>> >> <debasish.da...@gmail.com> wrote:
> >>> >>> Thanks Cody for bringing up a valid point... I picked up Spark
> >>> >>> in 2014 as soon as I looked into it, since compared to writing
> >>> >>> Java map-reduce and Cascading code, Spark made writing
> >>> >>> distributed code fun... But now, as we go deeper with Spark and
> >>> >>> the real-time streaming use case gets more prominent, I think it
> >>> >>> is time to bring a messaging model in conjunction with the
> >>> >>> batch/micro-batch API that Spark is good at... A close
> >>> >>> akka-streams integration with Spark's micro-batching APIs looks
> >>> >>> like a great direction to stay in the game with Apache Flink...
> >>> >>> Spark 2.0 integrated streaming with batch under the assumption
> >>> >>> that micro-batching is sufficient to run SQL commands on a
> >>> >>> stream, but do we really have time to do SQL processing on
> >>> >>> streaming data within 1-2 seconds?
> >>> >>>
> >>> >>> After reading the email chain, I started to look into the Flink
> >>> >>> documentation, and if you compare it with the Spark
> >>> >>> documentation, I think we have major work to do detailing out
> >>> >>> Spark internals so that more people from the community start to
> >>> >>> take an active role in improving the issues, so that Spark stays
> >>> >>> strong compared to Flink.
> >>> >>>
> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
> >>> >>>
> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
> >>> >>>
> >>> >>> Spark is no longer an engine that works only for micro-batch and
> >>> >>> batch... We (and I am sure many others) are pushing Spark as an
> >>> >>> engine for stream and query processing... We need to make it a
> >>> >>> state-of-the-art engine for high-speed streaming data and user
> >>> >>> queries as well!
> >>> >>>
> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
> >>> >>> <tomasz.gaw...@outlook.com> wrote:
> >>> >>>>
> >>> >>>> Hi everyone,
> >>> >>>>
> >>> >>>> I'm quite late with my answer, but I think my suggestions may
> >>> >>>> help a little bit. :) Many technical and organizational topics
> >>> >>>> were mentioned, but I want to focus on the negative posts about
> >>> >>>> Spark and about "haters".
> >>> >>>>
> >>> >>>> I really like Spark. Ease of use, speed, a very good community -
> >>> >>>> it's all here. But every project has to "fight" on the
> >>> >>>> "framework market" to stay number one. I'm following many Spark
> >>> >>>> and Big Data communities; maybe my mail will inspire someone :)
> >>> >>>>
> >>> >>>> You (every Spark developer; so far I didn't have enough time to
> >>> >>>> start contributing to Spark) have done an excellent job. So why
> >>> >>>> are some people saying that Flink (or another framework) is
> >>> >>>> better, as was posted on this mailing list? No, not because that
> >>> >>>> framework is better in all cases. In my opinion, many of these
> >>> >>>> discussions were started after Flink marketing-like posts.
> >>> >>>> Please look at the StackOverflow "Flink vs ..." posts; almost
> >>> >>>> every post is "won" by Flink.
> >>> >>>> The answers sometimes say nothing about other frameworks;
> >>> >>>> Flink's users (often PMC members) just post the same information
> >>> >>>> about real-time streaming, delta iterations, etc. It looks smart
> >>> >>>> and is very often marked as the answer, even if - in my opinion
> >>> >>>> - it doesn't tell the whole truth.
> >>> >>>>
> >>> >>>> My suggestion: I don't have enough money and knowledge to
> >>> >>>> perform huge performance tests. Maybe some company that supports
> >>> >>>> Spark (Databricks, Cloudera? - just saying, you're the most
> >>> >>>> visible in the community :) ) could perform performance tests
> >>> >>>> of:
> >>> >>>>
> >>> >>>> - the streaming engine - Spark will probably lose because of the
> >>> >>>> mini-batch model; however, the difference should currently be
> >>> >>>> much lower than in previous versions
> >>> >>>>
> >>> >>>> - Machine Learning models
> >>> >>>>
> >>> >>>> - batch jobs
> >>> >>>>
> >>> >>>> - graph jobs
> >>> >>>>
> >>> >>>> - SQL queries
> >>> >>>>
> >>> >>>> People will see that Spark is evolving and is also a modern
> >>> >>>> framework, because after reading the posts mentioned above
> >>> >>>> people may think "it is outdated, the future is in framework X".
> >>> >>>>
> >>> >>>> Matei Zaharia posted an excellent blog post about how Spark
> >>> >>>> Structured Streaming beats every other framework in terms of
> >>> >>>> ease of use and reliability. Performance tests, done in various
> >>> >>>> environments (for example: a laptop, a small 2-node cluster, a
> >>> >>>> 10-node cluster, a 20-node cluster), could also be very good
> >>> >>>> marketing material to say "hey, you're telling us you're better,
> >>> >>>> but Spark is still faster and is still getting even faster!".
> >>> >>>> This would be based on facts (just numbers), not opinions.
> >>> >>>> It would be good for companies, for marketing purposes, and for
> >>> >>>> every Spark developer.
> >>> >>>>
> >>> >>>> Second: real-time streaming. I've written some time ago about
> >>> >>>> real-time streaming support in Spark Structured Streaming. Some
> >>> >>>> work should be done to make SSS more low-latency, but I think
> >>> >>>> it's possible. Maybe Spark could look at Gearpump, which is also
> >>> >>>> built on top of Akka? I don't know yet; it is a good topic for a
> >>> >>>> SIP. However, I think that Spark should have real-time streaming
> >>> >>>> support. Currently I see many posts/comments saying "Spark has
> >>> >>>> too big a latency". Spark Streaming is doing a very good job
> >>> >>>> with micro-batches, but I think it is possible to also add more
> >>> >>>> real-time processing.
> >>> >>>>
> >>> >>>> Other people have said much more, and I agree with the SIP
> >>> >>>> proposal. I'm also happy that the PMC members are not saying
> >>> >>>> that they will not listen to users, but that they really want to
> >>> >>>> make Spark better for every user.
> >>> >>>>
> >>> >>>> What do you think about these two topics? I'm especially looking
> >>> >>>> at Cody (who started this topic) and the PMC :)
> >>> >>>>
> >>> >>>> Pozdrawiam / Best regards,
> >>> >>>>
> >>> >>>> Tomasz
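The latency concern raised in the thread (can SQL over a stream really complete within 1-2 seconds under micro-batching?) can be sketched with a toy simulator. This is plain Python, not Spark code, and all names in it are illustrative; it only demonstrates the arithmetic behind the objection: an event's worst-case latency is roughly one trigger interval plus the batch's processing time.

```python
from collections import deque

def micro_batch_latencies(events, interval, proc_time):
    """Toy micro-batch simulator (illustrative only, not Spark code).

    events: list of (arrival_time, key) pairs, sorted by arrival_time.
    interval: trigger interval in seconds (e.g. a 1 s micro-batch trigger).
    proc_time: fixed processing time per non-empty batch, in seconds.

    Returns {key: completion_time - arrival_time} for every event.
    """
    latencies = {}
    pending = deque(events)
    clock = 0.0
    while pending:
        clock += interval  # wait for the next trigger to fire
        batch = []
        while pending and pending[0][0] <= clock:
            batch.append(pending.popleft())  # everything that has arrived
        if batch:
            clock += proc_time  # the batch finishes after proc_time
            for arrival, key in batch:
                latencies[key] = clock - arrival
    return latencies

# An event arriving just after a trigger waits almost a full interval,
# plus the batch's processing time: worst case ~ interval + proc_time.
lat = micro_batch_latencies([(0.05, "a"), (0.95, "b")],
                            interval=1.0, proc_time=0.2)
# "a" waited ~1.15 s, "b" only ~0.25 s - arrival phase dominates.
```

With a 1-second trigger and 0.2 s of per-batch work, worst-case latency is already ~1.2 s, which is why "SQL on streaming data within 1-2 seconds" is a tight budget for the micro-batch model rather than an impossibility.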