Thanks for doing that. Given that there are at least 4 different Apache voting processes, "typical Apache vote process" isn't meaningful to me.
I think the intention is that in order to pass, a proposal needs at least three +1 votes from PMC members *and no -1 votes from PMC members*. But the document doesn't explicitly say that second part. There's also no mention of how long a vote should remain open. There is a mention of a month for finding a shepherd, but that's different. Other than that, LGTM.

On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <r...@databricks.com> wrote:

Here's a new draft that incorporates most of the feedback:
https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#

I added a specific role for SPIP Author and another one for SPIP Shepherd.

On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <gatorsm...@gmail.com> wrote:

During the summit, I also had a lot of discussions over similar topics with multiple committers and active users. I heard many fantastic ideas. I believe Spark improvement proposals are good channels for collecting requirements and designs.

IMO, we also need to consider priority when working on these items. Even if a proposal is accepted, that does not mean it will be implemented and merged immediately. It is not a FIFO queue.

Even after some PRs are merged, we sometimes still have to revert them if the design and implementation were not reviewed carefully. We have to ensure quality. Spark is not application software; it is infrastructure software used by many, many companies. We have to be very careful in design and implementation, especially when adding or changing external APIs.

When I developed Mainframe infrastructure/middleware software over the past six years, I was involved in discussions with external and internal customers. The to-do feature list was always above 100 items. Sometimes customers felt frustrated when we were unable to deliver on time due to resource limits and other constraints. Even if they paid us billions, we would still need to proceed phase by phase, or sometimes they would have to accept workarounds. That is the reality everyone has to face, I think.

Thanks,

Xiao Li

2017-02-11 7:57 GMT-08:00 Cody Koeninger <c...@koeninger.org>:

At the Spark Summit this week, everyone from PMC members to users I had never met before was asking me about the Spark improvement proposals idea. It's clear that it's a real community need.

But it's been almost half a year, and nothing visible has been done.

Reynold, are you going to do this?

If so, when?

If not, why?

You already did the right thing by including long-deserved committers. Please keep doing the right thing for the community.

On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <r...@databricks.com> wrote:

+1 on all counts (consensus, time bound, define roles).

I can update the doc in the next few days and share it back. Then maybe we can just officially vote on this. As Tim suggested, we might not get it 100% right the first time and would need to iterate. But that's fine.

On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <timhun...@databricks.com> wrote:

Hi Cody,
Thank you for bringing up this topic. I agree it is very important to keep a cohesive community around some common, fluid goals. Here are a few comments about the current document:

1. Name: it should not overlap with an existing one such as SIP. Can you imagine someone trying to discuss a Scala spore proposal for Spark? "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP sounds great.

2. Roles: at a high level, SPIPs are meant to reach consensus on technical decisions with a lasting impact. As such, the template should emphasize the role of the various parties during this process:

- The SPIP author is responsible for building consensus. She is the champion driving the process forward and is responsible for ensuring that the SPIP follows the general guidelines. The author should be identified in the SPIP. The authorship of a SPIP can be transferred if the current author is not interested and someone else wants to move the SPIP forward. There should probably be 2-3 authors at most for each SPIP.

- Someone with voting power should probably shepherd the SPIP (and be recorded as such): ensuring that the final decision over the SPIP is recorded (rejected, accepted, etc.), and advising on the technical quality of the SPIP. This person need not be a champion for the SPIP or contribute to it, but rather makes sure it stands a chance of being approved when the vote happens. Also, if the author cannot find anyone willing to take this role, the proposal is likely to be rejected anyway.

- Users, committers, and contributors have the roles already outlined in the document.

3. Timeline: ideally, once a SPIP has been offered for voting, it should move swiftly into either being accepted or rejected, so that we do not end up with a distracting long tail of half-hearted proposals.

These rules are meant to be flexible, but the current document should be clear about who is in charge of a SPIP and the state it is currently in.

We have had long discussions over some very important questions such as approval. I do not have an opinion on these, but why not make a pick and reevaluate the decision later? This is not a binding process at this point.

Tim

On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <c...@koeninger.org> wrote:

I don't have a concern about voting vs. consensus.

I have a concern that whatever the decision-making process is, it should be explicitly announced on the ticket for the given proposal, with an explicit deadline and an explicit outcome.

On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <iras...@cloudera.com> wrote:

I'm also in favor of this. Thanks for your persistence, Cody.

My take on the specific issues Joseph mentioned:

1) Voting vs. consensus -- I agree with the argument Ryan Blue made earlier for consensus:

> Majority vs consensus: My rationale is that I don't think we want to consider a proposal approved if it had objections serious enough that committers down-voted (or PMC, depending on who gets a vote). If these proposals are like PEPs, then they represent a significant amount of community effort and I wouldn't want to move forward if up to half of the community thinks it's an untenable idea.

2) Design doc template -- agree this would be useful, but it also seems totally orthogonal to moving forward on the SIP proposal.

3) Agree with Joseph's proposal for updating the template.

One small addition:

4) Deciding on a name -- minor, but I think it's worth disambiguating from Scala's SIPs, and the best proposal I've heard is "SPIP". At least, no one has objected. (I don't care enough that I'd object to anything else, though.)

On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <jos...@databricks.com> wrote:

Hi Cody,

Thanks for being persistent about this. I too would like to see this happen.
Reviewing the thread, it sounds like the main things remaining are:
* Decide about a few issues
* Finalize the doc(s)
* Vote on this proposal

Issues & TODOs:

(1) The main issue I see above is voting vs. consensus. I have little preference here. It sounds like something which could be tailored based on whether we see too many or too few SIPs being approved.

(2) Design doc template (this would be great to have for Spark regardless of this SIP discussion).
* Reynold, are you still putting this together?

(3) Template cleanups. Listing some items mentioned above, plus a new one w.r.t. Reynold's draft <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>:
* Reinstate the "Where" section with links to current and past SIPs
* Add a field for stating explicit deadlines for approval
* Add a field for stating the Author and the Committer shepherd

Thanks all!
Joseph

On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <c...@koeninger.org> wrote:

I'm bumping this one more time for the new year, and then I'm giving up.

Please, fix your process, even if it isn't exactly the way I suggested.

On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:

On lazy consensus as opposed to voting:

First, why lazy consensus? The proposal was for consensus, which is at least three +1 votes and no vetoes. Consensus has no losing side; it requires getting to a point where there is agreement. Isn't that agreement what we want to achieve with these proposals?

Second, lazy consensus only removes the requirement for three +1 votes. Why would we not want at least three committers to think something is a good idea before adopting the proposal?

rb

On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <c...@koeninger.org> wrote:

So there are some minor things (the "Where" section heading appears to be dropped; wherever this document is posted, it needs to actually link to a JIRA filter showing current/past SIPs), but it doesn't look like I can comment on the Google doc.

The major substantive issue I have is that this version is significantly less clear as to the outcome of an SIP.

The Apache example of lazy consensus at http://apache.org/foundation/voting.html#LazyConsensus involves an explicit announcement of an explicit deadline, which I think are necessary for clarity.

On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com> wrote:

It turned out suggested edits (trackable) don't show up for non-owners, so I've just merged all the edits in place. It should be visible now.

On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote:

Oops. Let me try to figure that out.

On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:

Thanks for picking up on this.

Maybe I fail at Google Docs, but I can't see any edits on the document you linked.

Regarding lazy consensus: if the board in general has less of an issue with that, sure. As long as it is clearly announced, lasts at least 72 hours, and has a clear outcome.

The other points are hard to comment on without being able to see the text in question.

On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:

I just looked through the entire thread again tonight -- there are a lot of great ideas being discussed. Thanks Cody for taking the first crack at the proposal.

I want to first comment on the context. Spark is one of the most innovative and important projects in (big) data -- overall, the technical decisions made in Apache Spark are sound. But of course, a project as large and active as Spark always has room for improvement, and we as a community should strive to take it to the next level.

To that end, the two biggest areas for improvement in my opinion are:

1. Visibility: there is so much happening that it is difficult to know what really is going on. For people who don't follow closely, it is difficult to know what the important initiatives are. Even for people who do follow, it is difficult to know what specific things require their attention, since the number of pull requests and JIRA tickets is high and it's difficult to extract signal from noise.

2. Soliciting user (broadly defined, including developers themselves) input more proactively: at the end of the day, the project provides value because users use it. Users can't tell us exactly what to build, but it is important to get their input.

I've taken Cody's doc and edited it:

https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
(I've made all my modifications trackable)

There are a couple of high-level changes I made:

1. I've consulted a board member and he recommended lazy consensus as opposed to voting, the reason being that in voting there can easily be a "loser" that gets outvoted.

2. I made it lighter weight, and renamed "strategy" to "optional design sketch".
Echoing one of the earlier emails: "IMHO so far aside from tagging things and linking them elsewhere simply having design docs and prototype implementations in PRs is not something that has worked so far".

3. I made some language tweaks to focus more on visibility. For example, "The purpose of an SIP is to inform and involve", rather than just "involve". SIPs should also have at least two emails that go to dev@.

While I was editing this, I thought we really needed a suggested template for the design doc too. I will get to that too ...

On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:

Most things looked OK to me too, although I do plan to take a closer look after Nov 1st, when we cut the release branch for 2.1.

On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

The proposal looks OK to me. I assume, even though it's not explicitly called out, that voting would happen by e-mail? A template for the proposal document (instead of just a bullet list) would also be nice, but that can be done at any time.

BTW, shameless plug: I filed SPARK-18085, which I consider a candidate for a SIP, given the scope of the work. The document attached even somewhat matches the proposed format. So if anyone wants to try out the process...

On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:

Now that Spark Summit Europe is over, are any committers interested in moving forward with this?

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

Or are we going to let this discussion die on the vine?

On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:

Maybe my mail was not clear enough.

I didn't want to write "let's focus on Flink" or any other framework. The idea with benchmarks was to show two things:

- why some people are doing bad PR for Spark

- how, in an easy way, we can change that and show that Spark is still on top

No more, no less. Benchmarks will be helpful, but I don't think they're the most important thing in Spark :) On the Spark main page there is still the "Spark vs Hadoop" chart. It is important to show that the framework is not just the same Spark with another API, but much faster and more optimized, comparable to or even faster than other frameworks.

About real-time streaming: I think it would just be good to see it in Spark. I really like the current Spark model, but there are many voices saying "we need more"; the community should also listen to them and try to help. With SIPs it would be easier; I've just posted this example as a "thing that may be changed with a SIP".

I really like the unification via Datasets, but there are a lot of algorithms inside. Let's make an easy API, but with strong background material (articles, benchmarks, descriptions, etc.) that shows that Spark is still a modern framework.

Maybe now my intention will be clearer :) As I said, the organizational ideas were already mentioned and I agree with them; my mail was just to show some aspects from my side, the side of a developer and a person who is trying to help others with Spark (via StackOverflow or other ways).

Best regards,

Tomasz

________________________________
From: Cody Koeninger <c...@koeninger.org>
Sent: 17 October 2016 16:46
To: Debasish Das
Cc: Tomasz Gawęda; dev@spark.apache.org
Subject: Re: Spark Improvement Proposals

I think narrowly focusing on Flink or benchmarks is missing my point.

My point is evolve or die. Spark's governance and organization is hampering its ability to evolve technologically, and it needs to change.
On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Thanks Cody for bringing up a valid point... I picked up Spark in 2014 as soon as I looked into it, since compared to writing Java map-reduce and Cascading code, Spark made writing distributed code fun... But now, as we go deeper with Spark and the real-time streaming use case gets more prominent, I think it is time to bring a messaging model in conjunction with the batch/micro-batch API that Spark is good at... akka-streams' close integration with Spark micro-batching APIs looks like a great direction to stay in the game with Apache Flink... Spark 2.0 integrated streaming with batch under the assumption that micro-batching is sufficient to run SQL commands on a stream, but do we really have time to do SQL processing on streaming data within 1-2 seconds?

After reading the email chain, I started to look into the Flink documentation, and if you compare it with the Spark documentation, I think we have major work to do detailing out Spark internals, so that more people from the community start to take an active role in improving the issues and Spark stays strong compared to Flink.

https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals

https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals

Spark is no longer an engine that works only for micro-batch and batch... We (and I am sure many others) are pushing Spark as an engine for stream and query processing... We need to make it a state-of-the-art engine for high-speed streaming data and user queries as well!

On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:

Hi everyone,

I'm quite late with my answer, but I think my suggestions may help a little bit. :) Many technical and organizational topics were mentioned, but I want to focus on the negative posts about Spark and about "haters".

I really like Spark. Ease of use, speed, a very good community -- it's all here. But every project has to fight on the "framework market" to stay number one. I'm following many Spark and Big Data communities; maybe my mail will inspire someone :)

You (every Spark developer; so far I didn't have enough time to start contributing to Spark) have done an excellent job. So why are some people saying that Flink (or another framework) is better, as was posted on this mailing list? No, not because that framework is better in all cases. In my opinion, many of these discussions were started after Flink marketing-like posts. Please look at the StackOverflow "Flink vs ..." posts: almost every one is "won" by Flink. The answers sometimes say nothing about other frameworks; Flink's users (often PMC members) just post the same information about real-time streaming, delta iterations, etc. It looks smart, and very often it is marked as the answer, even if, in my opinion, not the whole truth was told.

My suggestion: I don't have enough money and knowledge to perform a huge performance test. Maybe some company that supports Spark (Databricks, Cloudera? - just saying you're the most visible in the community :) ) could perform performance tests of:

- the streaming engine -- probably Spark will lose because of the mini-batch model; however, currently the difference should be much lower than in previous versions

- Machine Learning models

- batch jobs

- graph jobs

- SQL queries

People will see that Spark is evolving and is also a modern framework, because after reading the posts mentioned above people may think "it is outdated, the future is in framework X".

Matei Zaharia posted an excellent blog post about how Spark Structured Streaming beats every other framework in terms of ease of use and reliability. Performance tests, done in various environments (for example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node cluster), could also be very good marketing material: "hey, you're telling us you're better, but Spark is still faster and is still getting faster!". This would be based on facts (just numbers), not opinions. It would be good for companies, for marketing purposes, and for every Spark developer.

Second: real-time streaming. I wrote some time ago about real-time streaming support in Spark Structured Streaming. Some work should be done to make SSS lower latency, but I think it's possible. Maybe Spark could look at Gearpump, which is also built on top of Akka? I don't know yet; it is a good topic for a SIP. However, I think that Spark should have real-time streaming support. Currently I see many posts/comments saying "Spark has too high latency". Spark Streaming is doing a very good job with micro-batches, but I think it is possible to also add more real-time processing.

Other people have said much more, and I agree with the proposal of SIPs. I'm also happy that the PMC members are not saying that they will not listen to users; they really want to make Spark better for every user.

What do you think about these two topics? Especially, I'm looking at Cody (who started this topic) and the PMC members :)

Best regards,

Tomasz

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

--
Ryan Blue
Software Engineer
Netflix

--
Joseph Bradley
Software Engineer - Machine Learning
Databricks, Inc.
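[Editor's note] The thread converges on a concrete pass criterion: at least three binding +1 votes, no binding -1 (veto), and a clearly announced vote that stays open at least 72 hours. The following is a minimal sketch of that rule, not any official Apache tooling; the `Vote` dataclass, `spip_passes` function, and all names are hypothetical, purely to illustrate the rule as stated in the thread.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical model of the pass criterion discussed in the thread:
# consensus = at least three binding +1 votes, no binding -1 (veto),
# and the vote must have remained open for at least 72 hours.

@dataclass
class Vote:
    voter: str
    value: int        # +1, 0, or -1
    binding: bool     # True for PMC members (binding votes)

def spip_passes(votes, opened_at, closed_at, min_hours=72):
    """Return True if the proposal meets the thread's consensus rule."""
    if closed_at - opened_at < timedelta(hours=min_hours):
        return False  # vote was not open long enough
    binding = [v for v in votes if v.binding]
    plus_ones = sum(1 for v in binding if v.value == +1)
    vetoed = any(v.value == -1 for v in binding)
    return plus_ones >= 3 and not vetoed
```

Note that non-binding votes are counted only as input, not toward the threshold, which matches the distinction between PMC and community votes raised at the top of the thread.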