Updated. Any feedback from other community members?
On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <c...@koeninger.org> wrote:

> Thanks for doing that.
>
> Given that there are at least 4 different Apache voting processes, "typical Apache vote process" isn't meaningful to me.
>
> I think the intention is that in order to pass, it needs at least 3 +1 votes from PMC members *and no -1 votes from PMC members*. But the document doesn't explicitly say that second part.
>
> There's also no mention of the duration a vote should remain open. There's a mention of a month for finding a shepherd, but that's different.
>
> Other than that, LGTM.
>
> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> Here's a new draft that incorporated most of the feedback:
>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#
>>
>> I added a specific role for SPIP Author and another one for SPIP Shepherd.
>>
>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <gatorsm...@gmail.com> wrote:
>>
>>> During the summit, I also had a lot of discussions over similar topics with multiple Committers and active users. I heard many fantastic ideas. I believe Spark improvement proposals are good channels to collect the requirements/designs.
>>>
>>> IMO, we also need to consider the priority when working on these items. Even if the proposal is accepted, it does not mean it will be implemented and merged immediately. It is not a FIFO queue.
>>>
>>> Even if some PRs are merged, sometimes we still have to revert them if the design and implementation are not reviewed carefully. We have to ensure our quality. Spark is not application software. It is infrastructure software that is being used by many, many companies. We have to be very careful in the design and implementation, especially when adding/changing the external APIs.
>>>
>>> When I developed Mainframe infrastructure/middleware software over the past 6 years, I was involved in the discussions with external/internal customers. The to-do feature list was always above 100. Sometimes the customers felt frustrated when we were unable to deliver on time due to resource limits and other factors. Even if they paid us billions, we still needed to do it phase by phase, or sometimes they had to accept workarounds. That is the reality everyone has to face, I think.
>>>
>>> Thanks,
>>>
>>> Xiao Li
>>>
>>> 2017-02-11 7:57 GMT-08:00 Cody Koeninger <c...@koeninger.org>:
>>>
>>>> At the spark summit this week, everyone from PMC members to users I had never met before was asking me about the Spark improvement proposals idea. It's clear that it's a real community need.
>>>>
>>>> But it's been almost half a year, and nothing visible has been done.
>>>>
>>>> Reynold, are you going to do this?
>>>>
>>>> If so, when?
>>>>
>>>> If not, why?
>>>>
>>>> You already did the right thing by including long-deserved committers. Please keep doing the right thing for the community.
>>>>
>>>> On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> +1 on all counts (consensus, time bound, define roles)
>>>>>
>>>>> I can update the doc in the next few days and share back. Then maybe we can just officially vote on this. As Tim suggested, we might not get it 100% right the first time and would need to re-iterate. But that's fine.
>>>>>
>>>>> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <timhun...@databricks.com> wrote:
>>>>>
>>>>>> Hi Cody,
>>>>>> thank you for bringing up this topic, I agree it is very important to keep a cohesive community around some common, fluid goals. Here are a few comments about the current document:
>>>>>>
>>>>>> 1. name: it should not overlap with an existing one such as SIP.
>>>>>> Can you imagine someone trying to discuss a scala spore proposal for spark? "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP sounds great.
>>>>>>
>>>>>> 2. roles: at a high level, SPIPs are meant to reach consensus for technical decisions with a lasting impact. As such, the template should emphasize the role of the various parties during this process:
>>>>>>
>>>>>> - the SPIP author is responsible for building consensus. She is the champion driving the process forward and is responsible for ensuring that the SPIP follows the general guidelines. The author should be identified in the SPIP. The authorship of a SPIP can be transferred if the current author is not interested and someone else wants to move the SPIP forward. There should probably be 2-3 authors at most for each SPIP.
>>>>>>
>>>>>> - someone with voting power should probably shepherd the SPIP (and be recorded as such): ensuring that the final decision over the SPIP is recorded (rejected, accepted, etc.), and advising about the technical quality of the SPIP: this person need not be a champion for the SPIP or contribute to it, but rather makes sure it stands a chance of being approved when the vote happens. Also, if the author cannot find anyone who would want to take this role, this proposal is likely to be rejected anyway.
>>>>>>
>>>>>> - users, committers, contributors have the roles already outlined in the document
>>>>>>
>>>>>> 3. timeline: ideally, once a SPIP has been offered for voting, it should move swiftly into either being accepted or rejected, so that we do not end up with a distracting long tail of half-hearted proposals.
>>>>>>
>>>>>> These rules are meant to be flexible, but the current document should be clear about who is in charge of a SPIP, and the state it is currently in.
>>>>>>
>>>>>> We have had long discussions over some very important questions such as approval. I do not have an opinion on these, but why not make a pick and reevaluate this decision later? This is not a binding process at this point.
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>>
>>>>>>> I don't have a concern about voting vs consensus.
>>>>>>>
>>>>>>> I have a concern that whatever the decision making process is, it is explicitly announced on the ticket for the given proposal, with an explicit deadline, and an explicit outcome.
>>>>>>>
>>>>>>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <iras...@cloudera.com> wrote:
>>>>>>>
>>>>>>>> I'm also in favor of this. Thanks for your persistence Cody.
>>>>>>>>
>>>>>>>> My take on the specific issues Joseph mentioned:
>>>>>>>>
>>>>>>>> 1) voting vs. consensus -- I agree with the argument Ryan Blue made earlier for consensus:
>>>>>>>>
>>>>>>>> > Majority vs consensus: My rationale is that I don't think we want to consider a proposal approved if it had objections serious enough that committers down-voted (or PMC depending on who gets a vote). If these proposals are like PEPs, then they represent a significant amount of community effort and I wouldn't want to move forward if up to half of the community thinks it's an untenable idea.
>>>>>>>>
>>>>>>>> 2) Design doc template -- agree this would be useful, but also seems totally orthogonal to moving forward on the SIP proposal.
>>>>>>>>
>>>>>>>> 3) agree w/ Joseph's proposal for updating the template.
>>>>>>>>
>>>>>>>> One small addition:
>>>>>>>>
>>>>>>>> 4) Deciding on a name -- minor, but I think it's worth disambiguating from Scala's SIPs, and the best proposal I've heard is "SPIP". At least, no one has objected.
>>>>>>>> (don't care enough that I'd object to anything else, though.)
>>>>>>>>
>>>>>>>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <jos...@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Cody,
>>>>>>>>>
>>>>>>>>> Thanks for being persistent about this. I too would like to see this happen. Reviewing the thread, it sounds like the main things remaining are:
>>>>>>>>> * Decide about a few issues
>>>>>>>>> * Finalize the doc(s)
>>>>>>>>> * Vote on this proposal
>>>>>>>>>
>>>>>>>>> Issues & TODOs:
>>>>>>>>>
>>>>>>>>> (1) The main issue I see above is voting vs. consensus. I have little preference here. It sounds like something which could be tailored based on whether we see too many or too few SIPs being approved.
>>>>>>>>>
>>>>>>>>> (2) Design doc template (This would be great to have for Spark regardless of this SIP discussion.)
>>>>>>>>> * Reynold, are you still putting this together?
>>>>>>>>>
>>>>>>>>> (3) Template cleanups. Listing some items mentioned above + a new one w.r.t. Reynold's draft <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>:
>>>>>>>>> * Reinstate the "Where" section with links to current and past SIPs
>>>>>>>>> * Add field for stating explicit deadlines for approval
>>>>>>>>> * Add field for stating Author & Committer shepherd
>>>>>>>>>
>>>>>>>>> Thanks all!
>>>>>>>>> Joseph
>>>>>>>>>
>>>>>>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>>>>>
>>>>>>>>>> I'm bumping this one more time for the new year, and then I'm giving up.
>>>>>>>>>>
>>>>>>>>>> Please, fix your process, even if it isn't exactly the way I suggested.
>>>>>>>>>>
>>>>>>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>>>> > On lazy consensus as opposed to voting:
>>>>>>>>>> >
>>>>>>>>>> > First, why lazy consensus?
>>>>>>>>>> > The proposal was for consensus, which is at least three +1 votes and no vetoes. Consensus has no losing side, it requires getting to a point where there is agreement. Isn't that agreement what we want to achieve with these proposals?
>>>>>>>>>> >
>>>>>>>>>> > Second, lazy consensus only removes the requirement for three +1 votes. Why would we not want at least three committers to think something is a good idea before adopting the proposal?
>>>>>>>>>> >
>>>>>>>>>> > rb
>>>>>>>>>> >
>>>>>>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>>>>>> >>
>>>>>>>>>> >> So there are some minor things (the Where section heading appears to be dropped; wherever this document is posted it needs to actually link to a jira filter showing current / past SIPs) but it doesn't look like I can comment on the google doc.
>>>>>>>>>> >>
>>>>>>>>>> >> The major substantive issue that I have is that this version is significantly less clear as to the outcome of an SIP.
>>>>>>>>>> >>
>>>>>>>>>> >> The apache example of lazy consensus at http://apache.org/foundation/voting.html#LazyConsensus involves an explicit announcement of an explicit deadline, which I think are necessary for clarity.
>>>>>>>>>> >>
>>>>>>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>>> >> > It turned out suggested edits (trackable) don't show up for non-owners, so I've just merged all the edits in place. It should be visible now.
>>>>>>>>>> >> >
>>>>>>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Oops. Let me try to figure that out.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>> Thanks for picking up on this.
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>> Maybe I fail at google docs, but I can't see any edits on the document you linked.
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>> Regarding lazy consensus, if the board in general has less of an issue with that, sure. As long as it is clearly announced, lasts at least 72 hours, and has a clear outcome.
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>> The other points are hard to comment on without being able to see the text in question.
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>>> >> >>> > I just looked through the entire thread again tonight - there are a lot of great ideas being discussed. Thanks Cody for taking the first crack at the proposal.
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > I want to first comment on the context. Spark is one of the most innovative and important projects in (big) data -- overall technical decisions made in Apache Spark are sound.
>>>>>>>>>> >> >>> > But of course, a project as large and active as Spark always has room for improvement, and we as a community should strive to take it to the next level.
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > To that end, the two biggest areas for improvements in my opinion are:
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > 1. Visibility: There is so much happening that it is difficult to know what really is going on. For people that don't follow closely, it is difficult to know what the important initiatives are. Even for people that do follow, it is difficult to know what specific things require their attention, since the number of pull requests and JIRA tickets is high and it's difficult to extract signal from noise.
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > 2. Solicit user (broadly defined, including developers themselves) input more proactively: At the end of the day the project provides value because users use it. Users can't tell us exactly what to build, but it is important to get their inputs.
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > I've taken Cody's doc and edited it:
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>>>>>>>>> >> >>> > (I've made all my modifications trackable)
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > There are a couple of high-level changes I made:
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > 1. I've consulted a board member and he recommended lazy consensus as opposed to voting. The reason being in voting there can easily be a "loser" that gets outvoted.
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional design sketch". Echoing one of the earlier email: "IMHO so far aside from tagging things and linking them elsewhere simply having design docs and prototypes implementations in PRs is not something that has not worked so far".
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > 3. I made some language tweaks to focus more on visibility. For example, "The purpose of an SIP is to inform and involve", rather than just "involve". SIPs should also have at least two emails that go to dev@.
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > While I was editing this, I thought we really needed a suggested template for design doc too. I will get to that too ...
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>>> >> >>> >>
>>>>>>>>>> >> >>> >> Most things looked OK to me too, although I do plan to take a closer look after Nov 1st when we cut the release branch for 2.1.
>>>>>>>>>> >> >>> >>
>>>>>>>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>>>>>>>>> >> >>> >>>
>>>>>>>>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's not explicitly called out, that voting would happen by e-mail? A template for the proposal document (instead of just a bullet list) would also be nice, but that can be done at any time.
>>>>>>>>>> >> >>> >>>
>>>>>>>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a candidate for a SIP, given the scope of the work. The document attached even somewhat matches the proposed format. So if anyone wants to try out the process...
>>>>>>>>>> >> >>> >>>
>>>>>>>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>>>>>> >> >>> >>> > Now that spark summit europe is over, are any committers interested in moving forward with this?
>>>>>>>>>> >> >>> >>> >
>>>>>>>>>> >> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>>>>>> >> >>> >>> >
>>>>>>>>>> >> >>> >>> > Or are we going to let this discussion die on the vine?
>>>>>>>>>> >> >>> >>> >
>>>>>>>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
>>>>>>>>>> >> >>> >>> >> Maybe my mail was not clear enough.
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> I didn't want to write "let's focus on Flink" or any other framework. The idea with benchmarks was to show two things:
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> - why some people are doing bad PR for Spark
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> - how - in an easy way - we can change it and show that Spark is still on top
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think they're the most important thing in Spark :) On the Spark main page there is still the chart "Spark vs Hadoop". It is important to show that the framework is not the same Spark with another API, but much faster and optimized, comparable to or even faster than other frameworks.
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> About real-time streaming, I think it would be just good to see it in Spark. I really like the current Spark model, but there are many voices saying "we need more" - the community should also listen to them and try to help them. With SIPs it would be easier; I've just posted this example as a "thing that may be changed with a SIP".
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> I really like unification via Datasets, but there are a lot of algorithms inside - let's make an easy API, but with strong background (articles, benchmarks, descriptions, etc.) that shows that Spark is still a modern framework.
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said, organizational ideas were already mentioned and I agree with them; my mail was just to show some aspects from my side, that is, from the side of a developer and a person who is trying to help others with Spark (via StackOverflow or other ways)
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> Best regards,
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> Tomasz
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> ________________________________
>>>>>>>>>> >> >>> >>> >> From: Cody Koeninger <c...@koeninger.org>
>>>>>>>>>> >> >>> >>> >> Sent: 17 October 2016 16:46
>>>>>>>>>> >> >>> >>> >> To: Debasish Das
>>>>>>>>>> >> >>> >>> >> Cc: Tomasz Gawęda; dev@spark.apache.org
>>>>>>>>>> >> >>> >>> >> Subject: Re: Spark Improvement Proposals
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my point.
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> My point is evolve or die. Spark's governance and organization is hampering its ability to evolve technologically, and it needs to change.
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point... I picked up Spark in 2014 as soon as I looked into it, since compared to writing Java map-reduce and Cascading code, Spark made writing distributed code fun... But now that we have gone deeper with Spark and the real-time streaming use case is getting more prominent, I think it is time to bring a messaging model in conjunction with the batch/micro-batch API that Spark is good at... Close integration of akka-streams with the spark micro-batching APIs looks like a great direction to stay in the game with Apache Flink... Spark 2.0 integrated streaming with batch with the assumption that micro-batching is sufficient to run SQL commands on a stream, but do we really have time to do SQL processing on streaming data within 1-2 seconds?
>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>> >> >>> >>> >>> After reading the email chain, I started to look into the Flink documentation, and if you compare it with the Spark documentation, I think we have major work to do detailing out Spark internals so that more people from the community start to take an active role in improving the issues, so that Spark stays strong compared to Flink.
>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>> >> >>> >>> >>> Spark is no longer an engine that works for micro-batch and batch... We (and I am sure many others) are pushing spark as an engine for stream and query processing... we need to make it a state-of-the-art engine for high speed streaming data and user queries as well!
>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> Hi everyone,
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions may help a little bit. :) Many technical and organizational topics were mentioned, but I want to focus on these negative posts about Spark and about "haters"
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> I really like Spark. Ease of use, speed, very good community - it's everything here. But every project has to "fight" on the "framework market" to stay no. 1. I'm following many Spark and Big Data communities, maybe my mail will inspire someone :)
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have enough time to start contributing to Spark) have done an excellent job. So why are some people saying that Flink (or another framework) is better, like it was posted in this mailing list? No, not because that framework is better in all cases.
>>>>>>>>>> >> >>> >>> >>>> In my opinion, many of these discussions were started after Flink marketing-like posts. Please look at StackOverflow "Flink vs ...." posts: almost every post is "won" by Flink. Answers sometimes say nothing about other frameworks; Flink's users (often PMC members) are just posting the same information about real-time streaming, about delta iterations, etc. It looks smart and very often it is marked as an answer, even if - in my opinion - the whole truth wasn't told.
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> My suggestion: I don't have enough money and knowledge to perform a huge performance test. Maybe some company that supports Spark (Databricks, Cloudera?
>>>>>>>>>> >> >>> >>> >>>> - just saying you're the most visible in the community :) ) could perform a performance test of:
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> - streaming engine - probably Spark will lose because of the mini-batch model, however currently the difference should be much lower than in previous versions
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> - Machine Learning models
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> - batch jobs
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> - Graph jobs
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> - SQL queries
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> People will see that Spark is evolving and is also a modern framework, because after reading the posts mentioned above people may think "it is outdated, the future is in framework X".
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> Matei Zaharia posted an excellent blog post about how Spark Structured Streaming beats every other framework in terms of ease of use and reliability.
>>>>>>>>>> >> >>> >>> >>>> Performance tests, done in various environments (for example: laptop, small 2-node cluster, 10-node cluster, 20-node cluster), could also be very good marketing material to say "hey, you're saying that you're better, but Spark is still faster and is still getting even faster!". This would be based on facts (just numbers), not opinions. It would be good for companies, for marketing purposes and for every Spark developer
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> Second: real-time streaming. I wrote some time ago about real-time streaming support in Spark Structured Streaming. Some work should be done to make SSS more low-latency, but I think it's possible. Maybe Spark may look at Gearpump, which is also built on top of Akka? I don't know yet, it is a good topic for a SIP. However I think that Spark should have real-time streaming support. Currently I see many posts/comments that "Spark has too big latency".
Spark Streaming does a very good job with micro-batches; however, I think it is also possible to add more real-time processing.
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> Other people have said much more, and I agree with the SIP proposal. I'm also happy that the PMC members are not saying that they will not listen to users, but that they really want to make Spark better for every user.
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> What do you think about these two topics? I'm looking especially at Cody (who started this topic) and the PMC members :)
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> Pozdrawiam / Best regards,
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> Tomasz
>>>>>>>>>> >>
>>>>>>>>>> >> ---------------------------------------------------------------------
>>>>>>>>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>>> >
>>>>>>>>>> > --
>>>>>>>>>> > Ryan Blue
>>>>>>>>>> > Software Engineer
>>>>>>>>>> > Netflix
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Joseph Bradley
>>>>>>>>> Software Engineer - Machine Learning
>>>>>>>>> Databricks, Inc.
>>>>>>>>> <http://databricks.com/>