At Spark Summit this week, everyone from PMC members to users I had never met before was asking me about the Spark improvement proposals idea. It's clear that it's a real community need.
But it's been almost half a year, and nothing visible has been done.

Reynold, are you going to do this? If so, when? If not, why?

You already did the right thing by including long-deserving committers. Please keep doing the right thing for the community.

On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <r...@databricks.com> wrote:

> +1 on all counts (consensus, time bound, define roles)
>
> I can update the doc in the next few days and share back. Then maybe we can just officially vote on this. As Tim suggested, we might not get it 100% right the first time and would need to iterate. But that's fine.
>
> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <timhun...@databricks.com> wrote:
>
>> Hi Cody,
>> thank you for bringing up this topic, I agree it is very important to keep a cohesive community around some common, fluid goals. Here are a few comments about the current document:
>>
>> 1. name: it should not overlap with an existing one such as SIP. Can you imagine someone trying to discuss a Scala spore proposal for Spark? "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP sounds great.
>>
>> 2. roles: at a high level, SPIPs are meant to reach consensus for technical decisions with a lasting impact. As such, the template should emphasize the role of the various parties during this process:
>>
>> - the SPIP author is responsible for building consensus. She is the champion driving the process forward and is responsible for ensuring that the SPIP follows the general guidelines. The author should be identified in the SPIP. The authorship of a SPIP can be transferred if the current author is not interested and someone else wants to move the SPIP forward. There should probably be 2-3 authors at most for each SPIP.
>>
>> - someone with voting power should probably shepherd the SPIP (and be recorded as such): ensuring that the final decision on the SPIP is recorded (rejected, accepted, etc.), and advising about the technical quality of the SPIP. This person need not be a champion for the SPIP or contribute to it, but rather makes sure it stands a chance of being approved when the vote happens. Also, if the author cannot find anyone who wants to take this role, the proposal is likely to be rejected anyway.
>>
>> - users, committers, and contributors have the roles already outlined in the document
>>
>> 3. timeline: ideally, once a SPIP has been offered for voting, it should move swiftly into either being accepted or rejected, so that we do not end up with a distracting long tail of half-hearted proposals.
>>
>> These rules are meant to be flexible, but the current document should be clear about who is in charge of a SPIP, and the state it is currently in.
>>
>> We have had long discussions over some very important questions such as approval. I do not have an opinion on these, but why not make a pick and reevaluate this decision later? This is not a binding process at this point.
>>
>> Tim
>>
>> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>>> I don't have a concern about voting vs consensus.
>>>
>>> I have a concern that whatever the decision making process is, it is explicitly announced on the ticket for the given proposal, with an explicit deadline, and an explicit outcome.
>>>
>>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <iras...@cloudera.com> wrote:
>>>
>>>> I'm also in favor of this. Thanks for your persistence, Cody.
>>>>
>>>> My take on the specific issues Joseph mentioned:
>>>>
>>>> 1) voting vs. consensus -- I agree with the argument Ryan Blue made earlier for consensus:
>>>>
>>>> > Majority vs consensus: My rationale is that I don't think we want to consider a proposal approved if it had objections serious enough that committers down-voted (or PMC depending on who gets a vote). If these proposals are like PEPs, then they represent a significant amount of community effort and I wouldn't want to move forward if up to half of the community thinks it's an untenable idea.
>>>>
>>>> 2) Design doc template -- agree this would be useful, but also seems totally orthogonal to moving forward on the SIP proposal.
>>>>
>>>> 3) agree w/ Joseph's proposal for updating the template.
>>>>
>>>> One small addition:
>>>>
>>>> 4) Deciding on a name -- minor, but I think it's worth disambiguating from Scala's SIPs, and the best proposal I've heard is "SPIP". At least, no one has objected. (I don't care enough that I'd object to anything else, though.)
>>>>
>>>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <jos...@databricks.com> wrote:
>>>>
>>>>> Hi Cody,
>>>>>
>>>>> Thanks for being persistent about this. I too would like to see this happen. Reviewing the thread, it sounds like the main things remaining are:
>>>>> * Decide about a few issues
>>>>> * Finalize the doc(s)
>>>>> * Vote on this proposal
>>>>>
>>>>> Issues & TODOs:
>>>>>
>>>>> (1) The main issue I see above is voting vs. consensus. I have little preference here. It sounds like something which could be tailored based on whether we see too many or too few SIPs being approved.
>>>>>
>>>>> (2) Design doc template (This would be great to have for Spark regardless of this SIP discussion.)
>>>>> * Reynold, are you still putting this together?
>>>>>
>>>>> (3) Template cleanups. Listing some items mentioned above + a new one w.r.t. Reynold's draft <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>:
>>>>> * Reinstate the "Where" section with links to current and past SIPs
>>>>> * Add field for stating explicit deadlines for approval
>>>>> * Add field for stating Author & Committer shepherd
>>>>>
>>>>> Thanks all!
>>>>> Joseph
>>>>>
>>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>
>>>>>> I'm bumping this one more time for the new year, and then I'm giving up.
>>>>>>
>>>>>> Please, fix your process, even if it isn't exactly the way I suggested.
>>>>>>
>>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>> > On lazy consensus as opposed to voting:
>>>>>> >
>>>>>> > First, why lazy consensus? The proposal was for consensus, which is at least three +1 votes and no vetoes. Consensus has no losing side; it requires getting to a point where there is agreement. Isn't that agreement what we want to achieve with these proposals?
>>>>>> >
>>>>>> > Second, lazy consensus only removes the requirement for three +1 votes. Why would we not want at least three committers to think something is a good idea before adopting the proposal?
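To make the distinction above concrete, here is a toy sketch in Scala of the two decision rules being discussed: consensus approval as Ryan describes it (at least three +1 votes and no vetoes) versus lazy consensus with an announced deadline. The vote encoding and the names here are invented purely for illustration; they are not from the proposal doc.

    // Toy model, not project code: votes are +1 / -1 integers from committers.
    object VoteRules {
      // Consensus as described above: at least three +1 votes and no vetoes.
      def consensus(votes: Seq[Int]): Boolean =
        votes.count(_ == 1) >= 3 && !votes.contains(-1)

      // Lazy consensus drops the three-+1 requirement: once the announced
      // deadline passes, only a veto (-1) blocks the proposal.
      def lazyConsensus(votes: Seq[Int], deadlinePassed: Boolean): Boolean =
        deadlinePassed && !votes.contains(-1)
    }

For example, Seq(1, 1) fails the first rule (only two +1s) but passes lazy consensus once the deadline has passed, which is exactly the gap Ryan is pointing at.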
>>>>>> >
>>>>>> > rb
>>>>>> >
>>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>> >>
>>>>>> >> So there are some minor things (the Where section heading appears to be dropped; wherever this document is posted it needs to actually link to a JIRA filter showing current / past SIPs) but it doesn't look like I can comment on the google doc.
>>>>>> >>
>>>>>> >> The major substantive issue that I have is that this version is significantly less clear as to the outcome of an SIP.
>>>>>> >>
>>>>>> >> The Apache example of lazy consensus at http://apache.org/foundation/voting.html#LazyConsensus involves an explicit announcement of an explicit deadline, which I think are necessary for clarity.
>>>>>> >>
>>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>> >> > It turned out suggested edits (trackable) don't show up for non-owners, so I've just merged all the edits in place. It should be visible now.
>>>>>> >> >
>>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>>> >> >>
>>>>>> >> >> Oops. Let me try to figure that out.
>>>>>> >> >>
>>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>> >> >>>
>>>>>> >> >>> Thanks for picking up on this.
>>>>>> >> >>>
>>>>>> >> >>> Maybe I fail at google docs, but I can't see any edits on the document you linked.
>>>>>> >> >>>
>>>>>> >> >>> Regarding lazy consensus, if the board in general has less of an issue with that, sure. As long as it is clearly announced, lasts at least 72 hours, and has a clear outcome.
>>>>>> >> >>>
>>>>>> >> >>> The other points are hard to comment on without being able to see the text in question.
>>>>>> >> >>>
>>>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>>> >> >>> > I just looked through the entire thread again tonight - there are a lot of great ideas being discussed. Thanks Cody for taking the first crack at the proposal.
>>>>>> >> >>> >
>>>>>> >> >>> > I want to first comment on the context. Spark is one of the most innovative and important projects in (big) data -- overall, the technical decisions made in Apache Spark are sound. But of course, a project as large and active as Spark always has room for improvement, and we as a community should strive to take it to the next level.
>>>>>> >> >>> >
>>>>>> >> >>> > To that end, the two biggest areas for improvement in my opinion are:
>>>>>> >> >>> >
>>>>>> >> >>> > 1. Visibility: There is so much happening that it is difficult to know what really is going on. For people that don't follow closely, it is difficult to know what the important initiatives are.
>>>>>> >> >>> > Even for people that do follow, it is difficult to know what specific things require their attention, since the number of pull requests and JIRA tickets is high and it's difficult to extract signal from noise.
>>>>>> >> >>> >
>>>>>> >> >>> > 2. Solicit user (broadly defined, including developers themselves) input more proactively: At the end of the day the project provides value because users use it. Users can't tell us exactly what to build, but it is important to get their input.
>>>>>> >> >>> >
>>>>>> >> >>> > I've taken Cody's doc and edited it: https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b (I've made all my modifications trackable)
>>>>>> >> >>> >
>>>>>> >> >>> > There are a couple of high-level changes I made:
>>>>>> >> >>> >
>>>>>> >> >>> > 1. I've consulted a board member and he recommended lazy consensus as opposed to voting. The reason being that in voting there can easily be a "loser" that gets outvoted.
>>>>>> >> >>> >
>>>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional design sketch". Echoing one of the earlier emails: "IMHO so far aside from tagging things and linking them elsewhere simply having design docs and prototype implementations in PRs is something that has worked so far".
>>>>>> >> >>> >
>>>>>> >> >>> > 3. I made some language tweaks to focus more on visibility. For example, "The purpose of an SIP is to inform and involve", rather than just "involve". SIPs should also have at least two emails that go to dev@.
>>>>>> >> >>> >
>>>>>> >> >>> > While I was editing this, I thought we really needed a suggested template for design docs too. I will get to that too ...
>>>>>> >> >>> >
>>>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>>> >> >>> >>
>>>>>> >> >>> >> Most things looked OK to me too, although I do plan to take a closer look after Nov 1st when we cut the release branch for 2.1.
>>>>>> >> >>> >>
>>>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>>>>> >> >>> >>>
>>>>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's not explicitly called out, that voting would happen by e-mail? A template for the proposal document (instead of just a bullet list) would also be nice, but that can be done at any time.
>>>>>> >> >>> >>>
>>>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085, which I consider a candidate for a SIP, given the scope of the work. The document attached even somewhat matches the proposed format.
>>>>>> >> >>> >>> So if anyone wants to try out the process...
>>>>>> >> >>> >>>
>>>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>> >> >>> >>> > Now that Spark Summit Europe is over, are any committers interested in moving forward with this?
>>>>>> >> >>> >>> >
>>>>>> >> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>> >> >>> >>> >
>>>>>> >> >>> >>> > Or are we going to let this discussion die on the vine?
>>>>>> >> >>> >>> >
>>>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
>>>>>> >> >>> >>> >> Maybe my mail was not clear enough.
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> I didn't want to write "let's focus on Flink" or any other framework. The idea with benchmarks was to show two things:
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> - why some people are doing bad PR for Spark
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> - how - in an easy way - we can change that and show that Spark is still on top
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think they're the most important thing in Spark :) On the Spark main page there is still the "Spark vs Hadoop" chart. It is important to show that the framework is not the same Spark with a different API, but much faster and more optimized, comparable to or even faster than other frameworks.
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> About real-time streaming: I think it would simply be good to see it in Spark. I really like the current Spark model, but there are many voices saying "we need more" - the community should listen to them too and try to help them. With SIPs it would be easier; I've just posted this example as a "thing that may be changed with a SIP".
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> I really like the unification via Datasets, but there are a lot of algorithms inside - let's make an easy API, but with a strong background (articles, benchmarks, descriptions, etc.) that shows that Spark is still a modern framework.
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said, organizational ideas were already mentioned and I agree with them; my mail was just to show some aspects from my side, that is, from the side of a developer and a person who is trying to help others with Spark (via StackOverflow or other ways).
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> Pozdrawiam / Best regards,
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> Tomasz
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> ________________________________
>>>>>> >> >>> >>> >> From: Cody Koeninger <c...@koeninger.org>
>>>>>> >> >>> >>> >> Sent: 17 October 2016 16:46
>>>>>> >> >>> >>> >> To: Debasish Das
>>>>>> >> >>> >>> >> Cc: Tomasz Gawęda; dev@spark.apache.org
>>>>>> >> >>> >>> >> Subject: Re: Spark Improvement Proposals
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my point.
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> My point is evolve or die. Spark's governance and organization is hampering its ability to evolve technologically, and it needs to change.
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I picked up Spark in 2014 as soon as I looked into it since, compared to writing Java map-reduce and Cascading code, Spark made writing distributed code fun...But now as we went deeper with Spark and the real-time streaming use-case got more prominent, I think it is time to bring a messaging model in conjunction with the batch/micro-batch API that Spark is good at....akka-streams' close integration with Spark micro-batching APIs looks like a great direction to stay in the game with Apache Flink...Spark 2.0 integrated streaming with batch on the assumption that micro-batching is sufficient to run SQL commands on a stream, but do we really have time to do SQL processing on streaming data within 1-2 seconds?
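As a concrete reference point for the 1-2 second question above, the following is a minimal sketch of SQL-style aggregation over a stream with a one-second micro-batch trigger. It assumes a Spark 2.x build where Trigger.ProcessingTime is available; the socket source, host/port, and word-count query are illustrative placeholders, not anything proposed in the thread.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    object SqlOnStream {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("sql-on-stream").getOrCreate()
        import spark.implicits._

        // Illustrative source: lines of text read from a local socket.
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", "9999")
          .load()

        // A SQL-style aggregation (word count), recomputed each micro-batch.
        val counts = lines.as[String]
          .flatMap(_.split(" "))
          .groupBy("value")
          .count()

        // Micro-batches fire roughly every second, so end-to-end latency is
        // bounded below by the trigger interval plus per-batch planning cost.
        counts.writeStream
          .outputMode("complete")
          .format("console")
          .trigger(Trigger.ProcessingTime("1 second"))
          .start()
          .awaitTermination()
      }
    }

Whether that per-batch overhead fits inside a 1-2 second budget is exactly the kind of question an SIP with explicit goals could settle.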
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>> After reading the email chain, I started to look into the Flink documentation, and if you compare it with the Spark documentation, I think we have major work to do detailing out Spark internals so that more people from the community start to take an active role in improving the issues, so that Spark stays strong compared to Flink.
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>> Spark is no longer an engine that works only for micro-batch and batch...We (and I am sure many others) are pushing Spark as an engine for stream and query processing.....we need to make it a state-of-the-art engine for high-speed streaming data and user queries as well!
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Hi everyone,
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions may help a little bit. :) Many technical and organizational topics were mentioned, but I want to focus on these negative posts about Spark and about "haters".
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> I really like Spark. Ease of use, speed, a very good community - it's all here. But every project has to "fight" on the "framework market" to stay number one. I'm following many Spark and Big Data communities; maybe my mail will inspire someone :)
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have enough time to join in contributing to Spark) have done an excellent job. So why are some people saying that Flink (or another framework) is better, as was posted on this mailing list? No, not because that framework is better in all cases. In my opinion, many of these discussions were started after Flink marketing-like posts.
>>>>>> >> >>> >>> >>>> Please look at the StackOverflow "Flink vs ...." posts; almost every one is "won" by Flink. The answers sometimes say nothing about other frameworks; Flink's users (often PMC members) just post the same information about real-time streaming, about delta iterations, etc. It looks smart and very often it is marked as the answer, even if - in my opinion - the whole truth wasn't told.
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> My suggestion: I don't have enough money and knowledge to perform a huge performance test. Maybe some company that supports Spark (Databricks, Cloudera? - just saying, you're the most visible in the community :) ) could perform a performance test of:
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> - streaming engine - probably Spark will lose because of the mini-batch model, however currently the difference should be much lower than in previous versions
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> - Machine Learning models
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> - batch jobs
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> - Graph jobs
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> - SQL queries
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> People will see that Spark is evolving and is also a modern framework, because after reading the posts mentioned above people may think "it is outdated, the future is in framework X".
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Matei Zaharia posted an excellent blog post about how Spark Structured Streaming beats every other framework in terms of ease of use and reliability. Performance tests, done in various environments (for example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node cluster), could also be very good marketing material to say "hey, you're telling us that you're better, but Spark is still faster and is still getting even faster!". This would be based on facts (just numbers), not opinions. It would be good for companies, for marketing purposes, and for every Spark developer.
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Second: real-time streaming. I've written some time ago about real-time streaming support in Spark Structured Streaming.
>>>>>> >> >>> >>> >>>> Some work should be done to make SSS more low-latency, but I think it's possible. Maybe Spark could look at Gearpump, which is also built on top of Akka? I don't know yet; it is a good topic for a SIP. However, I think that Spark should have real-time streaming support. Currently I see many posts/comments saying that "Spark has too big a latency". Spark Streaming is doing a very good job with micro-batches, but I think it is possible to also add more real-time processing.
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Other people have said much more, and I agree with the SIP proposal. I'm also happy that the PMC members are not saying that they will not listen to users, but that they really want to make Spark better for every user.
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> What do you think about these two topics? Especially I'm looking at Cody (who started this topic) and the PMC :)
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Pozdrawiam / Best regards,
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Tomasz
>>>>>> >
>>>>>> > --
>>>>>> > Ryan Blue
>>>>>> > Software Engineer
>>>>>> > Netflix
>>>>>
>>>>> --
>>>>> Joseph Bradley
>>>>> Software Engineer - Machine Learning
>>>>> Databricks, Inc.