At Spark Summit this week, everyone from PMC members to users I had
never met before was asking me about the Spark improvement proposals
idea.  It's clear that this is a real community need.

But it's been almost half a year, and nothing visible has been done.

Reynold, are you going to do this?

If so, when?

If not, why?

You already did the right thing by including long-deserving committers.
Please keep doing the right thing for the community.

On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <r...@databricks.com> wrote:

> +1 on all counts (consensus, time bound, define roles)
>
> I can update the doc in the next few days and share it back. Then maybe we
> can just officially vote on this. As Tim suggested, we might not get it
> 100% right the first time and will need to iterate. But that's fine.
>
>
> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <timhun...@databricks.com>
> wrote:
>
>> Hi Cody,
>> thank you for bringing up this topic. I agree it is very important to
>> keep a cohesive community around some common, fluid goals. Here are a few
>> comments about the current document:
>>
>> 1. Name: it should not overlap with an existing one such as SIP. Can you
>> imagine someone trying to discuss a Scala spore proposal for Spark?
>> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP
>> sounds great.
>>
>> 2. Roles: at a high level, SPIPs are meant to reach consensus for
>> technical decisions with a lasting impact. As such, the template should
>> emphasize the role of the various parties during this process:
>>
>>  - the SPIP author is responsible for building consensus. She is the
>> champion driving the process forward and is responsible for ensuring that
>> the SPIP follows the general guidelines. The author should be identified in
>> the SPIP. The authorship of a SPIP can be transferred if the current author
>> is not interested and someone else wants to move the SPIP forward. There
>> should probably be 2-3 authors at most for each SPIP.
>>
>>  - someone with voting power should probably shepherd the SPIP (and be
>> recorded as such), ensuring that the final decision over the SPIP is
>> recorded (rejected, accepted, etc.) and advising on the technical
>> quality of the SPIP. This person need not be a champion for the SPIP or
>> contribute to it, but rather makes sure it stands a chance of being
>> approved when the vote happens. Also, if the author cannot find anyone
>> willing to take this role, the proposal is likely to be rejected anyway.
>>
>>  - users, committers, and contributors have the roles already outlined
>> in the document
>>
>> 3. Timeline: ideally, once a SPIP has been offered for voting, it should
>> move swiftly into either being accepted or rejected, so that we do not end
>> up with a distracting long tail of half-hearted proposals.
>>
>> These rules are meant to be flexible, but the current document should be
>> clear about who is in charge of a SPIP and what state it is currently in.
>>
>> We have had long discussions over some very important questions such as
>> approval. I do not have an opinion on these, but why not pick one and
>> reevaluate the decision later? This is not a binding process at this point.
>>
>> Tim
>>
>>
>> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> I don't have a concern about voting vs consensus.
>>>
>>> My concern is that whatever the decision-making process is, it should be
>>> explicitly announced on the ticket for the given proposal, with an explicit
>>> deadline and an explicit outcome.
>>>
>>>
>>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <iras...@cloudera.com>
>>> wrote:
>>>
>>>> I'm also in favor of this.  Thanks for your persistence, Cody.
>>>>
>>>> My take on the specific issues Joseph mentioned:
>>>>
>>>> 1) voting vs. consensus -- I agree with the argument Ryan Blue made
>>>> earlier for consensus:
>>>>
>>>> > Majority vs consensus: My rationale is that I don't think we want to
>>>> consider a proposal approved if it had objections serious enough that
>>>> committers down-voted (or PMC depending on who gets a vote). If these
>>>> proposals are like PEPs, then they represent a significant amount of
>>>> community effort and I wouldn't want to move forward if up to half of the
>>>> community thinks it's an untenable idea.
>>>>
>>>> 2) Design doc template -- agree this would be useful, but it also seems
>>>> totally orthogonal to moving forward on the SIP proposal.
>>>>
>>>> 3) agree w/ Joseph's proposal for updating the template.
>>>>
>>>> One small addition:
>>>>
>>>> 4) Deciding on a name -- minor, but I think it's worth disambiguating
>>>> from Scala's SIPs, and the best proposal I've heard is "SPIP".  At least,
>>>> no one has objected.  (I don't care enough that I'd object to anything
>>>> else, though.)
>>>>
>>>>
>>>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <jos...@databricks.com>
>>>> wrote:
>>>>
>>>>> Hi Cody,
>>>>>
>>>>> Thanks for being persistent about this.  I too would like to see this
>>>>> happen.  Reviewing the thread, it sounds like the main things remaining 
>>>>> are:
>>>>> * Decide about a few issues
>>>>> * Finalize the doc(s)
>>>>> * Vote on this proposal
>>>>>
>>>>> Issues & TODOs:
>>>>>
>>>>> (1) The main issue I see above is voting vs. consensus.  I have little
>>>>> preference here.  It sounds like something which could be tailored based 
>>>>> on
>>>>> whether we see too many or too few SIPs being approved.
>>>>>
>>>>> (2) Design doc template  (This would be great to have for Spark
>>>>> regardless of this SIP discussion.)
>>>>> * Reynold, are you still putting this together?
>>>>>
>>>>> (3) Template cleanups.  Listing some items mentioned above + a new one
>>>>> w.r.t. Reynold's draft
>>>>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>:
>>>>> * Reinstate the "Where" section with links to current and past SIPs
>>>>> * Add field for stating explicit deadlines for approval
>>>>> * Add field for stating Author & Committer shepherd
>>>>>
>>>>> Thanks all!
>>>>> Joseph
>>>>>
>>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <c...@koeninger.org>
>>>>> wrote:
>>>>>
>>>>>> I'm bumping this one more time for the new year, and then I'm giving
>>>>>> up.
>>>>>>
>>>>>> Please fix your process, even if it isn't exactly the way I
>>>>>> suggested.
>>>>>>
>>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>> > On lazy consensus as opposed to voting:
>>>>>> >
>>>>>> > First, why lazy consensus? The proposal was for consensus, which is
>>>>>> > at least three +1 votes and no vetoes. Consensus has no losing side;
>>>>>> > it requires getting to a point where there is agreement. Isn't that
>>>>>> > agreement what we want to achieve with these proposals?
>>>>>> >
>>>>>> > Second, lazy consensus only removes the requirement for three +1
>>>>>> > votes. Why would we not want at least three committers to think
>>>>>> > something is a good idea before adopting the proposal?
>>>>>> >
>>>>>> > rb
>>>>>> >
>>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <c...@koeninger.org>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> So there are some minor things (the Where section heading appears to
>>>>>> >> be dropped; wherever this document is posted it needs to actually
>>>>>> >> link to a JIRA filter showing current / past SIPs), but it doesn't
>>>>>> >> look like I can comment on the Google doc.
>>>>>> >>
>>>>>> >> The major substantive issue that I have is that this version is
>>>>>> >> significantly less clear as to the outcome of an SIP.
>>>>>> >>
>>>>>> >> The Apache example of lazy consensus at
>>>>>> >> http://apache.org/foundation/voting.html#LazyConsensus involves an
>>>>>> >> explicit announcement with an explicit deadline, both of which I
>>>>>> >> think are necessary for clarity.
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com>
>>>>>> wrote:
>>>>>> >> > It turned out that suggested edits (trackable) don't show up for
>>>>>> >> > non-owners, so I've just merged all the edits in place. It should
>>>>>> >> > be visible now.
>>>>>> >> >
>>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com>
>>>>>> >> > wrote:
>>>>>> >> >>
>>>>>> >> >> Oops. Let me try to figure that out.
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org>
>>>>>> wrote:
>>>>>> >> >>>
>>>>>> >> >>> Thanks for picking up on this.
>>>>>> >> >>>
>>>>>> >> >>> Maybe I fail at Google Docs, but I can't see any edits on the
>>>>>> >> >>> document you linked.
>>>>>> >> >>>
>>>>>> >> >>> Regarding lazy consensus, if the board in general has less of an
>>>>>> >> >>> issue with that, sure - as long as it is clearly announced, lasts
>>>>>> >> >>> at least 72 hours, and has a clear outcome.
>>>>>> >> >>>
>>>>>> >> >>> The other points are hard to comment on without being able to
>>>>>> >> >>> see the text in question.
>>>>>> >> >>>
>>>>>> >> >>>
>>>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com>
>>>>>> >> >>> wrote:
>>>>>> >> >>> > I just looked through the entire thread again tonight - there
>>>>>> >> >>> > are a lot of great ideas being discussed. Thanks Cody for
>>>>>> >> >>> > taking the first crack at the proposal.
>>>>>> >> >>> >
>>>>>> >> >>> > I want to first comment on the context. Spark is one of the
>>>>>> >> >>> > most innovative and important projects in (big) data --
>>>>>> >> >>> > overall, the technical decisions made in Apache Spark are
>>>>>> >> >>> > sound. But of course, a project as large and active as Spark
>>>>>> >> >>> > always has room for improvement, and we as a community should
>>>>>> >> >>> > strive to take it to the next level.
>>>>>> >> >>> >
>>>>>> >> >>> > To that end, the two biggest areas for improvement in my
>>>>>> >> >>> > opinion are:
>>>>>> >> >>> >
>>>>>> >> >>> > 1. Visibility: There is so much happening that it is difficult
>>>>>> >> >>> > to know what is really going on. For people that don't follow
>>>>>> >> >>> > closely, it is difficult to know what the important initiatives
>>>>>> >> >>> > are. Even for people that do follow, it is difficult to know
>>>>>> >> >>> > what specific things require their attention, since the number
>>>>>> >> >>> > of pull requests and JIRA tickets is high and it's difficult to
>>>>>> >> >>> > extract signal from noise.
>>>>>> >> >>> >
>>>>>> >> >>> > 2. Solicit user (broadly defined, including developers
>>>>>> >> >>> > themselves) input more proactively: At the end of the day the
>>>>>> >> >>> > project provides value because users use it. Users can't tell
>>>>>> >> >>> > us exactly what to build, but it is important to get their
>>>>>> >> >>> > input.
>>>>>> >> >>> >
>>>>>> >> >>> >
>>>>>> >> >>> > I've taken Cody's doc and edited it:
>>>>>> >> >>> >
>>>>>> >> >>> >
>>>>>> >> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>>>>> >> >>> > (I've made all my modifications trackable)
>>>>>> >> >>> >
>>>>>> >> >>> > There are a couple of high-level changes I made:
>>>>>> >> >>> >
>>>>>> >> >>> > 1. I've consulted a board member and he recommended lazy
>>>>>> >> >>> > consensus as opposed to voting, the reason being that in voting
>>>>>> >> >>> > there can easily be a "loser" that gets outvoted.
>>>>>> >> >>> >
>>>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to
>>>>>> >> >>> > "optional design sketch". Echoing one of the earlier emails:
>>>>>> >> >>> > "IMHO so far aside from tagging things and linking them
>>>>>> >> >>> > elsewhere simply having design docs and prototype
>>>>>> >> >>> > implementations in PRs is not something that has not worked
>>>>>> >> >>> > so far".
>>>>>> >> >>> >
>>>>>> >> >>> > 3. I made some language tweaks to focus more on visibility.
>>>>>> >> >>> > For example, "The purpose of an SIP is to inform and involve",
>>>>>> >> >>> > rather than just "involve". SIPs should also have at least two
>>>>>> >> >>> > emails that go to dev@.
>>>>>> >> >>> >
>>>>>> >> >>> >
>>>>>> >> >>> > While I was editing this, I thought we really needed a
>>>>>> >> >>> > suggested template for design docs too. I will get to that as
>>>>>> >> >>> > well ...
>>>>>> >> >>> >
>>>>>> >> >>> >
>>>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin
>>>>>> >> >>> > <r...@databricks.com> wrote:
>>>>>> >> >>> >>
>>>>>> >> >>> >> Most things looked OK to me too, although I do plan to take a
>>>>>> >> >>> >> closer look after Nov 1st when we cut the release branch for
>>>>>> >> >>> >> 2.1.
>>>>>> >> >>> >>
>>>>>> >> >>> >>
>>>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin
>>>>>> >> >>> >> <van...@cloudera.com>
>>>>>> >> >>> >> wrote:
>>>>>> >> >>> >>>
>>>>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's not
>>>>>> >> >>> >>> explicitly called out, that voting would happen by e-mail? A
>>>>>> >> >>> >>> template for the proposal document (instead of just a bullet
>>>>>> >> >>> >>> list) would also be nice, but that can be done at any time.
>>>>>> >> >>> >>>
>>>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085, which I consider a
>>>>>> >> >>> >>> candidate for a SIP, given the scope of the work. The
>>>>>> >> >>> >>> document attached even somewhat matches the proposed format.
>>>>>> >> >>> >>> So if anyone wants to try out the process...
>>>>>> >> >>> >>>
>>>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger
>>>>>> >> >>> >>> <c...@koeninger.org>
>>>>>> >> >>> >>> wrote:
>>>>>> >> >>> >>> > Now that Spark Summit Europe is over, are any committers
>>>>>> >> >>> >>> > interested in moving forward with this?
>>>>>> >> >>> >>> >
>>>>>> >> >>> >>> >
>>>>>> >> >>> >>> >
>>>>>> >> >>> >>> >
>>>>>> >> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>> >> >>> >>> >
>>>>>> >> >>> >>> > Or are we going to let this discussion die on the vine?
>>>>>> >> >>> >>> >
>>>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>>>>> >> >>> >>> > <tomasz.gaw...@outlook.com> wrote:
>>>>>> >> >>> >>> >> Maybe my mail was not clear enough.
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> I didn't want to write "let's focus on Flink" or any other
>>>>>> >> >>> >>> >> framework. The idea with benchmarks was to show two things:
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> - why some people are doing bad PR for Spark
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> - how, in an easy way, we can change that and show that
>>>>>> >> >>> >>> >> Spark is still on top
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't
>>>>>> >> >>> >>> >> think they're the most important thing in Spark :) On the
>>>>>> >> >>> >>> >> Spark main page there is still the "Spark vs Hadoop"
>>>>>> >> >>> >>> >> chart. It is important to show that the framework is not
>>>>>> >> >>> >>> >> the same old Spark with a different API, but much faster
>>>>>> >> >>> >>> >> and more optimized, comparable to or even faster than
>>>>>> >> >>> >>> >> other frameworks.
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> About real-time streaming: I think it would simply be good
>>>>>> >> >>> >>> >> to see it in Spark. I really like the current Spark model,
>>>>>> >> >>> >>> >> but many voices are saying "we need more" - the community
>>>>>> >> >>> >>> >> should also listen to them and try to help them. With SIPs
>>>>>> >> >>> >>> >> it would be easier; I've just posted this example as a
>>>>>> >> >>> >>> >> "thing that may be changed with a SIP".
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> I really like the unification via Datasets, but there are
>>>>>> >> >>> >>> >> a lot of algorithms inside - let's make an easy API, but
>>>>>> >> >>> >>> >> with strong background material (articles, benchmarks,
>>>>>> >> >>> >>> >> descriptions, etc.) that shows that Spark is still a
>>>>>> >> >>> >>> >> modern framework.
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said, the
>>>>>> >> >>> >>> >> organizational ideas were already mentioned and I agree
>>>>>> >> >>> >>> >> with them; my mail was just to show some aspects from my
>>>>>> >> >>> >>> >> side - the side of a developer and a person who is trying
>>>>>> >> >>> >>> >> to help others with Spark (via StackOverflow or other
>>>>>> >> >>> >>> >> ways).
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> Pozdrawiam / Best regards,
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> Tomasz
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> ________________________________
>>>>>> >> >>> >>> >> From: Cody Koeninger <c...@koeninger.org>
>>>>>> >> >>> >>> >> Sent: October 17, 2016 16:46
>>>>>> >> >>> >>> >> To: Debasish Das
>>>>>> >> >>> >>> >> Cc: Tomasz Gawęda; dev@spark.apache.org
>>>>>> >> >>> >>> >> Subject: Re: Spark Improvement Proposals
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is
>>>>>> >> >>> >>> >> missing my point.
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> My point is evolve or die.  Spark's governance and
>>>>>> >> >>> >>> >> organization are hampering its ability to evolve
>>>>>> >> >>> >>> >> technologically, and that needs to change.
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>>>>> >> >>> >>> >> <debasish.da...@gmail.com>
>>>>>> >> >>> >>> >> wrote:
>>>>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point... I picked up
>>>>>> >> >>> >>> >>> Spark in 2014 as soon as I looked into it, since compared
>>>>>> >> >>> >>> >>> to writing Java map-reduce and Cascading code, Spark made
>>>>>> >> >>> >>> >>> writing distributed code fun... But now, as we have gone
>>>>>> >> >>> >>> >>> deeper with Spark and the real-time streaming use case
>>>>>> >> >>> >>> >>> has become more prominent, I think it is time to bring a
>>>>>> >> >>> >>> >>> messaging model in conjunction with the batch/micro-batch
>>>>>> >> >>> >>> >>> API that Spark is good at... akka-streams' close
>>>>>> >> >>> >>> >>> integration with Spark's micro-batching APIs looks like a
>>>>>> >> >>> >>> >>> great direction to stay in the game with Apache Flink...
>>>>>> >> >>> >>> >>> Spark 2.0 integrated streaming with batch under the
>>>>>> >> >>> >>> >>> assumption that micro-batching is sufficient to run SQL
>>>>>> >> >>> >>> >>> commands on a stream, but do we really have time to do
>>>>>> >> >>> >>> >>> SQL processing on streaming data within 1-2 seconds?
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>> After reading the email chain, I started to look into the
>>>>>> >> >>> >>> >>> Flink documentation, and if you compare it with the Spark
>>>>>> >> >>> >>> >>> documentation, I think we have major work to do detailing
>>>>>> >> >>> >>> >>> Spark internals so that more people from the community
>>>>>> >> >>> >>> >>> take an active role in improving them, so that Spark
>>>>>> >> >>> >>> >>> stays strong compared to Flink.
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>> Spark is no longer an engine that works only for
>>>>>> >> >>> >>> >>> micro-batch and batch... We (and I am sure many others)
>>>>>> >> >>> >>> >>> are pushing Spark as an engine for stream and query
>>>>>> >> >>> >>> >>> processing... We need to make it a state-of-the-art
>>>>>> >> >>> >>> >>> engine for high-speed streaming data and user queries as
>>>>>> >> >>> >>> >>> well!
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>>>>>> >> >>> >>> >>> <tomasz.gaw...@outlook.com>
>>>>>> >> >>> >>> >>> wrote:
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Hi everyone,
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my
>>>>>> >> >>> >>> >>>> suggestions may help a little bit. :) Many technical and
>>>>>> >> >>> >>> >>>> organizational topics were mentioned, but I want to
>>>>>> >> >>> >>> >>>> focus on the negative posts about Spark and about the
>>>>>> >> >>> >>> >>>> "haters".
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> I really like Spark. Ease of use, speed, a very good
>>>>>> >> >>> >>> >>>> community - it's all here. But every project has to
>>>>>> >> >>> >>> >>>> "fight" on the "framework market" to stay number 1. I'm
>>>>>> >> >>> >>> >>>> following many Spark and Big Data communities; maybe my
>>>>>> >> >>> >>> >>>> mail will inspire someone :)
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> You (every Spark developer; so far I haven't had enough
>>>>>> >> >>> >>> >>>> time to start contributing to Spark) have done an
>>>>>> >> >>> >>> >>>> excellent job. So why are some people saying that Flink
>>>>>> >> >>> >>> >>>> (or another framework) is better, like was posted on
>>>>>> >> >>> >>> >>>> this mailing list? Not because that framework is better
>>>>>> >> >>> >>> >>>> in all cases. In my opinion, many of these discussions
>>>>>> >> >>> >>> >>>> were started after Flink marketing-like posts. Please
>>>>>> >> >>> >>> >>>> look at the StackOverflow "Flink vs ...." posts; almost
>>>>>> >> >>> >>> >>>> every post is "won" by Flink. Answers sometimes say
>>>>>> >> >>> >>> >>>> nothing about other frameworks; Flink's users (often PMC
>>>>>> >> >>> >>> >>>> members) just post the same information about real-time
>>>>>> >> >>> >>> >>>> streaming, delta iterations, etc. It looks smart and
>>>>>> >> >>> >>> >>>> very often it is marked as the answer, even if - in my
>>>>>> >> >>> >>> >>>> opinion - it doesn't tell the whole truth.
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> My suggestion: I don't have enough money and knowledge
>>>>>> >> >>> >>> >>>> to perform a huge performance test. Maybe some company
>>>>>> >> >>> >>> >>>> that supports Spark (Databricks, Cloudera? - just
>>>>>> >> >>> >>> >>>> saying, you're the most visible in the community :) )
>>>>>> >> >>> >>> >>>> could perform performance tests of:
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> - streaming engine - probably Spark will lose because of
>>>>>> >> >>> >>> >>>> the mini-batch model; however, currently the difference
>>>>>> >> >>> >>> >>>> should be much smaller than in previous versions
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> - Machine Learning models
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> - batch jobs
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> - Graph jobs
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> - SQL queries
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> People will see that Spark is evolving and is also a
>>>>>> >> >>> >>> >>>> modern framework, because after reading the posts
>>>>>> >> >>> >>> >>>> mentioned above people may think "it is outdated, the
>>>>>> >> >>> >>> >>>> future is in framework X".
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Matei Zaharia posted an excellent blog post about how
>>>>>> >> >>> >>> >>>> Spark Structured Streaming beats every other framework
>>>>>> >> >>> >>> >>>> in terms of ease of use and reliability. Performance
>>>>>> >> >>> >>> >>>> tests, done in various environments (for example: a
>>>>>> >> >>> >>> >>>> laptop, a small 2-node cluster, a 10-node cluster, a
>>>>>> >> >>> >>> >>>> 20-node cluster), could also be very good marketing
>>>>>> >> >>> >>> >>>> material to say "hey, you're claiming you're better, but
>>>>>> >> >>> >>> >>>> Spark is still faster and is still getting even
>>>>>> >> >>> >>> >>>> faster!". This would be based on facts (just numbers),
>>>>>> >> >>> >>> >>>> not opinions. It would be good for companies, for
>>>>>> >> >>> >>> >>>> marketing purposes, and for every Spark developer.
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Second: real-time streaming. I wrote some time ago about
>>>>>> >> >>> >>> >>>> real-time streaming support in Spark Structured
>>>>>> >> >>> >>> >>>> Streaming. Some work should be done to make SSS more
>>>>>> >> >>> >>> >>>> low-latency, but I think it's possible. Maybe Spark
>>>>>> >> >>> >>> >>>> could look at Gearpump, which is also built on top of
>>>>>> >> >>> >>> >>>> Akka? I don't know yet; it is a good topic for a SIP.
>>>>>> >> >>> >>> >>>> However, I think that Spark should have real-time
>>>>>> >> >>> >>> >>>> streaming support. Currently I see many posts/comments
>>>>>> >> >>> >>> >>>> saying that "Spark has too big latency". Spark Streaming
>>>>>> >> >>> >>> >>>> is doing a very good job with micro-batches; however, I
>>>>>> >> >>> >>> >>>> think it is possible to also add more real-time
>>>>>> >> >>> >>> >>>> processing.
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Other people have said much more, and I agree with the
>>>>>> >> >>> >>> >>>> SIP proposal. I'm also happy that the PMC members are
>>>>>> >> >>> >>> >>>> not saying that they will not listen to users; they
>>>>>> >> >>> >>> >>>> really want to make Spark better for every user.
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> What do you think about these two topics? I'm especially
>>>>>> >> >>> >>> >>>> looking at Cody (who started this thread) and the PMC :)
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Pozdrawiam / Best regards,
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Tomasz
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>>
>>>>>> >> >>> >>
>>>>>> >> >>> >
>>>>>> >> >>> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > Ryan Blue
>>>>>> > Software Engineer
>>>>>> > Netflix
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Joseph Bradley
>>>>>
>>>>> Software Engineer - Machine Learning
>>>>>
>>>>> Databricks, Inc.
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
