I'm bumping this one more time for the new year, and then I'm giving up. Please fix your process, even if it isn't exactly the way I suggested.
On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:
> On lazy consensus as opposed to voting:
>
> First, why lazy consensus? The proposal was for consensus, which is at least three +1 votes and no vetoes. Consensus has no losing side; it requires getting to a point where there is agreement. Isn't that agreement what we want to achieve with these proposals?
>
> Second, lazy consensus only removes the requirement for three +1 votes. Why would we not want at least three committers to think something is a good idea before adopting the proposal?
>
> rb
>
> On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> There are some minor things (the "Where" section heading appears to be dropped; wherever this document is posted, it needs to actually link to a JIRA filter showing current/past SIPs), but it doesn't look like I can comment on the Google doc.
>>
>> The major substantive issue I have is that this version is significantly less clear as to the outcome of an SIP.
>>
>> The Apache example of lazy consensus at http://apache.org/foundation/voting.html#LazyConsensus involves an explicit announcement of an explicit deadline, which I think are necessary for clarity.
>>
>> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com> wrote:
>> > It turned out suggested edits (trackable) don't show up for non-owners, so I've just merged all the edits in place. It should be visible now.
>> >
>> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote:
>> >>
>> >> Oops. Let me try to figure that out.
>> >>
>> >> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
>> >>>
>> >>> Thanks for picking up on this.
>> >>>
>> >>> Maybe I fail at Google Docs, but I can't see any edits on the document you linked.
>> >>>
>> >>> Regarding lazy consensus, if the board in general has less of an issue with that, sure.
>> >>> As long as it is clearly announced, lasts at least 72 hours, and has a clear outcome.
>> >>>
>> >>> The other points are hard to comment on without being able to see the text in question.
>> >>>
>> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
>> >>> > I just looked through the entire thread again tonight - there are a lot of great ideas being discussed. Thanks Cody for taking the first crack at the proposal.
>> >>> >
>> >>> > I want to first comment on the context. Spark is one of the most innovative and important projects in (big) data -- overall, the technical decisions made in Apache Spark are sound. But of course, a project as large and active as Spark always has room for improvement, and we as a community should strive to take it to the next level.
>> >>> >
>> >>> > To that end, the two biggest areas for improvement in my opinion are:
>> >>> >
>> >>> > 1. Visibility: There is so much happening that it is difficult to know what really is going on. For people who don't follow closely, it is difficult to know what the important initiatives are. Even for people who do follow, it is difficult to know what specific things require their attention, since the number of pull requests and JIRA tickets is high and it's difficult to extract signal from noise.
>> >>> >
>> >>> > 2. Soliciting user (broadly defined, including developers themselves) input more proactively: At the end of the day the project provides value because users use it. Users can't tell us exactly what to build, but it is important to get their input.
>> >>> >
>> >>> > I've taken Cody's doc and edited it:
>> >>> >
>> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>> >>> > (I've made all my modifications trackable)
>> >>> >
>> >>> > There are a couple of high-level changes I made:
>> >>> >
>> >>> > 1. I've consulted a board member and he recommended lazy consensus as opposed to voting. The reason being that in voting there can easily be a "loser" that gets outvoted.
>> >>> >
>> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional design sketch". Echoing one of the earlier emails: "IMHO so far aside from tagging things and linking them elsewhere simply having design docs and prototype implementations in PRs is not something that has not worked so far".
>> >>> >
>> >>> > 3. I made some language tweaks to focus more on visibility. For example, "The purpose of an SIP is to inform and involve", rather than just "involve". SIPs should also have at least two emails that go to dev@.
>> >>> >
>> >>> > While I was editing this, I thought we really needed a suggested template for the design doc too. I will get to that as well ...
>> >>> >
>> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:
>> >>> >>
>> >>> >> Most things looked OK to me too, although I do plan to take a closer look after Nov 1st when we cut the release branch for 2.1.
>> >>> >>
>> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>> >>> >>>
>> >>> >>> The proposal looks OK to me. I assume, even though it's not explicitly called out, that voting would happen by e-mail?
>> >>> >>> A template for the proposal document (instead of just a bullet list) would also be nice, but that can be done at any time.
>> >>> >>>
>> >>> >>> BTW, shameless plug: I filed SPARK-18085, which I consider a candidate for a SIP, given the scope of the work. The document attached even somewhat matches the proposed format. So if anyone wants to try out the process...
>> >>> >>>
>> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
>> >>> >>> > Now that Spark Summit Europe is over, are any committers interested in moving forward with this?
>> >>> >>> >
>> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >>> >>> >
>> >>> >>> > Or are we going to let this discussion die on the vine?
>> >>> >>> >
>> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
>> >>> >>> >> Maybe my mail was not clear enough.
>> >>> >>> >>
>> >>> >>> >> I didn't want to write "let's focus on Flink" or any other framework. The idea with the benchmarks was to show two things:
>> >>> >>> >>
>> >>> >>> >> - why some people are doing bad PR for Spark
>> >>> >>> >>
>> >>> >>> >> - how - in an easy way - we can change that and show that Spark is still on top
>> >>> >>> >>
>> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think they're the most important thing in Spark :) On the Spark main page there is still the chart "Spark vs Hadoop".
>> >>> >>> >> It is important to show that the framework is not just the same Spark with another API, but much faster and more optimized, comparable to or even faster than other frameworks.
>> >>> >>> >>
>> >>> >>> >> About real-time streaming: I think it would simply be good to see it in Spark. I really like the current Spark model, but there are many voices saying "we need more" - the community should also listen to them and try to help them. With SIPs that would be easier; I've just posted this example as a "thing that may be changed with a SIP".
>> >>> >>> >>
>> >>> >>> >> I really like the unification via Datasets, but there are a lot of algorithms inside - let's make an easy API, but with strong background material (articles, benchmarks, descriptions, etc.) that shows that Spark is still a modern framework.
>> >>> >>> >>
>> >>> >>> >> Maybe now my intention will be clearer :) As I said, organizational ideas were already mentioned and I agree with them; my mail was just to show some aspects from my side, that is, from the side of a developer and a person who is trying to help others with Spark (via StackOverflow or other ways).
>> >>> >>> >>
>> >>> >>> >> Pozdrawiam / Best regards,
>> >>> >>> >>
>> >>> >>> >> Tomasz
>> >>> >>> >>
>> >>> >>> >> ________________________________
>> >>> >>> >> From: Cody Koeninger <c...@koeninger.org>
>> >>> >>> >> Sent: 17 October 2016 16:46
>> >>> >>> >> To: Debasish Das
>> >>> >>> >> CC: Tomasz Gawęda; dev@spark.apache.org
>> >>> >>> >> Subject: Re: Spark Improvement Proposals
>> >>> >>> >>
>> >>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my point.
>> >>> >>> >>
>> >>> >>> >> My point is evolve or die. Spark's governance and organization is hampering its ability to evolve technologically, and it needs to change.
>> >>> >>> >>
>> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>> >>> >>> >>> Thanks Cody for bringing up a valid point... I picked up Spark in 2014 as soon as I looked into it, since compared to writing Java map-reduce and Cascading code, Spark made writing distributed code fun... But now, as we go deeper with Spark and the real-time streaming use case gets more prominent, I think it is time to bring a messaging model in conjunction with the batch/micro-batch API that Spark is good at... akka-streams' close integration with Spark's micro-batching APIs looks like a great direction to stay in the game with Apache Flink... Spark 2.0 integrated streaming with batch under the assumption that micro-batching is sufficient to run SQL commands on a stream, but do we really have time to do SQL processing on streaming data within 1-2 seconds?
>> >>> >>> >>>
>> >>> >>> >>> After reading the email chain, I started to look into the Flink documentation, and if you compare it with the Spark documentation, I think we have major work to do detailing out Spark internals, so that more people from the community start to take an active role in improving the issues and Spark stays strong compared to Flink.
>> >>> >>> >>>
>> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>> >>> >>> >>>
>> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>> >>> >>> >>>
>> >>> >>> >>> Spark is no longer an engine that works only for micro-batch and batch... We (and I am sure many others) are pushing Spark as an engine for stream and query processing... we need to make it a state-of-the-art engine for high-speed streaming data and user queries as well!
>> >>> >>> >>>
>> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
>> >>> >>> >>>>
>> >>> >>> >>>> Hi everyone,
>> >>> >>> >>>>
>> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions may help a little bit. :) Many technical and organizational topics were mentioned, but I want to focus on the negative posts about Spark and about "haters".
>> >>> >>> >>>>
>> >>> >>> >>>> I really like Spark. Ease of use, speed, a very good community - it's all here. But every project has to fight on the "framework market" to stay number 1. I'm following many Spark and Big Data communities; maybe my mail will inspire someone :)
>> >>> >>> >>>>
>> >>> >>> >>>> You (every Spark developer; so far I haven't had enough time to join in contributing to Spark) have done an excellent job.
>> >>> >>> >>>> So why are some people saying that Flink (or another framework) is better, as was posted on this mailing list? No, not because that framework is better in all cases. In my opinion, many of these discussions were started after Flink's marketing-like posts. Please look at the StackOverflow "Flink vs ..." posts; almost every one is "won" by Flink. The answers sometimes say nothing about other frameworks; Flink's users (often PMC members) just post the same information about real-time streaming, delta iterations, etc. It looks smart, and very often it is marked as the answer, even if - in my opinion - not the whole truth was told.
>> >>> >>> >>>>
>> >>> >>> >>>> My suggestion: I don't have enough money and knowledge to perform a huge performance test. Maybe some company that supports Spark (Databricks, Cloudera?
>> >>> >>> >>>> - just saying you're the most visible in the community :) ) could perform a performance test of:
>> >>> >>> >>>>
>> >>> >>> >>>> - the streaming engine - Spark will probably lose because of the mini-batch model, although the difference should now be much lower than in previous versions
>> >>> >>> >>>>
>> >>> >>> >>>> - machine learning models
>> >>> >>> >>>>
>> >>> >>> >>>> - batch jobs
>> >>> >>> >>>>
>> >>> >>> >>>> - graph jobs
>> >>> >>> >>>>
>> >>> >>> >>>> - SQL queries
>> >>> >>> >>>>
>> >>> >>> >>>> People will see that Spark is evolving and is also a modern framework, because after reading the posts mentioned above people may think "it is outdated, the future is in framework X".
>> >>> >>> >>>>
>> >>> >>> >>>> Matei Zaharia posted an excellent blog post about how Spark Structured Streaming beats every other framework in terms of ease of use and reliability. Performance tests, done in various environments (for example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node cluster), could also be very good marketing material to say "hey, you're telling us you're better, but Spark is still faster and is still getting even faster!". This would be based on facts (just numbers), not opinions. It would be good for companies, for marketing purposes, and for every Spark developer.
>> >>> >>> >>>>
>> >>> >>> >>>> Second: real-time streaming. Some time ago I wrote about real-time streaming support in Spark Structured Streaming.
>> >>> >>> >>>> Some work should be done to make SSS more low-latency, but I think it's possible. Maybe Spark could look at Gearpump, which is also built on top of Akka? I don't know yet; it is a good topic for a SIP. However, I think that Spark should have real-time streaming support. Currently I see many posts/comments saying that "Spark has too high a latency". Spark Streaming is doing a very good job with micro-batches, but I think it is possible to also add more real-time processing.
>> >>> >>> >>>>
>> >>> >>> >>>> Other people have said much more, and I agree with the SIP proposal. I'm also happy that the PMC members are not saying that they will not listen to users; they really want to make Spark better for every user.
>> >>> >>> >>>>
>> >>> >>> >>>> What do you think about these two topics? I'm especially looking at Cody (who started this topic) and the PMC :)
>> >>> >>> >>>>
>> >>> >>> >>>> Pozdrawiam / Best regards,
>> >>> >>> >>>>
>> >>> >>> >>>> Tomasz
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
> Ryan Blue
> Software Engineer
> Netflix