Oops. Let me try to figure that out.

On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
> Thanks for picking up on this.
>
> Maybe I fail at google docs, but I can't see any edits on the document
> you linked.
>
> Regarding lazy consensus, if the board in general has less of an issue
> with that, sure. As long as it is clearly announced, lasts at least
> 72 hours, and has a clear outcome.
>
> The other points are hard to comment on without being able to see the
> text in question.
>
>
> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
> > I just looked through the entire thread again tonight - there are a
> > lot of great ideas being discussed. Thanks Cody for taking the first
> > crack at the proposal.
> >
> > I want to first comment on the context. Spark is one of the most
> > innovative and important projects in (big) data -- overall, the
> > technical decisions made in Apache Spark are sound. But of course, a
> > project as large and active as Spark always has room for improvement,
> > and we as a community should strive to take it to the next level.
> >
> > To that end, the two biggest areas for improvement, in my opinion, are:
> >
> > 1. Visibility: There is so much happening that it is difficult to know
> > what really is going on. For people who don't follow closely, it is
> > difficult to know what the important initiatives are. Even for people
> > who do follow, it is difficult to know what specific things require
> > their attention, since the number of pull requests and JIRA tickets is
> > high and it's difficult to extract signal from noise.
> >
> > 2. Solicit user (broadly defined, including developers themselves)
> > input more proactively: At the end of the day the project provides
> > value because users use it. Users can't tell us exactly what to build,
> > but it is important to get their input.
> >
> >
> > I've taken Cody's doc and edited it:
> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
> > (I've made all my modifications trackable)
> >
> > There are a couple of high-level changes I made:
> >
> > 1. I've consulted a board member and he recommended lazy consensus as
> > opposed to voting. The reason being that in voting there can easily be
> > a "loser" that gets outvoted.
> >
> > 2. I made it lighter weight, and renamed "strategy" to "optional
> > design sketch". Echoing one of the earlier emails: "IMHO so far aside
> > from tagging things and linking them elsewhere simply having design
> > docs and prototype implementations in PRs is not something that has
> > not worked so far".
> >
> > 3. I made some language tweaks to focus more on visibility. For
> > example, "The purpose of an SIP is to inform and involve", rather than
> > just "involve". SIPs should also have at least two emails that go to
> > dev@.
> >
> >
> > While I was editing this, I thought we really needed a suggested
> > template for design docs too. I will get to that too ...
> >
> >
> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:
> >>
> >> Most things looked OK to me too, although I do plan to take a closer
> >> look after Nov 1st when we cut the release branch for 2.1.
> >>
> >>
> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com>
> >> wrote:
> >>>
> >>> The proposal looks OK to me. I assume, even though it's not
> >>> explicitly called out, that voting would happen by e-mail? A template
> >>> for the proposal document (instead of just a bullet list) would also
> >>> be nice, but that can be done at any time.
> >>>
> >>> BTW, shameless plug: I filed SPARK-18085, which I consider a
> >>> candidate for a SIP, given the scope of the work. The document
> >>> attached even somewhat matches the proposed format.
> >>> So if anyone wants to try out the process...
> >>>
> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org>
> >>> wrote:
> >>> > Now that Spark Summit Europe is over, are any committers interested
> >>> > in moving forward with this?
> >>> >
> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
> >>> >
> >>> > Or are we going to let this discussion die on the vine?
> >>> >
> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
> >>> > <tomasz.gaw...@outlook.com> wrote:
> >>> >> Maybe my mail was not clear enough.
> >>> >>
> >>> >> I didn't want to write "let's focus on Flink" or any other
> >>> >> framework. The idea with the benchmarks was to show two things:
> >>> >>
> >>> >> - why some people are doing bad PR for Spark
> >>> >>
> >>> >> - how, in an easy way, we can change that and show that Spark is
> >>> >> still on top
> >>> >>
> >>> >> No more, no less. Benchmarks will be helpful, but I don't think
> >>> >> they're the most important thing in Spark :) On the Spark main
> >>> >> page there is still the "Spark vs Hadoop" chart. It is important
> >>> >> to show that the framework is not the same Spark with another API,
> >>> >> but much faster and more optimized, comparable to or even faster
> >>> >> than other frameworks.
> >>> >>
> >>> >> About real-time streaming: I think it would be good to see it in
> >>> >> Spark. I really like the current Spark model, but many voices say
> >>> >> "we need more" - the community should also listen to them and try
> >>> >> to help them. With SIPs it would be easier; I've just posted this
> >>> >> example as a "thing that may be changed with a SIP".
> >>> >>
> >>> >> I really like the unification via Datasets, but there are a lot
> >>> >> of algorithms inside - let's make an easy API, but with a strong
> >>> >> background (articles, benchmarks, descriptions, etc.) that shows
> >>> >> that Spark is still a modern framework.
> >>> >>
> >>> >> Maybe now my intention will be clearer :) As I said, the
> >>> >> organizational ideas were already mentioned and I agree with
> >>> >> them; my mail was just to show some aspects from my side - the
> >>> >> side of a developer and a person who is trying to help others
> >>> >> with Spark (via StackOverflow or other ways).
> >>> >>
> >>> >> Pozdrawiam / Best regards,
> >>> >>
> >>> >> Tomasz
> >>> >>
> >>> >> ________________________________
> >>> >> From: Cody Koeninger <c...@koeninger.org>
> >>> >> Sent: 17 October 2016 16:46
> >>> >> To: Debasish Das
> >>> >> Cc: Tomasz Gawęda; dev@spark.apache.org
> >>> >> Subject: Re: Spark Improvement Proposals
> >>> >>
> >>> >> I think narrowly focusing on Flink or benchmarks is missing my
> >>> >> point.
> >>> >>
> >>> >> My point is evolve or die. Spark's governance and organization is
> >>> >> hampering its ability to evolve technologically, and it needs to
> >>> >> change.
> >>> >>
> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
> >>> >> <debasish.da...@gmail.com> wrote:
> >>> >>> Thanks Cody for bringing up a valid point... I picked up Spark
> >>> >>> in 2014 as soon as I looked into it, since compared to writing
> >>> >>> Java map-reduce and Cascading code, Spark made writing
> >>> >>> distributed code fun... But now, as we go deeper with Spark and
> >>> >>> the real-time streaming use case gets more prominent, I think it
> >>> >>> is time to bring a messaging model in conjunction with the
> >>> >>> batch/micro-batch API that Spark is good at... A close
> >>> >>> akka-streams integration with Spark's micro-batching APIs looks
> >>> >>> like a great direction to stay in the game with Apache Flink...
> >>> >>> Spark 2.0 integrated streaming with batch under the assumption
> >>> >>> that micro-batching is sufficient to run SQL commands on a
> >>> >>> stream, but do we really have time to do SQL processing on
> >>> >>> streaming data within 1-2 seconds?
> >>> >>>
> >>> >>> After reading the email chain, I started to look into the Flink
> >>> >>> documentation, and if you compare it with the Spark
> >>> >>> documentation, I think we have major work to do detailing out
> >>> >>> Spark internals so that more people from the community start to
> >>> >>> take an active role in improving the issues, so that Spark stays
> >>> >>> strong compared to Flink.
> >>> >>>
> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
> >>> >>>
> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
> >>> >>>
> >>> >>> Spark is no longer an engine that works only for micro-batch and
> >>> >>> batch... We (and I am sure many others) are pushing Spark as an
> >>> >>> engine for stream and query processing... We need to make it a
> >>> >>> state-of-the-art engine for high-speed streaming data and user
> >>> >>> queries as well!
> >>> >>>
> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
> >>> >>> <tomasz.gaw...@outlook.com> wrote:
> >>> >>>>
> >>> >>>> Hi everyone,
> >>> >>>>
> >>> >>>> I'm quite late with my answer, but I think my suggestions may
> >>> >>>> help a little bit. :) Many technical and organizational topics
> >>> >>>> were mentioned, but I want to focus on the negative posts about
> >>> >>>> Spark and about "haters".
> >>> >>>>
> >>> >>>> I really like Spark. Ease of use, speed, a very good community -
> >>> >>>> it's all here. But every project has to "fight" on the
> >>> >>>> "framework market" to stay number one. I'm following many Spark
> >>> >>>> and Big Data communities; maybe my mail will inspire someone :)
> >>> >>>>
> >>> >>>> You (every Spark developer; so far I didn't have enough time to
> >>> >>>> start contributing to Spark) have done an excellent job. So why
> >>> >>>> are some people saying that Flink (or another framework) is
> >>> >>>> better, as was posted on this mailing list? No, not because that
> >>> >>>> framework is better in all cases. In my opinion, many of these
> >>> >>>> discussions were started after Flink marketing-like posts.
> >>> >>>> Please look at the StackOverflow "Flink vs ..." posts; almost
> >>> >>>> every post is "won" by Flink.
> >>> >>>> The answers sometimes say nothing about other frameworks;
> >>> >>>> Flink's users (often PMC members) just post the same information
> >>> >>>> about real-time streaming, delta iterations, etc. It looks smart
> >>> >>>> and is very often marked as the answer, even if - in my opinion
> >>> >>>> - it doesn't tell the whole truth.
> >>> >>>>
> >>> >>>> My suggestion: I don't have enough money and knowledge to
> >>> >>>> perform huge performance tests. Maybe some company that supports
> >>> >>>> Spark (Databricks, Cloudera? - just saying, you're the most
> >>> >>>> visible in the community :) ) could perform performance tests
> >>> >>>> of:
> >>> >>>>
> >>> >>>> - the streaming engine - Spark will probably lose because of the
> >>> >>>> mini-batch model; however, the difference should currently be
> >>> >>>> much lower than in previous versions
> >>> >>>>
> >>> >>>> - Machine Learning models
> >>> >>>>
> >>> >>>> - batch jobs
> >>> >>>>
> >>> >>>> - graph jobs
> >>> >>>>
> >>> >>>> - SQL queries
> >>> >>>>
> >>> >>>> People will see that Spark is evolving and is also a modern
> >>> >>>> framework, because after reading the posts mentioned above
> >>> >>>> people may think "it is outdated, the future is in framework X".
> >>> >>>>
> >>> >>>> Matei Zaharia posted an excellent blog post about how Spark
> >>> >>>> Structured Streaming beats every other framework in terms of
> >>> >>>> ease of use and reliability. Performance tests, done in various
> >>> >>>> environments (for example: a laptop, a small 2-node cluster, a
> >>> >>>> 10-node cluster, a 20-node cluster), could also be very good
> >>> >>>> marketing material to say "hey, you're telling us you're better,
> >>> >>>> but Spark is still faster and is still getting even faster!".
> >>> >>>> This would be based on facts (just numbers), not opinions.
> >>> >>>> It would be good for companies, for marketing purposes, and for
> >>> >>>> every Spark developer.
> >>> >>>>
> >>> >>>> Second: real-time streaming. I've written some time ago about
> >>> >>>> real-time streaming support in Spark Structured Streaming. Some
> >>> >>>> work should be done to make SSS more low-latency, but I think
> >>> >>>> it's possible. Maybe Spark could look at Gearpump, which is also
> >>> >>>> built on top of Akka? I don't know yet; it is a good topic for a
> >>> >>>> SIP. However, I think that Spark should have real-time streaming
> >>> >>>> support. Currently I see many posts/comments saying "Spark has
> >>> >>>> too big a latency". Spark Streaming is doing a very good job
> >>> >>>> with micro-batches, but I think it is possible to also add more
> >>> >>>> real-time processing.
> >>> >>>>
> >>> >>>> Other people have said much more, and I agree with the SIP
> >>> >>>> proposal. I'm also happy that the PMC members are not saying
> >>> >>>> that they will not listen to users, but that they really want to
> >>> >>>> make Spark better for every user.
> >>> >>>>
> >>> >>>> What do you think about these two topics? I'm especially looking
> >>> >>>> at Cody (who started this topic) and the PMC :)
> >>> >>>>
> >>> >>>> Pozdrawiam / Best regards,
> >>> >>>>
> >>> >>>> Tomasz
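The latency concern raised in the thread (can SQL over a stream really complete within 1-2 seconds under micro-batching?) can be sketched with a toy simulator. This is plain Python, not Spark code, and all names in it are illustrative; it only demonstrates the arithmetic behind the objection: an event's worst-case latency is roughly one trigger interval plus the batch's processing time.

```python
from collections import deque

def micro_batch_latencies(events, interval, proc_time):
    """Toy micro-batch simulator (illustrative only, not Spark code).

    events: list of (arrival_time, key) pairs, sorted by arrival_time.
    interval: trigger interval in seconds (e.g. a 1 s micro-batch trigger).
    proc_time: fixed processing time per non-empty batch, in seconds.

    Returns {key: completion_time - arrival_time} for every event.
    """
    latencies = {}
    pending = deque(events)
    clock = 0.0
    while pending:
        clock += interval  # wait for the next trigger to fire
        batch = []
        while pending and pending[0][0] <= clock:
            batch.append(pending.popleft())  # everything that has arrived
        if batch:
            clock += proc_time  # the batch finishes after proc_time
            for arrival, key in batch:
                latencies[key] = clock - arrival
    return latencies

# An event arriving just after a trigger waits almost a full interval,
# plus the batch's processing time: worst case ~ interval + proc_time.
lat = micro_batch_latencies([(0.05, "a"), (0.95, "b")],
                            interval=1.0, proc_time=0.2)
# "a" waited ~1.15 s, "b" only ~0.25 s - arrival phase dominates.
```

With a 1-second trigger and 0.2 s of per-batch work, worst-case latency is already ~1.2 s, which is why "SQL on streaming data within 1-2 seconds" is a tight budget for the micro-batch model rather than an impossibility.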