It turned out suggested edits (trackable) don't show up for non-owners, so I've just merged all the edits in place. It should be visible now.
On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote:
> Oops. Let me try to figure that out.
>
> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
>> Thanks for picking up on this.
>>
>> Maybe I fail at Google Docs, but I can't see any edits on the document
>> you linked.
>>
>> Regarding lazy consensus, if the board in general has less of an issue
>> with that, sure. As long as it is clearly announced, lasts at least
>> 72 hours, and has a clear outcome.
>>
>> The other points are hard to comment on without being able to see the
>> text in question.
>>
>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
>>> I just looked through the entire thread again tonight - there are a lot
>>> of great ideas being discussed. Thanks Cody for taking the first crack
>>> at the proposal.
>>>
>>> I want to first comment on the context. Spark is one of the most
>>> innovative and important projects in (big) data -- overall, the
>>> technical decisions made in Apache Spark are sound. But of course, a
>>> project as large and active as Spark always has room for improvement,
>>> and we as a community should strive to take it to the next level.
>>>
>>> To that end, the two biggest areas for improvement in my opinion are:
>>>
>>> 1. Visibility: There is so much happening that it is difficult to know
>>> what is really going on. For people who don't follow closely, it is
>>> difficult to know what the important initiatives are. Even for people
>>> who do follow, it is difficult to know what specific things require
>>> their attention, since the number of pull requests and JIRA tickets is
>>> high and it's difficult to extract signal from noise.
>>>
>>> 2. Soliciting user (broadly defined, including developers themselves)
>>> input more proactively: At the end of the day, the project provides
>>> value because users use it.
>>> Users can't tell us exactly what to build, but it is important to get
>>> their input.
>>>
>>> I've taken Cody's doc and edited it:
>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>> (I've made all my modifications trackable.)
>>>
>>> There are a couple of high-level changes I made:
>>>
>>> 1. I've consulted a board member and he recommended lazy consensus as
>>> opposed to voting, the reason being that in voting there can easily be
>>> a "loser" that gets outvoted.
>>>
>>> 2. I made it lighter weight, and renamed "strategy" to "optional design
>>> sketch". Echoing one of the earlier emails: "IMHO so far aside from
>>> tagging things and linking them elsewhere simply having design docs and
>>> prototypes implementations in PRs is not something that has not worked
>>> so far".
>>>
>>> 3. I made some language tweaks to focus more on visibility. For
>>> example, "The purpose of an SIP is to inform and involve", rather than
>>> just "involve". SIPs should also have at least two emails that go to
>>> dev@.
>>>
>>> While I was editing this, I thought we really needed a suggested
>>> template for the design doc too. I will get to that as well...
>>>
>>> On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:
>>>> Most things looked OK to me too, although I do plan to take a closer
>>>> look after Nov 1st, when we cut the release branch for 2.1.
>>>>
>>>> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>>>> The proposal looks OK to me. I assume, even though it's not
>>>>> explicitly called out, that voting would happen by e-mail? A template
>>>>> for the proposal document (instead of just a bullet list) would also
>>>>> be nice, but that can be done at any time.
>>>>> BTW, shameless plug: I filed SPARK-18085, which I consider a
>>>>> candidate for a SIP, given the scope of the work. The attached
>>>>> document even somewhat matches the proposed format. So if anyone
>>>>> wants to try out the process...
>>>>>
>>>>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>> Now that Spark Summit Europe is over, are any committers interested
>>>>>> in moving forward with this?
>>>>>>
>>>>>> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>>
>>>>>> Or are we going to let this discussion die on the vine?
>>>>>>
>>>>>> On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>>>>> <tomasz.gaw...@outlook.com> wrote:
>>>>>>> Maybe my mail was not clear enough.
>>>>>>>
>>>>>>> I didn't want to write "let's focus on Flink" or any other
>>>>>>> framework. The idea with the benchmarks was to show two things:
>>>>>>>
>>>>>>> - why some people are doing bad PR for Spark
>>>>>>>
>>>>>>> - how, in an easy way, we can change that and show that Spark is
>>>>>>> still on top
>>>>>>>
>>>>>>> No more, no less. Benchmarks would be helpful, but I don't think
>>>>>>> they're the most important thing in Spark :) On the Spark main page
>>>>>>> there is still the "Spark vs Hadoop" chart. It is important to show
>>>>>>> that the framework is not just the same Spark with another API, but
>>>>>>> much faster and more optimized, comparable to or even faster than
>>>>>>> other frameworks.
>>>>>>>
>>>>>>> About real-time streaming: I think it would simply be good to see
>>>>>>> it in Spark. I really like the current Spark model, but there are
>>>>>>> many voices saying "we need more" - the community should also
>>>>>>> listen to them and try to help them.
>>>>>>> With SIPs it would be easier; I just posted this example as a
>>>>>>> "thing that may be changed with a SIP".
>>>>>>>
>>>>>>> I really like the unification via Datasets, but there are a lot of
>>>>>>> algorithms inside - let's make an easy API, but with a strong
>>>>>>> background (articles, benchmarks, descriptions, etc.) that shows
>>>>>>> that Spark is still a modern framework.
>>>>>>>
>>>>>>> Maybe now my intention is clearer :) As I said, the organizational
>>>>>>> ideas were already mentioned and I agree with them; my mail was
>>>>>>> just to show some aspects from my side, that is, from the side of a
>>>>>>> developer and a person who is trying to help others with Spark (via
>>>>>>> StackOverflow or other ways).
>>>>>>>
>>>>>>> Pozdrawiam / Best regards,
>>>>>>>
>>>>>>> Tomasz
>>>>>>>
>>>>>>> ________________________________
>>>>>>> From: Cody Koeninger <c...@koeninger.org>
>>>>>>> Sent: October 17, 2016, 16:46
>>>>>>> To: Debasish Das
>>>>>>> Cc: Tomasz Gawęda; dev@spark.apache.org
>>>>>>> Subject: Re: Spark Improvement Proposals
>>>>>>>
>>>>>>> I think narrowly focusing on Flink or benchmarks is missing my
>>>>>>> point.
>>>>>>>
>>>>>>> My point is evolve or die. Spark's governance and organization is
>>>>>>> hampering its ability to evolve technologically, and it needs to
>>>>>>> change.
>>>>>>> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>>>>>> <debasish.da...@gmail.com> wrote:
>>>>>>>> Thanks Cody for bringing up a valid point... I picked up Spark in
>>>>>>>> 2014 as soon as I looked into it, since compared to writing Java
>>>>>>>> map-reduce and Cascading code, Spark made writing distributed code
>>>>>>>> fun... But now, as we go deeper with Spark and the real-time
>>>>>>>> streaming use case gets more prominent, I think it is time to
>>>>>>>> bring a messaging model in conjunction with the batch/micro-batch
>>>>>>>> API that Spark is good at... Close integration of akka-streams
>>>>>>>> with Spark's micro-batching APIs looks like a great direction to
>>>>>>>> stay in the game with Apache Flink... Spark 2.0 integrated
>>>>>>>> streaming with batch under the assumption that micro-batching is
>>>>>>>> sufficient to run SQL commands on a stream, but do we really have
>>>>>>>> time to do SQL processing on streaming data within 1-2 seconds?
>>>>>>>>
>>>>>>>> After reading the email chain, I started to look into the Flink
>>>>>>>> documentation, and if you compare it with the Spark documentation,
>>>>>>>> I think we have major work to do detailing Spark internals so that
>>>>>>>> more people from the community start to take an active role in
>>>>>>>> improving the issues, so that Spark stays strong compared to
>>>>>>>> Flink.
>>>>>>>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>>>>>>>
>>>>>>>> Spark is no longer an engine that works only for micro-batch and
>>>>>>>> batch... We (and I am sure many others) are pushing Spark as an
>>>>>>>> engine for stream and query processing... We need to make it a
>>>>>>>> state-of-the-art engine for high-speed streaming data and user
>>>>>>>> queries as well!
>>>>>>>>
>>>>>>>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>>>>>>>> <tomasz.gaw...@outlook.com> wrote:
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I'm quite late with my answer, but I think my suggestions may
>>>>>>>>> help a little bit. :) Many technical and organizational topics
>>>>>>>>> were mentioned, but I want to focus on the negative posts about
>>>>>>>>> Spark and about "haters".
>>>>>>>>>
>>>>>>>>> I really like Spark. Ease of use, speed, a very good community -
>>>>>>>>> it's all here. But every project has to "fight" on the "framework
>>>>>>>>> market" to stay no. 1. I follow many Spark and Big Data
>>>>>>>>> communities; maybe my mail will inspire someone :)
>>>>>>>>>
>>>>>>>>> You (every Spark developer; so far I haven't had enough time to
>>>>>>>>> start contributing to Spark) have done an excellent job. So why
>>>>>>>>> are some people saying that Flink (or another framework) is
>>>>>>>>> better, like it was posted on this mailing list? No, not because
>>>>>>>>> that framework is better in all cases. In my opinion, many of
>>>>>>>>> these discussions were started after Flink marketing-like posts.
>>>>>>>>> Please look at the StackOverflow "Flink vs ..." posts: almost
>>>>>>>>> every one is "won" by Flink.
>>>>>>>>> The answers sometimes say nothing about other frameworks; Flink's
>>>>>>>>> users (often PMC members) just post the same information about
>>>>>>>>> real-time streaming, delta iterations, etc. It looks smart, and
>>>>>>>>> very often it is marked as the answer, even if - in my opinion -
>>>>>>>>> the whole truth wasn't told.
>>>>>>>>>
>>>>>>>>> My suggestion: I don't have enough money or knowledge to perform
>>>>>>>>> a huge performance test. Maybe some company that supports Spark
>>>>>>>>> (Databricks, Cloudera? - just saying, you're the most visible in
>>>>>>>>> the community :) ) could run performance tests of:
>>>>>>>>>
>>>>>>>>> - the streaming engine - Spark will probably lose because of the
>>>>>>>>> micro-batch model; however, currently the difference should be
>>>>>>>>> much lower than in previous versions
>>>>>>>>>
>>>>>>>>> - Machine Learning models
>>>>>>>>>
>>>>>>>>> - batch jobs
>>>>>>>>>
>>>>>>>>> - graph jobs
>>>>>>>>>
>>>>>>>>> - SQL queries
>>>>>>>>>
>>>>>>>>> People would see that Spark is evolving and is also a modern
>>>>>>>>> framework, because after reading the posts mentioned above,
>>>>>>>>> people may think "it is outdated, the future is in framework X".
>>>>>>>>>
>>>>>>>>> Matei Zaharia posted an excellent blog post about how Spark
>>>>>>>>> Structured Streaming beats every other framework in terms of ease
>>>>>>>>> of use and reliability. Performance tests, done in various
>>>>>>>>> environments (for example: a laptop, a small 2-node cluster, a
>>>>>>>>> 10-node cluster, a 20-node cluster), could also be very good
>>>>>>>>> marketing material to say "hey, you say you're better, but Spark
>>>>>>>>> is still faster and is still getting faster!". This would be
>>>>>>>>> based on facts (just numbers), not opinions.
>>>>>>>>> It would be good for companies, for marketing purposes, and for
>>>>>>>>> every Spark developer.
>>>>>>>>>
>>>>>>>>> Second: real-time streaming. I wrote some time ago about
>>>>>>>>> real-time streaming support in Spark Structured Streaming. Some
>>>>>>>>> work would have to be done to make SSS lower-latency, but I think
>>>>>>>>> it's possible. Maybe Spark could look at Gearpump, which is also
>>>>>>>>> built on top of Akka? I don't know yet; it is a good topic for a
>>>>>>>>> SIP. However, I think that Spark should have real-time streaming
>>>>>>>>> support. Currently I see many posts/comments saying "Spark has
>>>>>>>>> too much latency". Spark Streaming is doing a very good job with
>>>>>>>>> micro-batches, but I think it is possible to also add more
>>>>>>>>> real-time processing.
>>>>>>>>>
>>>>>>>>> Other people have said much more, and I agree with the SIP
>>>>>>>>> proposal. I'm also happy that the PMC members are not saying that
>>>>>>>>> they will not listen to users, but that they really want to make
>>>>>>>>> Spark better for every user.
>>>>>>>>>
>>>>>>>>> What do you think about these two topics? I'm especially looking
>>>>>>>>> at Cody (who started this topic) and the PMC members :)
>>>>>>>>>
>>>>>>>>> Pozdrawiam / Best regards,
>>>>>>>>>
>>>>>>>>> Tomasz