During the summit, I also had many discussions on similar topics with multiple committers and active users, and I heard many fantastic ideas. I believe Spark improvement proposals are a good channel for collecting requirements and designs.
IMO, we also need to consider priorities when working on these items. Even if a proposal is accepted, that does not mean it will be implemented and merged immediately. It is not a FIFO queue. Even after some PRs are merged, we sometimes still have to revert them if the design and implementation were not reviewed carefully. We have to ensure quality. Spark is not application software. It is infrastructure software that is used by many, many companies. We have to be very careful in the design and implementation, especially when adding or changing external APIs. When I developed mainframe infrastructure/middleware software over the past 6 years, I was involved in discussions with external and internal customers. The to-do feature list always had more than 100 items. Sometimes the customers felt frustrated when we were unable to deliver on time due to resource limits and other constraints. Even when they paid us billions, we still had to proceed phase by phase, or sometimes they had to accept workarounds. That is the reality everyone has to face, I think. Thanks, Xiao Li 2017-02-11 7:57 GMT-08:00 Cody Koeninger <c...@koeninger.org>: > At the Spark Summit this week, everyone from PMC members to users I had > never met before was asking me about the Spark improvement proposals > idea. It's clear that it's a real community need. > > But it's been almost half a year, and nothing visible has been done. > > Reynold, are you going to do this? > > If so, when? > > If not, why? > > You already did the right thing by including long-deserved committers. > Please keep doing the right thing for the community. > > On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <r...@databricks.com> wrote: > >> +1 on all counts (consensus, time bound, define roles) >> >> I can update the doc in the next few days and share it back. Then maybe we >> can just officially vote on this. As Tim suggested, we might not get it >> 100% right the first time and would need to iterate. But that's fine. >> >> >> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <timhun...@databricks.com> >> wrote: >> >>> Hi Cody, >>> thank you for bringing up this topic; I agree it is very important to >>> keep a cohesive community around some common, fluid goals. Here are a few >>> comments about the current document: >>> >>> 1. name: it should not overlap with an existing one such as SIP. Can you >>> imagine someone trying to discuss a Scala spore proposal for Spark? >>> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP >>> sounds great. >>> >>> 2. roles: at a high level, SPIPs are meant to reach consensus for >>> technical decisions with a lasting impact. As such, the template should >>> emphasize the role of the various parties during this process: >>> >>> - the SPIP author is responsible for building consensus. She is the >>> champion driving the process forward and is responsible for ensuring that >>> the SPIP follows the general guidelines. The author should be identified in >>> the SPIP. The authorship of a SPIP can be transferred if the current author >>> is not interested and someone else wants to move the SPIP forward. There >>> should probably be 2-3 authors at most for each SPIP.
>>> >>> - someone with voting power should probably shepherd the SPIP (and be >>> recorded as such): ensuring that the final decision over the SPIP is >>> recorded (rejected, accepted, etc.), and advising about the technical >>> quality of the SPIP: this person need not be a champion for the SPIP or >>> contribute to it, but rather makes sure it stands a chance of being >>> approved when the vote happens. Also, if the author cannot find anyone who >>> would want to take this role, the proposal is likely to be rejected anyway. >>> >>> - users, committers, and contributors have the roles already outlined in >>> the document >>> >>> 3. timeline: ideally, once a SPIP has been offered for voting, it should >>> move swiftly into either being accepted or rejected, so that we do not end >>> up with a distracting long tail of half-hearted proposals. >>> >>> These rules are meant to be flexible, but the current document should be >>> clear about who is in charge of a SPIP, and the state it is currently in. >>> >>> We have had long discussions over some very important questions such as >>> approval. I do not have an opinion on these, but why not make a pick and >>> reevaluate this decision later? This is not a binding process at this point. >>> >>> Tim >>> >>> >>> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <c...@koeninger.org> >>> wrote: >>> >>>> I don't have a concern about voting vs consensus. >>>> >>>> I have a concern that, whatever the decision-making process is, it be >>>> explicitly announced on the ticket for the given proposal, with an explicit >>>> deadline and an explicit outcome. >>>> >>>> >>>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <iras...@cloudera.com> >>>> wrote: >>>> >>>>> I'm also in favor of this. Thanks for your persistence, Cody. >>>>> >>>>> My take on the specific issues Joseph mentioned: >>>>> >>>>> 1) voting vs. consensus -- I agree with the argument Ryan Blue made >>>>> earlier for consensus: >>>>> >>>>> > Majority vs consensus: My rationale is that I don't think we want to >>>>> consider a proposal approved if it had objections serious enough that >>>>> committers down-voted (or PMC depending on who gets a vote). If these >>>>> proposals are like PEPs, then they represent a significant amount of >>>>> community effort and I wouldn't want to move forward if up to half of the >>>>> community thinks it's an untenable idea. >>>>> >>>>> 2) Design doc template -- agree this would be useful, but also seems >>>>> totally orthogonal to moving forward on the SIP proposal. >>>>> >>>>> 3) agree w/ Joseph's proposal for updating the template. >>>>> >>>>> One small addition: >>>>> >>>>> 4) Deciding on a name -- minor, but I think it's worth disambiguating >>>>> from Scala's SIPs, and the best proposal I've heard is "SPIP". At least, >>>>> no one has objected. (I don't care enough that I'd object to anything else, >>>>> though.) >>>>> >>>>> >>>>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <jos...@databricks.com> >>>>> wrote: >>>>> >>>>>> Hi Cody, >>>>>> >>>>>> Thanks for being persistent about this. I too would like to see this >>>>>> happen. Reviewing the thread, it sounds like the main things remaining >>>>>> are: >>>>>> * Decide about a few issues >>>>>> * Finalize the doc(s) >>>>>> * Vote on this proposal >>>>>> >>>>>> Issues & TODOs: >>>>>> >>>>>> (1) The main issue I see above is voting vs. consensus. I have >>>>>> little preference here. It sounds like something that could be tailored >>>>>> based on whether we see too many or too few SIPs being approved.
>>>>>> >>>>>> (2) Design doc template (This would be great to have for Spark >>>>>> regardless of this SIP discussion.) >>>>>> * Reynold, are you still putting this together? >>>>>> >>>>>> (3) Template cleanups. Listing some items mentioned above + a new >>>>>> one w.r.t. Reynold's draft >>>>>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#> >>>>>> : >>>>>> * Reinstate the "Where" section with links to current and past SIPs >>>>>> * Add field for stating explicit deadlines for approval >>>>>> * Add field for stating Author & Committer shepherd >>>>>> >>>>>> Thanks all! >>>>>> Joseph >>>>>> >>>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <c...@koeninger.org> >>>>>> wrote: >>>>>> >>>>>>> I'm bumping this one more time for the new year, and then I'm giving >>>>>>> up. >>>>>>> >>>>>>> Please, fix your process, even if it isn't exactly the way I >>>>>>> suggested. >>>>>>> >>>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> >>>>>>> wrote: >>>>>>> > On lazy consensus as opposed to voting: >>>>>>> > >>>>>>> > First, why lazy consensus? The proposal was for consensus, which >>>>>>> is at least >>>>>>> > three +1 votes and no vetoes. Consensus has no losing side; it >>>>>>> requires >>>>>>> > getting to a point where there is agreement. Isn't that agreement >>>>>>> what we >>>>>>> > want to achieve with these proposals? >>>>>>> > >>>>>>> > Second, lazy consensus only removes the requirement for three +1 >>>>>>> votes. Why >>>>>>> > would we not want at least three committers to think something is >>>>>>> a good >>>>>>> > idea before adopting the proposal? >>>>>>> > >>>>>>> > rb >>>>>>> > >>>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <c...@koeninger.org> >>>>>>> wrote: >>>>>>> >> >>>>>>> >> So there are some minor things (the Where section heading appears >>>>>>> to >>>>>>> >> be dropped; wherever this document is posted, it needs to actually >>>>>>> link >>>>>>> >> to a JIRA filter showing current / past SIPs) but it doesn't look >>>>>>> like >>>>>>> >> I can comment on the Google doc. >>>>>>> >> >>>>>>> >> The major substantive issue that I have is that this version is >>>>>>> >> significantly less clear as to the outcome of an SIP. >>>>>>> >> >>>>>>> >> The Apache example of lazy consensus at >>>>>>> >> http://apache.org/foundation/voting.html#LazyConsensus involves >>>>>>> an >>>>>>> >> explicit announcement with an explicit deadline, both of which I >>>>>>> think are >>>>>>> >> necessary for clarity. >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com> >>>>>>> wrote: >>>>>>> >> > It turned out suggested edits (trackable) don't show up for >>>>>>> non-owners, >>>>>>> >> > so >>>>>>> >> > I've just merged all the edits in place. It should be visible >>>>>>> now. >>>>>>> >> > >>>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin < >>>>>>> r...@databricks.com> >>>>>>> >> > wrote: >>>>>>> >> >> >>>>>>> >> >> Oops. Let me try to figure that out. >>>>>>> >> >> >>>>>>> >> >> >>>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger < >>>>>>> c...@koeninger.org> wrote: >>>>>>> >> >>> >>>>>>> >> >>> Thanks for picking up on this. >>>>>>> >> >>> >>>>>>> >> >>> Maybe I fail at Google Docs, but I can't see any edits on the >>>>>>> document >>>>>>> >> >>> you linked. >>>>>>> >> >>> >>>>>>> >> >>> Regarding lazy consensus, if the board in general has less of >>>>>>> an issue >>>>>>> >> >>> with that, sure.
As long as it is clearly announced, lasts >>>>>>> at least >>>>>>> >> >>> 72 hours, and has a clear outcome. >>>>>>> >> >>> >>>>>>> >> >>> The other points are hard to comment on without being able to >>>>>>> see the >>>>>>> >> >>> text in question. >>>>>>> >> >>> >>>>>>> >> >>> >>>>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin < >>>>>>> r...@databricks.com> >>>>>>> >> >>> wrote: >>>>>>> >> >>> > I just looked through the entire thread again tonight - >>>>>>> there are a >>>>>>> >> >>> > lot >>>>>>> >> >>> > of >>>>>>> >> >>> > great ideas being discussed. Thanks, Cody, for taking the >>>>>>> first crack >>>>>>> >> >>> > at >>>>>>> >> >>> > the >>>>>>> >> >>> > proposal. >>>>>>> >> >>> > >>>>>>> >> >>> > I want to first comment on the context. Spark is one of the >>>>>>> most >>>>>>> >> >>> > innovative >>>>>>> >> >>> > and important projects in (big) data -- overall technical >>>>>>> decisions >>>>>>> >> >>> > made in >>>>>>> >> >>> > Apache Spark are sound. But of course, a project as large >>>>>>> and active >>>>>>> >> >>> > as >>>>>>> >> >>> > Spark always has room for improvement, and we as a >>>>>>> community should >>>>>>> >> >>> > strive >>>>>>> >> >>> > to take it to the next level. >>>>>>> >> >>> > >>>>>>> >> >>> > To that end, the two biggest areas for improvement in my >>>>>>> opinion >>>>>>> >> >>> > are: >>>>>>> >> >>> > >>>>>>> >> >>> > 1. Visibility: There is so much happening that it is >>>>>>> difficult to >>>>>>> >> >>> > know >>>>>>> >> >>> > what >>>>>>> >> >>> > really is going on. For people that don't follow closely, >>>>>>> it is >>>>>>> >> >>> > difficult to >>>>>>> >> >>> > know what the important initiatives are. Even for people >>>>>>> that do >>>>>>> >> >>> > follow, it >>>>>>> >> >>> > is difficult to know what specific things require their >>>>>>> attention, >>>>>>> >> >>> > since the >>>>>>> >> >>> > number of pull requests and JIRA tickets is high and it's >>>>>>> difficult >>>>>>> >> >>> > to >>>>>>> >> >>> > extract signal from noise. >>>>>>> >> >>> > >>>>>>> >> >>> > 2. Solicit user (broadly defined, including developers >>>>>>> themselves) >>>>>>> >> >>> > input >>>>>>> >> >>> > more proactively: At the end of the day the project >>>>>>> provides value >>>>>>> >> >>> > because >>>>>>> >> >>> > users use it. Users can't tell us exactly what to build, >>>>>>> but it is >>>>>>> >> >>> > important >>>>>>> >> >>> > to get their input. >>>>>>> >> >>> > >>>>>>> >> >>> > >>>>>>> >> >>> > I've taken Cody's doc and edited it: >>>>>>> >> >>> > >>>>>>> >> >>> > >>>>>>> >> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b >>>>>>> >> >>> > (I've made all my modifications trackable) >>>>>>> >> >>> > >>>>>>> >> >>> > There are a couple of high-level changes I made: >>>>>>> >> >>> > >>>>>>> >> >>> > 1. I've consulted a board member and he recommended lazy >>>>>>> consensus >>>>>>> >> >>> > as >>>>>>> >> >>> > opposed to voting. The reason being that in voting there can >>>>>>> easily be a >>>>>>> >> >>> > "loser" >>>>>>> >> >>> > that gets outvoted. >>>>>>> >> >>> > >>>>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to >>>>>>> "optional >>>>>>> >> >>> > design >>>>>>> >> >>> > sketch".
Echoing one of the earlier emails: "IMHO so far >>>>>>> aside from >>>>>>> >> >>> > tagging >>>>>>> >> >>> > things and linking them elsewhere simply having design docs >>>>>>> and >>>>>>> >> >>> > prototypes >>>>>>> >> >>> > implementations in PRs is not something that has not worked >>>>>>> so far". >>>>>>> >> >>> > >>>>>>> >> >>> > 3. I made some language tweaks to focus more on >>>>>>> visibility. For >>>>>>> >> >>> > example, >>>>>>> >> >>> > "The purpose of an SIP is to inform and involve", rather >>>>>>> than just >>>>>>> >> >>> > "involve". SIPs should also have at least two emails that >>>>>>> go to >>>>>>> >> >>> > dev@. >>>>>>> >> >>> > >>>>>>> >> >>> > >>>>>>> >> >>> > While I was editing this, I thought we really needed a >>>>>>> suggested >>>>>>> >> >>> > template >>>>>>> >> >>> > for design docs too. I will get to that too ... >>>>>>> >> >>> > >>>>>>> >> >>> > >>>>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin < >>>>>>> r...@databricks.com> >>>>>>> >> >>> > wrote: >>>>>>> >> >>> >> >>>>>>> >> >>> >> Most things looked OK to me too, although I do plan to >>>>>>> take a >>>>>>> >> >>> >> closer >>>>>>> >> >>> >> look >>>>>>> >> >>> >> after Nov 1st when we cut the release branch for 2.1. >>>>>>> >> >>> >> >>>>>>> >> >>> >> >>>>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin >>>>>>> >> >>> >> <van...@cloudera.com> >>>>>>> >> >>> >> wrote: >>>>>>> >> >>> >>> >>>>>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's >>>>>>> not >>>>>>> >> >>> >>> explicitly >>>>>>> >> >>> >>> called out, that voting would happen by e-mail? A template >>>>>>> for the >>>>>>> >> >>> >>> proposal document (instead of just a bullet list) would >>>>>>> also be >>>>>>> >> >>> >>> nice, >>>>>>> >> >>> >>> but that can be done at any time. >>>>>>> >> >>> >>> >>>>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085, which I consider >>>>>>> a >>>>>>> >> >>> >>> candidate >>>>>>> >> >>> >>> for a SIP, given the scope of the work. The document >>>>>>> attached even >>>>>>> >> >>> >>> somewhat matches the proposed format. So if anyone wants >>>>>>> to try >>>>>>> >> >>> >>> out >>>>>>> >> >>> >>> the process... >>>>>>> >> >>> >>> >>>>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger >>>>>>> >> >>> >>> <c...@koeninger.org> >>>>>>> >> >>> >>> wrote: >>>>>>> >> >>> >>> > Now that Spark Summit Europe is over, are any committers >>>>>>> >> >>> >>> > interested >>>>>>> >> >>> >>> > in >>>>>>> >> >>> >>> > moving forward with this? >>>>>>> >> >>> >>> > >>>>>>> >> >>> >>> > >>>>>>> >> >>> >>> > >>>>>>> >> >>> >>> > >>>>>>> >> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md >>>>>>> >> >>> >>> > >>>>>>> >> >>> >>> > Or are we going to let this discussion die on the vine? >>>>>>> >> >>> >>> > >>>>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda >>>>>>> >> >>> >>> > <tomasz.gaw...@outlook.com> wrote: >>>>>>> >> >>> >>> >> Maybe my mail was not clear enough. >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> I didn't want to write "let's focus on Flink" or any >>>>>>> other >>>>>>> >> >>> >>> >> framework.
>>>>>>> >> >>> >>> >> The >>>>>>> >> >>> >>> >> idea with benchmarks was to show two things: >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> - why some people are doing bad PR for Spark >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> - how - in an easy way - we can change that and show that >>>>>>> Spark is >>>>>>> >> >>> >>> >> still on >>>>>>> >> >>> >>> >> top >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I >>>>>>> don't think >>>>>>> >> >>> >>> >> they're the >>>>>>> >> >>> >>> >> most important thing in Spark :) On the Spark main >>>>>>> page there >>>>>>> >> >>> >>> >> is >>>>>>> >> >>> >>> >> still >>>>>>> >> >>> >>> >> the chart >>>>>>> >> >>> >>> >> "Spark vs Hadoop". It is important to show that the >>>>>>> framework is >>>>>>> >> >>> >>> >> not >>>>>>> >> >>> >>> >> the >>>>>>> >> >>> >>> >> same >>>>>>> >> >>> >>> >> Spark with another API, but is much faster and more optimized, >>>>>>> comparable to >>>>>>> >> >>> >>> >> or >>>>>>> >> >>> >>> >> even >>>>>>> >> >>> >>> >> faster than other frameworks. >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> About real-time streaming, I think it would just be >>>>>>> good to see >>>>>>> >> >>> >>> >> it >>>>>>> >> >>> >>> >> in >>>>>>> >> >>> >>> >> Spark. >>>>>>> >> >>> >>> >> I like the current Spark model very much, but there are many voices >>>>>>> saying "we >>>>>>> >> >>> >>> >> need >>>>>>> >> >>> >>> >> more" - the >>>>>>> >> >>> >>> >> community should also listen to them and try to help >>>>>>> them. With >>>>>>> >> >>> >>> >> SIPs >>>>>>> >> >>> >>> >> it >>>>>>> >> >>> >>> >> would >>>>>>> >> >>> >>> >> be easier; I've just posted this example as a "thing >>>>>>> that may be >>>>>>> >> >>> >>> >> changed >>>>>>> >> >>> >>> >> with a >>>>>>> >> >>> >>> >> SIP". >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> I like the unification via Datasets very much, but there are a >>>>>>> lot of >>>>>>> >> >>> >>> >> algorithms >>>>>>> >> >>> >>> >> inside - let's make an easy API, but with strong >>>>>>> background material >>>>>>> >> >>> >>> >> (articles, >>>>>>> >> >>> >>> >> benchmarks, descriptions, etc.) that shows that Spark >>>>>>> is still a >>>>>>> >> >>> >>> >> modern >>>>>>> >> >>> >>> >> framework. >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said, >>>>>>> >> >>> >>> >> organizational >>>>>>> >> >>> >>> >> ideas >>>>>>> >> >>> >>> >> were already mentioned and I agree with them; my mail >>>>>>> was just >>>>>>> >> >>> >>> >> to >>>>>>> >> >>> >>> >> show >>>>>>> >> >>> >>> >> some >>>>>>> >> >>> >>> >> aspects from my side, so from the side of a developer and >>>>>>> a person >>>>>>> >> >>> >>> >> who >>>>>>> >> >>> >>> >> is >>>>>>> >> >>> >>> >> trying >>>>>>> >> >>> >>> >> to help others with Spark (via Stack Overflow or other >>>>>>> ways). >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> Pozdrawiam / Best regards, >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> Tomasz >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> ________________________________ >>>>>>> >> >>> >>> >> From: Cody Koeninger <c...@koeninger.org> >>>>>>> >> >>> >>> >> Sent: 17 October 2016 16:46 >>>>>>> >> >>> >>> >> To: Debasish Das >>>>>>> >> >>> >>> >> Cc: Tomasz Gawęda; dev@spark.apache.org >>>>>>> >> >>> >>> >> Subject: Re: Spark Improvement Proposals >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is >>>>>>> missing my >>>>>>> >> >>> >>> >> point.
>>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> My point is evolve or die. Spark's governance and >>>>>>> organization >>>>>>> >> >>> >>> >> is >>>>>>> >> >>> >>> >> hampering its ability to evolve technologically, and >>>>>>> it needs >>>>>>> >> >>> >>> >> to >>>>>>> >> >>> >>> >> change. >>>>>>> >> >>> >>> >> >>>>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das >>>>>>> >> >>> >>> >> <debasish.da...@gmail.com> >>>>>>> >> >>> >>> >> wrote: >>>>>>> >> >>> >>> >>> Thanks, Cody, for bringing up a valid point...I picked >>>>>>> up Spark >>>>>>> >> >>> >>> >>> in >>>>>>> >> >>> >>> >>> 2014 >>>>>>> >> >>> >>> >>> as >>>>>>> >> >>> >>> >>> soon as I looked into it, since compared to writing >>>>>>> Java >>>>>>> >> >>> >>> >>> map-reduce >>>>>>> >> >>> >>> >>> and >>>>>>> >> >>> >>> >>> Cascading code, Spark made writing distributed code >>>>>>> fun...But >>>>>>> >> >>> >>> >>> now, >>>>>>> >> >>> >>> >>> as >>>>>>> >> >>> >>> >>> we >>>>>>> >> >>> >>> >>> go >>>>>>> >> >>> >>> >>> deeper with Spark and the real-time streaming use case >>>>>>> gets more >>>>>>> >> >>> >>> >>> prominent, I >>>>>>> >> >>> >>> >>> think it is time to bring a messaging model in >>>>>>> conjunction >>>>>>> >> >>> >>> >>> with >>>>>>> >> >>> >>> >>> the >>>>>>> >> >>> >>> >>> batch/micro-batch API that Spark is good >>>>>>> at....akka-streams' >>>>>>> >> >>> >>> >>> close >>>>>>> >> >>> >>> >>> integration with Spark micro-batching APIs looks like >>>>>>> a great >>>>>>> >> >>> >>> >>> direction to >>>>>>> >> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0 >>>>>>> integrated >>>>>>> >> >>> >>> >>> streaming >>>>>>> >> >>> >>> >>> with >>>>>>> >> >>> >>> >>> batch with the assumption that micro-batching is >>>>>>> sufficient >>>>>>> >> >>> >>> >>> to >>>>>>> >> >>> >>> >>> run >>>>>>> >> >>> >>> >>> SQL >>>>>>> >> >>> >>> >>> commands on a stream, but do we really have time to do >>>>>>> SQL >>>>>>> >> >>> >>> >>> processing on >>>>>>> >> >>> >>> >>> streaming data within 1-2 seconds? >>>>>>> >> >>> >>> >>> >>>>>>> >> >>> >>> >>> After reading the email chain, I started to look into the >>>>>>> Flink >>>>>>> >> >>> >>> >>> documentation, >>>>>>> >> >>> >>> >>> and if you compare it with the Spark documentation, I >>>>>>> think we >>>>>>> >> >>> >>> >>> have >>>>>>> >> >>> >>> >>> major >>>>>>> >> >>> >>> >>> work >>>>>>> >> >>> >>> >>> to do detailing Spark internals so that more >>>>>>> people from >>>>>>> >> >>> >>> >>> the community >>>>>>> >> >>> >>> >>> start >>>>>>> >> >>> >>> >>> to take an active role in addressing the issues, so that >>>>>>> Spark >>>>>>> >> >>> >>> >>> stays >>>>>>> >> >>> >>> >>> strong >>>>>>> >> >>> >>> >>> compared to Flink. >>>>>>> >> >>> >>> >>> >>>>>>> >> >>> >>> >>> >>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals >>>>>>> >> >>> >>> >>> >>>>>>> >> >>> >>> >>> >>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals >>>>>>> >> >>> >>> >>> >>>>>>> >> >>> >>> >>> Spark is no longer an engine that works only for >>>>>>> micro-batch and >>>>>>> >> >>> >>> >>> batch...We >>>>>>> >> >>> >>> >>> (and >>>>>>> >> >>> >>> >>> I am sure many others) are pushing Spark as an engine >>>>>>> for >>>>>>> >> >>> >>> >>> stream >>>>>>> >> >>> >>> >>> and >>>>>>> >> >>> >>> >>> query >>>>>>> >> >>> >>> >>> processing.....we need to make it a state-of-the-art >>>>>>> engine >>>>>>> >> >>> >>> >>> for >>>>>>> >> >>> >>> >>> high-speed >>>>>>> >> >>> >>> >>> streaming data and user queries as well!
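To make the micro-batch trade-off raised above concrete: in Structured Streaming, a "streaming SQL" query runs as a series of small batch jobs, so end-to-end latency is roughly bounded below by the trigger interval plus per-batch scheduling overhead. A minimal sketch, assuming Spark 2.2+ (the built-in "rate" test source and Trigger.ProcessingTime postdate this thread, and the object name is hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MicroBatchSketch").getOrCreate()

    // The built-in "rate" test source emits (timestamp, value) rows at a fixed rate.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "1000")
      .load()

    // A SQL-style aggregation over the stream: a running count of all rows seen so far.
    val counts = stream.groupBy().count()

    // Each trigger fires one small batch job; with a 1-second trigger, results
    // cannot arrive faster than roughly once per second.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .trigger(Trigger.ProcessingTime("1 second"))
      .start()

    query.awaitTermination()
  }
}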
>>>>>>> >> >>> >>> >>> >>>>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda >>>>>>> >> >>> >>> >>> <tomasz.gaw...@outlook.com> >>>>>>> >> >>> >>> >>> wrote: >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> Hi everyone, >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my >>>>>>> suggestions may >>>>>>> >> >>> >>> >>>> help a >>>>>>> >> >>> >>> >>>> little bit. :) Many technical and organizational >>>>>>> topics were >>>>>>> >> >>> >>> >>>> mentioned, >>>>>>> >> >>> >>> >>>> but I want to focus on the negative posts about >>>>>>> Spark and >>>>>>> >> >>> >>> >>>> about the >>>>>>> >> >>> >>> >>>> "haters". >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> I really like Spark. Ease of use, speed, a very good >>>>>>> community >>>>>>> >> >>> >>> >>>> - >>>>>>> >> >>> >>> >>>> it's >>>>>>> >> >>> >>> >>>> all here. But every project has to "fight" on the >>>>>>> >> >>> >>> >>>> "framework >>>>>>> >> >>> >>> >>>> market" >>>>>>> >> >>> >>> >>>> to stay number 1. I'm following many Spark and Big >>>>>>> Data >>>>>>> >> >>> >>> >>>> communities; >>>>>>> >> >>> >>> >>>> maybe my mail will inspire someone :) >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> You (every Spark developer; so far I haven't had >>>>>>> enough time >>>>>>> >> >>> >>> >>>> to >>>>>>> >> >>> >>> >>>> start >>>>>>> >> >>> >>> >>>> contributing to Spark) have done an excellent job. So >>>>>>> why are >>>>>>> >> >>> >>> >>>> some >>>>>>> >> >>> >>> >>>> people >>>>>>> >> >>> >>> >>>> saying that Flink (or another framework) is better, >>>>>>> as was >>>>>>> >> >>> >>> >>>> posted >>>>>>> >> >>> >>> >>>> on >>>>>>> >> >>> >>> >>>> this mailing list? Not because that framework is >>>>>>> better >>>>>>> >> >>> >>> >>>> in >>>>>>> >> >>> >>> >>>> all >>>>>>> >> >>> >>> >>>> cases. In my opinion, many of these discussions >>>>>>> were >>>>>>> >> >>> >>> >>>> started >>>>>>> >> >>> >>> >>>> after >>>>>>> >> >>> >>> >>>> Flink's marketing-like posts. Please look at the >>>>>>> Stack Overflow >>>>>>> >> >>> >>> >>>> "Flink >>>>>>> >> >>> >>> >>>> vs >>>>>>> >> >>> >>> >>>> ...." >>>>>>> >> >>> >>> >>>> posts; almost every one is "won" by Flink. The answers >>>>>>> sometimes >>>>>>> >> >>> >>> >>>> say nothing about other frameworks; Flink's users >>>>>>> (often >>>>>>> >> >>> >>> >>>> PMC members) >>>>>>> >> >>> >>> >>>> are >>>>>>> >> >>> >>> >>>> just posting the same information about real-time >>>>>>> streaming, >>>>>>> >> >>> >>> >>>> about >>>>>>> >> >>> >>> >>>> delta >>>>>>> >> >>> >>> >>>> iterations, etc. It looks smart and very often it is >>>>>>> marked as >>>>>>> >> >>> >>> >>>> the >>>>>>> >> >>> >>> >>>> answer, >>>>>>> >> >>> >>> >>>> even if - in my opinion - the whole truth wasn't >>>>>>> told. >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> My suggestion: I don't have enough money or >>>>>>> knowledge to >>>>>>> >> >>> >>> >>>> perform a >>>>>>> >> >>> >>> >>>> huge >>>>>>> >> >>> >>> >>>> performance test. Maybe some company that supports >>>>>>> Spark >>>>>>> >> >>> >>> >>>> (Databricks, >>>>>>> >> >>> >>> >>>> Cloudera?
- just saying, you're the most visible in the >>>>>>> community :) ) >>>>>>> >> >>> >>> >>>> could >>>>>>> >> >>> >>> >>>> perform performance tests of: >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> - the streaming engine - probably Spark will lose >>>>>>> because of the >>>>>>> >> >>> >>> >>>> mini-batch >>>>>>> >> >>> >>> >>>> model; however, the difference should currently be >>>>>>> much smaller >>>>>>> >> >>> >>> >>>> than in >>>>>>> >> >>> >>> >>>> previous versions >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> - Machine Learning models >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> - batch jobs >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> - Graph jobs >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> - SQL queries >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> People will see that Spark is evolving and is still >>>>>>> a modern >>>>>>> >> >>> >>> >>>> framework, >>>>>>> >> >>> >>> >>>> because after reading the posts mentioned above, people >>>>>>> may think >>>>>>> >> >>> >>> >>>> "it >>>>>>> >> >>> >>> >>>> is >>>>>>> >> >>> >>> >>>> outdated; the future is in framework X". >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> Matei Zaharia posted an excellent blog post about how >>>>>>> Spark >>>>>>> >> >>> >>> >>>> Structured >>>>>>> >> >>> >>> >>>> Streaming beats every other framework in terms of >>>>>>> ease of use >>>>>>> >> >>> >>> >>>> and >>>>>>> >> >>> >>> >>>> reliability. Performance tests, done in various >>>>>>> environments >>>>>>> >> >>> >>> >>>> (for >>>>>>> >> >>> >>> >>>> example: a laptop, a small 2-node cluster, a 10-node >>>>>>> cluster, a >>>>>>> >> >>> >>> >>>> 20-node >>>>>>> >> >>> >>> >>>> cluster), could also be very good marketing material to >>>>>>> say >>>>>>> >> >>> >>> >>>> "hey, >>>>>>> >> >>> >>> >>>> you >>>>>>> >> >>> >>> >>>> claim you're better, but Spark is still >>>>>>> faster and is >>>>>>> >> >>> >>> >>>> still >>>>>>> >> >>> >>> >>>> getting even faster!". This would be based on >>>>>>> facts (just >>>>>>> >> >>> >>> >>>> numbers), >>>>>>> >> >>> >>> >>>> not opinions. It would be good for companies, for >>>>>>> marketing >>>>>>> >> >>> >>> >>>> purposes, >>>>>>> >> >>> >>> >>>> and >>>>>>> >> >>> >>> >>>> for every Spark developer. >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> Second: real-time streaming. I wrote some time >>>>>>> ago about >>>>>>> >> >>> >>> >>>> real-time >>>>>>> >> >>> >>> >>>> streaming support in Spark Structured Streaming. >>>>>>> Some work >>>>>>> >> >>> >>> >>>> should be >>>>>>> >> >>> >>> >>>> done to make SSS lower-latency, but I think it's possible. >>>>>>> >> >>> >>> >>>> Maybe >>>>>>> >> >>> >>> >>>> Spark could look at Gearpump, which is also built on >>>>>>> top of >>>>>>> >> >>> >>> >>>> Akka? >>>>>>> >> >>> >>> >>>> I >>>>>>> >> >>> >>> >>>> don't >>>>>>> >> >>> >>> >>>> know yet; it is a good topic for a SIP. However, I think >>>>>>> that >>>>>>> >> >>> >>> >>>> Spark >>>>>>> >> >>> >>> >>>> should >>>>>>> >> >>> >>> >>>> have real-time streaming support. Currently I see >>>>>>> many >>>>>>> >> >>> >>> >>>> posts/comments >>>>>>> >> >>> >>> >>>> saying "Spark has too high latency". Spark Streaming does >>>>>>> a very >>>>>>> >> >>> >>> >>>> good >>>>>>> >> >>> >>> >>>> job with micro-batches; however, I think it is >>>>>>> possible to >>>>>>> >> >>> >>> >>>> also add >>>>>>> >> >>> >>> >>>> more >>>>>>> >> >>> >>> >>>> real-time processing.
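For the kind of performance comparison proposed above, here is a minimal sketch of a timing harness for a single batch workload (the object name and synthetic workload are hypothetical; this is not an official Spark benchmark):

import org.apache.spark.sql.SparkSession

object MiniBench {
  // Time a block of code and report elapsed wall-clock seconds.
  def time[T](label: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block
    println(f"$label took ${(System.nanoTime() - start) / 1e9}%.3f s")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MiniBench").getOrCreate()

    // Synthetic batch workload: aggregate 100 million rows into 1000 groups.
    val df = spark.range(100000000L).selectExpr("id % 1000 AS k")

    time("batch aggregation") {
      df.groupBy("k").count().collect()
    }

    spark.stop()
  }
}

A credible published comparison would of course also need warm-up runs, repeated trials, and identical hardware for each framework.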
>>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> Other people have said much more, and I agree with the >>>>>>> SIP proposal. >>>>>>> >> >>> >>> >>>> I'm >>>>>>> >> >>> >>> >>>> also >>>>>>> >> >>> >>> >>>> happy that the PMC members are not saying that they will not >>>>>>> listen to >>>>>>> >> >>> >>> >>>> users, >>>>>>> >> >>> >>> >>>> but >>>>>>> >> >>> >>> >>>> that they really want to make Spark better for every user. >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> What do you think about these two topics? I'm especially >>>>>>> >> >>> >>> >>>> looking >>>>>>> >> >>> >>> >>>> at >>>>>>> >> >>> >>> >>>> Cody >>>>>>> >> >>> >>> >>>> (who started this topic) and the PMC members :) >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> Pozdrawiam / Best regards, >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> Tomasz >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>> >>>>>>> >> >>> >>> >>>>>>> >> >>> >> >>>>>>> >> >>> > >>>>>>> >> >>> > >>>>>>> >> > >>>>>>> >> > >>>>>>> >> >>>>>>> >> --------------------------------------------------------------------- >>>>>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>>>>> >> >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > -- >>>>>>> > Ryan Blue >>>>>>> > Software Engineer >>>>>>> > Netflix >>>>>>> >>>>>>> --------------------------------------------------------------------- >>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> >>>>>> Joseph Bradley >>>>>> >>>>>> Software Engineer - Machine Learning >>>>>> >>>>>> Databricks, Inc. >>>>>> >>>>>> <http://databricks.com/> >>>>> >>>>> >>>> >>> >> >