Thanks for doing that. Given that there are at least 4 different Apache voting processes, "typical Apache vote process" isn't meaningful to me.
I think the intention is that in order to pass, a proposal needs at least three +1 votes from PMC members *and no -1 votes from PMC members*. But the document doesn't explicitly say that second part. There's also no mention of how long a vote should remain open. There is a mention of a month for finding a shepherd, but that's different. Other than that, LGTM.

On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <r...@databricks.com> wrote:

Here's a new draft that incorporates most of the feedback:
https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#

I added a specific role for SPIP Author and another one for SPIP Shepherd.

On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <gatorsm...@gmail.com> wrote:

During the summit, I also had a lot of discussions over similar topics with multiple committers and active users. I heard many fantastic ideas. I believe Spark improvement proposals are good channels for collecting requirements and designs.

IMO, we also need to consider priority when working on these items. Even if a proposal is accepted, that does not mean it will be implemented and merged immediately. It is not a FIFO queue.

Even after some PRs are merged, we sometimes still have to revert them if the design and implementation were not reviewed carefully. We have to ensure quality. Spark is not application software; it is infrastructure software used by many, many companies. We have to be very careful in design and implementation, especially when adding or changing external APIs.

When I developed Mainframe infrastructure/middleware software over the past six years, I was involved in discussions with external and internal customers. The to-do feature list was always above 100 items. Sometimes customers felt frustrated when we were unable to deliver on time due to resource limits and other constraints. Even if they paid us billions, we would still need to proceed phase by phase, or sometimes they would have to accept workarounds. That is the reality everyone has to face, I think.

Thanks,

Xiao Li

2017-02-11 7:57 GMT-08:00 Cody Koeninger <c...@koeninger.org>:

At the Spark Summit this week, everyone from PMC members to users I had never met before was asking me about the Spark improvement proposals idea. It's clear that it's a real community need.

But it's been almost half a year, and nothing visible has been done.

Reynold, are you going to do this?

If so, when?

If not, why?

You already did the right thing by including long-deserved committers. Please keep doing the right thing for the community.

On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <r...@databricks.com> wrote:

+1 on all counts (consensus, time bound, define roles).

I can update the doc in the next few days and share it back. Then maybe we can just officially vote on this. As Tim suggested, we might not get it 100% right the first time and would need to iterate. But that's fine.

On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <timhun...@databricks.com> wrote:

Hi Cody,
Thank you for bringing up this topic. I agree it is very important to keep a cohesive community around some common, fluid goals. Here are a few comments about the current document:

1. Name: it should not overlap with an existing one such as SIP. Can you imagine someone trying to discuss a Scala spore proposal for Spark? "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP sounds great.

2. Roles: at a high level, SPIPs are meant to reach consensus on technical decisions with a lasting impact. As such, the template should emphasize the role of the various parties during this process:

- The SPIP author is responsible for building consensus. She is the champion driving the process forward and is responsible for ensuring that the SPIP follows the general guidelines. The author should be identified in the SPIP. The authorship of a SPIP can be transferred if the current author is not interested and someone else wants to move the SPIP forward. There should probably be 2-3 authors at most for each SPIP.

- Someone with voting power should probably shepherd the SPIP (and be recorded as such): ensuring that the final decision over the SPIP is recorded (rejected, accepted, etc.), and advising on the technical quality of the SPIP. This person need not be a champion for the SPIP or contribute to it, but rather makes sure it stands a chance of being approved when the vote happens. Also, if the author cannot find anyone willing to take this role, the proposal is likely to be rejected anyway.

- Users, committers, and contributors have the roles already outlined in the document.

3. Timeline: ideally, once a SPIP has been offered for voting, it should move swiftly into either being accepted or rejected, so that we do not end up with a distracting long tail of half-hearted proposals.

These rules are meant to be flexible, but the current document should be clear about who is in charge of a SPIP and the state it is currently in.

We have had long discussions over some very important questions such as approval. I do not have an opinion on these, but why not make a pick and reevaluate the decision later? This is not a binding process at this point.

Tim

On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <c...@koeninger.org> wrote:

I don't have a concern about voting vs. consensus.

I have a concern that whatever the decision-making process is, it should be explicitly announced on the ticket for the given proposal, with an explicit deadline and an explicit outcome.

On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <iras...@cloudera.com> wrote:

I'm also in favor of this. Thanks for your persistence, Cody.

My take on the specific issues Joseph mentioned:

1) Voting vs. consensus -- I agree with the argument Ryan Blue made earlier for consensus:

> Majority vs consensus: My rationale is that I don't think we want to consider a proposal approved if it had objections serious enough that committers down-voted (or PMC, depending on who gets a vote). If these proposals are like PEPs, then they represent a significant amount of community effort and I wouldn't want to move forward if up to half of the community thinks it's an untenable idea.

2) Design doc template -- agree this would be useful, but it also seems totally orthogonal to moving forward on the SIP proposal.

3) Agree with Joseph's proposal for updating the template.

One small addition:

4) Deciding on a name -- minor, but I think it's worth disambiguating from Scala's SIPs, and the best proposal I've heard is "SPIP". At least, no one has objected. (I don't care enough that I'd object to anything else, though.)

On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <jos...@databricks.com> wrote:

Hi Cody,

Thanks for being persistent about this. I too would like to see this happen.
Reviewing the thread, it sounds like the main things remaining are:
* Decide about a few issues
* Finalize the doc(s)
* Vote on this proposal

Issues & TODOs:

(1) The main issue I see above is voting vs. consensus. I have little preference here. It sounds like something which could be tailored based on whether we see too many or too few SIPs being approved.

(2) Design doc template (this would be great to have for Spark regardless of this SIP discussion).
* Reynold, are you still putting this together?

(3) Template cleanups. Listing some items mentioned above, plus a new one w.r.t. Reynold's draft <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>:
* Reinstate the "Where" section with links to current and past SIPs
* Add a field for stating explicit deadlines for approval
* Add a field for stating the Author and the Committer shepherd

Thanks all!
Joseph

On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <c...@koeninger.org> wrote:

I'm bumping this one more time for the new year, and then I'm giving up.

Please, fix your process, even if it isn't exactly the way I suggested.

On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:

On lazy consensus as opposed to voting:

First, why lazy consensus? The proposal was for consensus, which is at least three +1 votes and no vetoes. Consensus has no losing side; it requires getting to a point where there is agreement. Isn't that agreement what we want to achieve with these proposals?

Second, lazy consensus only removes the requirement for three +1 votes. Why would we not want at least three committers to think something is a good idea before adopting the proposal?

rb

On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <c...@koeninger.org> wrote:

So there are some minor things (the "Where" section heading appears to be dropped; wherever this document is posted, it needs to actually link to a JIRA filter showing current/past SIPs), but it doesn't look like I can comment on the Google doc.

The major substantive issue I have is that this version is significantly less clear as to the outcome of an SIP.

The Apache example of lazy consensus at http://apache.org/foundation/voting.html#LazyConsensus involves an explicit announcement of an explicit deadline, which I think are necessary for clarity.

On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com> wrote:

It turned out suggested edits (trackable) don't show up for non-owners, so I've just merged all the edits in place. It should be visible now.

On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote:

Oops. Let me try to figure that out.

On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:

Thanks for picking up on this.

Maybe I fail at Google Docs, but I can't see any edits on the document you linked.

Regarding lazy consensus: if the board in general has less of an issue with that, sure. As long as it is clearly announced, lasts at least 72 hours, and has a clear outcome.

The other points are hard to comment on without being able to see the text in question.

On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:

I just looked through the entire thread again tonight -- there are a lot of great ideas being discussed. Thanks Cody for taking the first crack at the proposal.

I want to first comment on the context. Spark is one of the most innovative and important projects in (big) data -- overall, the technical decisions made in Apache Spark are sound. But of course, a project as large and active as Spark always has room for improvement, and we as a community should strive to take it to the next level.

To that end, the two biggest areas for improvement in my opinion are:

1. Visibility: there is so much happening that it is difficult to know what really is going on. For people who don't follow closely, it is difficult to know what the important initiatives are. Even for people who do follow, it is difficult to know what specific things require their attention, since the number of pull requests and JIRA tickets is high and it's difficult to extract signal from noise.

2. Soliciting user (broadly defined, including developers themselves) input more proactively: at the end of the day, the project provides value because users use it. Users can't tell us exactly what to build, but it is important to get their input.

I've taken Cody's doc and edited it:

https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
(I've made all my modifications trackable)

There are a couple of high-level changes I made:

1. I've consulted a board member and he recommended lazy consensus as opposed to voting, the reason being that in voting there can easily be a "loser" that gets outvoted.

2. I made it lighter weight, and renamed "strategy" to "optional design sketch".
Echoing one of the earlier emails: "IMHO so far aside from tagging things and linking them elsewhere simply having design docs and prototype implementations in PRs is not something that has worked so far".

3. I made some language tweaks to focus more on visibility. For example, "The purpose of an SIP is to inform and involve", rather than just "involve". SIPs should also have at least two emails that go to dev@.

While I was editing this, I thought we really needed a suggested template for the design doc too. I will get to that too ...

On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:

Most things looked OK to me too, although I do plan to take a closer look after Nov 1st, when we cut the release branch for 2.1.

On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

The proposal looks OK to me. I assume, even though it's not explicitly called out, that voting would happen by e-mail? A template for the proposal document (instead of just a bullet list) would also be nice, but that can be done at any time.

BTW, shameless plug: I filed SPARK-18085, which I consider a candidate for a SIP, given the scope of the work. The document attached even somewhat matches the proposed format. So if anyone wants to try out the process...

On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:

Now that Spark Summit Europe is over, are any committers interested in moving forward with this?

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

Or are we going to let this discussion die on the vine?

On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:

Maybe my mail was not clear enough.

I didn't want to write "let's focus on Flink" or any other framework. The idea with benchmarks was to show two things:

- why some people are doing bad PR for Spark

- how, in an easy way, we can change that and show that Spark is still on top

No more, no less. Benchmarks will be helpful, but I don't think they're the most important thing in Spark :) On the Spark main page there is still the "Spark vs Hadoop" chart. It is important to show that the framework is not just the same Spark with another API, but much faster and more optimized, comparable to or even faster than other frameworks.

About real-time streaming: I think it would just be good to see it in Spark. I really like the current Spark model, but there are many voices saying "we need more"; the community should also listen to them and try to help. With SIPs it would be easier; I've just posted this example as a "thing that may be changed with a SIP".

I really like the unification via Datasets, but there are a lot of algorithms inside. Let's make an easy API, but with strong background material (articles, benchmarks, descriptions, etc.) that shows that Spark is still a modern framework.

Maybe now my intention will be clearer :) As I said, the organizational ideas were already mentioned and I agree with them; my mail was just to show some aspects from my side, the side of a developer and a person who is trying to help others with Spark (via StackOverflow or other ways).

Best regards,

Tomasz

________________________________
From: Cody Koeninger <c...@koeninger.org>
Sent: 17 October 2016 16:46
To: Debasish Das
Cc: Tomasz Gawęda; dev@spark.apache.org
Subject: Re: Spark Improvement Proposals

I think narrowly focusing on Flink or benchmarks is missing my point.

My point is evolve or die. Spark's governance and organization is hampering its ability to evolve technologically, and it needs to change.
On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Thanks Cody for bringing up a valid point... I picked up Spark in 2014 as soon as I looked into it, since compared to writing Java map-reduce and Cascading code, Spark made writing distributed code fun... But now, as we go deeper with Spark and the real-time streaming use case gets more prominent, I think it is time to bring a messaging model in conjunction with the batch/micro-batch API that Spark is good at... akka-streams' close integration with Spark micro-batching APIs looks like a great direction to stay in the game with Apache Flink... Spark 2.0 integrated streaming with batch under the assumption that micro-batching is sufficient to run SQL commands on a stream, but do we really have time to do SQL processing on streaming data within 1-2 seconds?

After reading the email chain, I started to look into the Flink documentation, and if you compare it with the Spark documentation, I think we have major work to do detailing out Spark internals, so that more people from the community start to take an active role in improving the issues and Spark stays strong compared to Flink.

https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals

https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals

Spark is no longer an engine that works only for micro-batch and batch... We (and I am sure many others) are pushing Spark as an engine for stream and query processing... We need to make it a state-of-the-art engine for high-speed streaming data and user queries as well!

On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:

Hi everyone,

I'm quite late with my answer, but I think my suggestions may help a little bit. :) Many technical and organizational topics were mentioned, but I want to focus on the negative posts about Spark and about "haters".

I really like Spark. Ease of use, speed, a very good community -- it's all here. But every project has to fight on the "framework market" to stay number one. I'm following many Spark and Big Data communities; maybe my mail will inspire someone :)

You (every Spark developer; so far I didn't have enough time to start contributing to Spark) have done an excellent job. So why are some people saying that Flink (or another framework) is better, as was posted on this mailing list? No, not because that framework is better in all cases. In my opinion, many of these discussions were started after Flink marketing-like posts. Please look at the StackOverflow "Flink vs ..." posts: almost every one is "won" by Flink. The answers sometimes say nothing about other frameworks; Flink's users (often PMC members) just post the same information about real-time streaming, delta iterations, etc. It looks smart, and very often it is marked as the answer, even if, in my opinion, not the whole truth was told.

My suggestion: I don't have enough money and knowledge to perform a huge performance test. Maybe some company that supports Spark (Databricks, Cloudera? - just saying you're the most visible in the community :) ) could perform performance tests of:

- the streaming engine -- probably Spark will lose because of the mini-batch model; however, currently the difference should be much lower than in previous versions

- Machine Learning models

- batch jobs

- graph jobs

- SQL queries

People will see that Spark is evolving and is also a modern framework, because after reading the posts mentioned above people may think "it is outdated, the future is in framework X".

Matei Zaharia posted an excellent blog post about how Spark Structured Streaming beats every other framework in terms of ease of use and reliability. Performance tests, done in various environments (for example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node cluster), could also be very good marketing material: "hey, you're telling us you're better, but Spark is still faster and is still getting faster!". This would be based on facts (just numbers), not opinions. It would be good for companies, for marketing purposes, and for every Spark developer.

Second: real-time streaming. I wrote some time ago about real-time streaming support in Spark Structured Streaming. Some work should be done to make SSS lower latency, but I think it's possible. Maybe Spark could look at Gearpump, which is also built on top of Akka? I don't know yet; it is a good topic for a SIP. However, I think that Spark should have real-time streaming support. Currently I see many posts/comments saying "Spark has too high latency". Spark Streaming is doing a very good job with micro-batches, but I think it is possible to also add more real-time processing.

Other people have said much more, and I agree with the proposal of SIPs. I'm also happy that the PMC members are not saying that they will not listen to users; they really want to make Spark better for every user.

What do you think about these two topics? Especially, I'm looking at Cody (who started this topic) and the PMC members :)

Best regards,

Tomasz

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

--
Ryan Blue
Software Engineer
Netflix

--
Joseph Bradley
Software Engineer - Machine Learning
Databricks, Inc.
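[Editor's note] The thread converges on a concrete pass criterion: at least three binding +1 votes, no binding -1 (veto), and a clearly announced vote that stays open at least 72 hours. The following is a minimal sketch of that rule, not any official Apache tooling; the `Vote` dataclass, `spip_passes` function, and all names are hypothetical, purely to illustrate the rule as stated in the thread.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical model of the pass criterion discussed in the thread:
# consensus = at least three binding +1 votes, no binding -1 (veto),
# and the vote must have remained open for at least 72 hours.

@dataclass
class Vote:
    voter: str
    value: int        # +1, 0, or -1
    binding: bool     # True for PMC members (binding votes)

def spip_passes(votes, opened_at, closed_at, min_hours=72):
    """Return True if the proposal meets the thread's consensus rule."""
    if closed_at - opened_at < timedelta(hours=min_hours):
        return False  # vote was not open long enough
    binding = [v for v in votes if v.binding]
    plus_ones = sum(1 for v in binding if v.value == +1)
    vetoed = any(v.value == -1 for v in binding)
    return plus_ones >= 3 and not vetoed
```

Note that non-binding votes are counted only as input, not toward the threshold, which matches the distinction between PMC and community votes raised at the top of the thread.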