Re: Spark Improvement Proposals

Ryan Blue Thu, 16 Feb 2017 08:43:35 -0800

The current proposal seems process-heavy to me. That's not necessarily bad,
but there are a couple areas I haven't seen discussed.


Why is there a shepherd? If the person proposing a change has a good idea,
I don't see why one is either a good idea or necessary. The result of this
requirement is that each SPIP must attract the attention of a PMC member,
and that PMC member has then taken on extra responsibility. Why can't the
SPIP author simply call a vote when an idea has been sufficiently
discussed? I think *this* proposal would have moved faster if Cody had felt
empowered to bring it to a vote. More steps out of the author's control
will cause fewer ideas to move forward, regardless of quality, so we should
make sure this is balanced by a real benefit.

Why are only PMC members allowed a binding vote? I don't have a strong
inclination one way or another, but until recently this was an open
question. I'd like to hear the argument for restricting voting to PMC
members, or I think we should change it to allow all committers. If this
decision is left to default, let's be more inclusive.
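
To make the binding-vote question concrete, here is a minimal sketch of how
the outcome can change depending on who counts as binding. It assumes the pass
rule Cody describes downthread (at least three +1 votes and no -1 votes among
binding voters). The `Vote` record, the role labels, and the `passes` helper
are hypothetical names made up for illustration; none of them come from the
SPIP document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    voter: str
    value: int  # +1, 0, or -1
    role: str   # hypothetical labels: "pmc", "committer", "contributor"


def passes(votes, binding_roles, min_plus_ones=3):
    """Assumed rule: >= min_plus_ones binding +1 votes and no binding -1."""
    binding = [v for v in votes if v.role in binding_roles]
    plus_ones = sum(1 for v in binding if v.value == 1)
    vetoes = sum(1 for v in binding if v.value == -1)
    return plus_ones >= min_plus_ones and vetoes == 0


votes = [
    Vote("a", 1, "pmc"),
    Vote("b", 1, "pmc"),
    Vote("c", 1, "committer"),
    Vote("d", -1, "contributor"),  # non-binding under both policies below
]

# Same ballots, two policies for who holds a binding vote.
pmc_only = passes(votes, {"pmc"})
inclusive = passes(votes, {"pmc", "committer"})
```

With these made-up ballots the proposal fails when only PMC votes bind (two
binding +1 votes) and passes when committers bind as well, which is the
practical difference the question above is about.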

I would be fine with the proposal overall if there are good reasons behind
these choices.

rb

On Thu, Feb 16, 2017 at 8:22 AM, Reynold Xin <r...@databricks.com> wrote:

> Updated. Any feedback from other community members?
>
>
> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> Thanks for doing that.
>>
>> Given that there are at least 4 different Apache voting processes,
>> "typical Apache vote process" isn't meaningful to me.
>>
>> I think the intention is that in order to pass, it needs at least 3 +1
>> votes from PMC members *and no -1 votes from PMC members*.  But the
>> document doesn't explicitly say that second part.
>>
>> There's also no mention of the duration a vote should remain open.
>> There's a mention of a month for finding a shepherd, but that's different.
>>
>> Other than that, LGTM.
>>
>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Here's a new draft that incorporated most of the feedback:
>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#
>>>
>>> I added a specific role for SPIP Author and another one for SPIP
>>> Shepherd.
>>>
>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <gatorsm...@gmail.com> wrote:
>>>
>>>> During the summit, I also had a lot of discussions over similar topics
>>>> with multiple Committers and active users. I heard many fantastic ideas. I
>>>> believe Spark improvement proposals are good channels to collect the
>>>> requirements/designs.
>>>>
>>>>
>>>> IMO, we also need to consider the priority when working on these items.
>>>> Even if the proposal is accepted, it does not mean it will be implemented
>>>> and merged immediately. It is not a FIFO queue.
>>>>
>>>>
>>>> Even if some PRs are merged, sometimes, we still have to revert them
>>>> back, if the design and implementation are not reviewed carefully. We have
>>>> to ensure our quality. Spark is not application software. It is
>>>> infrastructure software that is being used by many, many companies. We have
>>>> to be very careful in the design and implementation, especially
>>>> adding/changing the external APIs.
>>>>
>>>>
>>>> When I developed the Mainframe infrastructure/middleware software in
>>>> the past 6 years, I was involved in the discussions with external/internal
>>>> customers. The to-do feature list was always above 100. Sometimes, the
>>>> customers felt frustrated when we were unable to deliver them on time
>>>> due to the resource limits and others. Even if they paid us billions, we
>>>> still needed to do it phase by phase or sometimes they had to accept the
>>>> workarounds. That is the reality everyone has to face, I think.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> Xiao Li
>>>>
>>>> 2017-02-11 7:57 GMT-08:00 Cody Koeninger <c...@koeninger.org>:
>>>>
>>>>> At the spark summit this week, everyone from PMC members to users I
>>>>> had never met before was asking me about the Spark improvement proposals
>>>>> idea.  It's clear that it's a real community need.
>>>>>
>>>>> But it's been almost half a year, and nothing visible has been done.
>>>>>
>>>>> Reynold, are you going to do this?
>>>>>
>>>>> If so, when?
>>>>>
>>>>> If not, why?
>>>>>
>>>>> You already did the right thing by including long-deserved
>>>>> committers.  Please keep doing the right thing for the community.
>>>>>
>>>>> On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> +1 on all counts (consensus, time bound, define roles)
>>>>>>
>>>>>> I can update the doc in the next few days and share back. Then maybe
>>>>>> we can just officially vote on this. As Tim suggested, we might not get
>>>>>> it 100% right the first time and would need to re-iterate. But that's fine.
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <timhun...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Cody,
>>>>>>> thank you for bringing up this topic, I agree it is very important
>>>>>>> to keep a cohesive community around some common, fluid goals. Here are a
>>>>>>> few comments about the current document:
>>>>>>>
>>>>>>> 1. name: it should not overlap with an existing one such as SIP. Can
>>>>>>> you imagine someone trying to discuss a scala spore proposal for spark?
>>>>>>> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21".
>>>>>>> SPIP sounds great.
>>>>>>>
>>>>>>> 2. roles: at a high level, SPIPs are meant to reach consensus for
>>>>>>> technical decisions with a lasting impact. As such, the template should
>>>>>>> emphasize the role of the various parties during this process:
>>>>>>>
>>>>>>>  - the SPIP author is responsible for building consensus. She is the
>>>>>>> champion driving the process forward and is responsible for ensuring
>>>>>>> that the SPIP follows the general guidelines. The author should be
>>>>>>> identified in the SPIP. The authorship of a SPIP can be transferred if
>>>>>>> the current author is not interested and someone else wants to move the
>>>>>>> SPIP forward. There should probably be 2-3 authors at most for each SPIP.
>>>>>>>
>>>>>>>  - someone with voting power should probably shepherd the SPIP (and
>>>>>>> be recorded as such): ensuring that the final decision over the SPIP is
>>>>>>> recorded (rejected, accepted, etc.), and advising about the technical
>>>>>>> quality of the SPIP: this person need not be a champion for the SPIP or
>>>>>>> contribute to it, but rather makes sure it stands a chance of being
>>>>>>> approved when the vote happens. Also, if the author cannot find anyone
>>>>>>> who would want to take this role, this proposal is likely to be rejected
>>>>>>> anyway.
>>>>>>>
>>>>>>>  - users, committers, contributors have the roles already outlined
>>>>>>> in the document
>>>>>>>
>>>>>>> 3. timeline: ideally, once a SPIP has been offered for voting, it
>>>>>>> should move swiftly into either being accepted or rejected, so that we
>>>>>>> do not end up with a distracting long tail of half-hearted proposals.
>>>>>>>
>>>>>>> These rules are meant to be flexible, but the current document
>>>>>>> should be clear about who is in charge of a SPIP, and the state it is
>>>>>>> currently in.
>>>>>>>
>>>>>>> We have had long discussions over some very important questions such
>>>>>>> as approval. I do not have an opinion on these, but why not make a pick
>>>>>>> and reevaluate this decision later? This is not a binding process at
>>>>>>> this point.
>>>>>>>
>>>>>>> Tim
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <c...@koeninger.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I don't have a concern about voting vs consensus.
>>>>>>>>
>>>>>>>> I have a concern that whatever the decision making process is, it
>>>>>>>> is explicitly announced on the ticket for the given proposal, with an
>>>>>>>> explicit deadline, and an explicit outcome.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <iras...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I'm also in favor of this.  Thanks for your persistence Cody.
>>>>>>>>>
>>>>>>>>> My take on the specific issues Joseph mentioned:
>>>>>>>>>
>>>>>>>>> 1) voting vs. consensus -- I agree with the argument Ryan Blue
>>>>>>>>> made earlier for consensus:
>>>>>>>>>
>>>>>>>>> > Majority vs consensus: My rationale is that I don't think we
>>>>>>>>> > want to consider a proposal approved if it had objections serious
>>>>>>>>> > enough that committers down-voted (or PMC depending on who gets a
>>>>>>>>> > vote). If these proposals are like PEPs, then they represent a
>>>>>>>>> > significant amount of community effort and I wouldn't want to move
>>>>>>>>> > forward if up to half of the community thinks it's an untenable idea.
>>>>>>>>>
>>>>>>>>> 2) Design doc template -- agree this would be useful, but also
>>>>>>>>> seems totally orthogonal to moving forward on the SIP proposal.
>>>>>>>>>
>>>>>>>>> 3) agree w/ Joseph's proposal for updating the template.
>>>>>>>>>
>>>>>>>>> One small addition:
>>>>>>>>>
>>>>>>>>> 4) Deciding on a name -- minor, but I think it's worth
>>>>>>>>> disambiguating from Scala's SIPs, and the best proposal I've heard is
>>>>>>>>> "SPIP".   At least, no one has objected.  (don't care enough that I'd
>>>>>>>>> object to anything else, though.)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <jos...@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Cody,
>>>>>>>>>>
>>>>>>>>>> Thanks for being persistent about this.  I too would like to see
>>>>>>>>>> this happen.  Reviewing the thread, it sounds like the main things
>>>>>>>>>> remaining are:
>>>>>>>>>> * Decide about a few issues
>>>>>>>>>> * Finalize the doc(s)
>>>>>>>>>> * Vote on this proposal
>>>>>>>>>>
>>>>>>>>>> Issues & TODOs:
>>>>>>>>>>
>>>>>>>>>> (1) The main issue I see above is voting vs. consensus.  I have
>>>>>>>>>> little preference here.  It sounds like something which could be
>>>>>>>>>> tailored based on whether we see too many or too few SIPs being approved.
>>>>>>>>>>
>>>>>>>>>> (2) Design doc template  (This would be great to have for Spark
>>>>>>>>>> regardless of this SIP discussion.)
>>>>>>>>>> * Reynold, are you still putting this together?
>>>>>>>>>>
>>>>>>>>>> (3) Template cleanups.  Listing some items mentioned above + a
>>>>>>>>>> new one w.r.t. Reynold's draft
>>>>>>>>>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
>>>>>>>>>> :
>>>>>>>>>> * Reinstate the "Where" section with links to current and past
>>>>>>>>>> SIPs
>>>>>>>>>> * Add field for stating explicit deadlines for approval
>>>>>>>>>> * Add field for stating Author & Committer shepherd
>>>>>>>>>>
>>>>>>>>>> Thanks all!
>>>>>>>>>> Joseph
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm bumping this one more time for the new year, and then I'm
>>>>>>>>>>> giving up.
>>>>>>>>>>>
>>>>>>>>>>> Please, fix your process, even if it isn't exactly the way I
>>>>>>>>>>> suggested.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> > On lazy consensus as opposed to voting:
>>>>>>>>>>> >
>>>>>>>>>>> > First, why lazy consensus? The proposal was for consensus, which is at least
>>>>>>>>>>> > three +1 votes and no vetoes. Consensus has no losing side, it requires
>>>>>>>>>>> > getting to a point where there is agreement. Isn't that agreement what we
>>>>>>>>>>> > want to achieve with these proposals?
>>>>>>>>>>> >
>>>>>>>>>>> > Second, lazy consensus only removes the requirement for three +1 votes. Why
>>>>>>>>>>> > would we not want at least three committers to think something is a good
>>>>>>>>>>> > idea before adopting the proposal?
>>>>>>>>>>> >
>>>>>>>>>>> > rb
>>>>>>>>>>> >
>>>>>>>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>>>>>>> >>
>>>>>>>>>>> >> So there are some minor things (the Where section heading appears to
>>>>>>>>>>> >> be dropped; wherever this document is posted it needs to actually link
>>>>>>>>>>> >> to a jira filter showing current / past SIPs) but it doesn't look like
>>>>>>>>>>> >> I can comment on the google doc.
>>>>>>>>>>> >>
>>>>>>>>>>> >> The major substantive issue that I have is that this version is
>>>>>>>>>>> >> significantly less clear as to the outcome of an SIP.
>>>>>>>>>>> >>
>>>>>>>>>>> >> The apache example of lazy consensus at
>>>>>>>>>>> >> http://apache.org/foundation/voting.html#LazyConsensus involves an
>>>>>>>>>>> >> explicit announcement of an explicit deadline, which I think are
>>>>>>>>>>> >> necessary for clarity.
>>>>>>>>>>> >>
>>>>>>>>>>> >>
>>>>>>>>>>> >>
>>>>>>>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>>>> >> > It turned out suggested edits (trackable) don't show up for non-owners,
>>>>>>>>>>> >> > so I've just merged all the edits in place. It should be visible now.
>>>>>>>>>>> >> >
>>>>>>>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Oops. Let me try to figure that out.
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>>>>>>> >> >>>
>>>>>>>>>>> >> >>> Thanks for picking up on this.
>>>>>>>>>>> >> >>>
>>>>>>>>>>> >> >>> Maybe I fail at google docs, but I can't see any edits on the document
>>>>>>>>>>> >> >>> you linked.
>>>>>>>>>>> >> >>>
>>>>>>>>>>> >> >>> Regarding lazy consensus, if the board in general has less of an issue
>>>>>>>>>>> >> >>> with that, sure.  As long as it is clearly announced, lasts at least
>>>>>>>>>>> >> >>> 72 hours, and has a clear outcome.
>>>>>>>>>>> >> >>>
>>>>>>>>>>> >> >>> The other points are hard to comment on without being able to see the
>>>>>>>>>>> >> >>> text in question.
>>>>>>>>>>> >> >>>
>>>>>>>>>>> >> >>>
>>>>>>>>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>>>> >> >>> > I just looked through the entire thread again tonight - there are a lot of
>>>>>>>>>>> >> >>> > great ideas being discussed. Thanks Cody for taking the first crack at the
>>>>>>>>>>> >> >>> > proposal.
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > I want to first comment on the context. Spark is one of the most innovative
>>>>>>>>>>> >> >>> > and important projects in (big) data -- overall technical decisions made in
>>>>>>>>>>> >> >>> > Apache Spark are sound. But of course, a project as large and active as
>>>>>>>>>>> >> >>> > Spark always has room for improvement, and we as a community should strive
>>>>>>>>>>> >> >>> > to take it to the next level.
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > To that end, the two biggest areas for improvements in my opinion are:
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > 1. Visibility: There is so much happening that it is difficult to know what
>>>>>>>>>>> >> >>> > really is going on. For people that don't follow closely, it is difficult to
>>>>>>>>>>> >> >>> > know what the important initiatives are. Even for people that do follow, it
>>>>>>>>>>> >> >>> > is difficult to know what specific things require their attention, since the
>>>>>>>>>>> >> >>> > number of pull requests and JIRA tickets is high and it's difficult to
>>>>>>>>>>> >> >>> > extract signal from noise.
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > 2. Solicit user (broadly defined, including developers themselves) input
>>>>>>>>>>> >> >>> > more proactively: At the end of the day the project provides value because
>>>>>>>>>>> >> >>> > users use it. Users can't tell us exactly what to build, but it is important
>>>>>>>>>>> >> >>> > to get their inputs.
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > I've taken Cody's doc and edited it:
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>>>>>>>>>> >> >>> > (I've made all my modifications trackable)
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > There are a couple of high level changes I made:
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > 1. I've consulted a board member and he recommended lazy consensus as
>>>>>>>>>>> >> >>> > opposed to voting. The reason being in voting there can easily be a "loser"
>>>>>>>>>>> >> >>> > that gets outvoted.
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional design
>>>>>>>>>>> >> >>> > sketch". Echoing one of the earlier emails: "IMHO so far aside from tagging
>>>>>>>>>>> >> >>> > things and linking them elsewhere simply having design docs and prototypes
>>>>>>>>>>> >> >>> > implementations in PRs is not something that has not worked so far".
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > 3. I made some language tweaks to focus more on visibility. For example,
>>>>>>>>>>> >> >>> > "The purpose of an SIP is to inform and involve", rather than just
>>>>>>>>>>> >> >>> > "involve". SIPs should also have at least two emails that go to dev@.
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > While I was editing this, I thought we really needed a suggested template
>>>>>>>>>>> >> >>> > for design doc too. I will get to that too ...
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>>>> >> >>> >>
>>>>>>>>>>> >> >>> >> Most things looked OK to me too, although I do plan to take a closer look
>>>>>>>>>>> >> >>> >> after Nov 1st when we cut the release branch for 2.1.
>>>>>>>>>>> >> >>> >>
>>>>>>>>>>> >> >>> >>
>>>>>>>>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>>>>>>>>>> >> >>> >>>
>>>>>>>>>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's not explicitly
>>>>>>>>>>> >> >>> >>> called out, that voting would happen by e-mail? A template for the
>>>>>>>>>>> >> >>> >>> proposal document (instead of just a bullet list) would also be nice,
>>>>>>>>>>> >> >>> >>> but that can be done at any time.
>>>>>>>>>>> >> >>> >>>
>>>>>>>>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
>>>>>>>>>>> >> >>> >>> for a SIP, given the scope of the work. The document attached even
>>>>>>>>>>> >> >>> >>> somewhat matches the proposed format. So if anyone wants to try out
>>>>>>>>>>> >> >>> >>> the process...
>>>>>>>>>>> >> >>> >>>
>>>>>>>>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>>>>>>> >> >>> >>> > Now that spark summit europe is over, are any committers interested in
>>>>>>>>>>> >> >>> >>> > moving forward with this?
>>>>>>>>>>> >> >>> >>> >
>>>>>>>>>>> >> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>>>>>>> >> >>> >>> >
>>>>>>>>>>> >> >>> >>> > Or are we going to let this discussion die on the vine?
>>>>>>>>>>> >> >>> >>> >
>>>>>>>>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>>>>>>>>>> >> >>> >>> > <tomasz.gaw...@outlook.com> wrote:
>>>>>>>>>>> >> >>> >>> >> Maybe my mail was not clear enough.
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> I didn't want to write "let's focus on Flink" or any other framework. The
>>>>>>>>>>> >> >>> >>> >> idea with benchmarks was to show two things:
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> - why some people are doing bad PR for Spark
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> - how - in an easy way - we can change it and show that Spark is still on
>>>>>>>>>>> >> >>> >>> >> top
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think they're the
>>>>>>>>>>> >> >>> >>> >> most important thing in Spark :) On the Spark main page there is still the
>>>>>>>>>>> >> >>> >>> >> chart "Spark vs Hadoop". It is important to show that the framework is not
>>>>>>>>>>> >> >>> >>> >> the same Spark with another API, but much faster and optimized, comparable
>>>>>>>>>>> >> >>> >>> >> to or even faster than other frameworks.
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> About real-time streaming, I think it would just be good to see it in Spark.
>>>>>>>>>>> >> >>> >>> >> I really like the current Spark model, but there are many voices that say
>>>>>>>>>>> >> >>> >>> >> "we need more" - the community should also listen to them and try to help
>>>>>>>>>>> >> >>> >>> >> them. With SIPs it would be easier; I've just posted this example as a
>>>>>>>>>>> >> >>> >>> >> "thing that may be changed with SIP".
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> I really like unification via Datasets, but there are a lot of algorithms
>>>>>>>>>>> >> >>> >>> >> inside - let's make an easy API, but with strong background (articles,
>>>>>>>>>>> >> >>> >>> >> benchmarks, descriptions, etc) that shows that Spark is still a modern
>>>>>>>>>>> >> >>> >>> >> framework.
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said, organizational ideas
>>>>>>>>>>> >> >>> >>> >> were already mentioned and I agree with them; my mail was just to show some
>>>>>>>>>>> >> >>> >>> >> aspects from my side, that is, from the side of a developer and a person who
>>>>>>>>>>> >> >>> >>> >> is trying to help others with Spark (via StackOverflow or other ways)
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> Best regards,
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> Tomasz
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> ________________________________
>>>>>>>>>>> >> >>> >>> >> From: Cody Koeninger <c...@koeninger.org>
>>>>>>>>>>> >> >>> >>> >> Sent: 17 October 2016 16:46
>>>>>>>>>>> >> >>> >>> >> To: Debasish Das
>>>>>>>>>>> >> >>> >>> >> Cc: Tomasz Gawęda; dev@spark.apache.org
>>>>>>>>>>> >> >>> >>> >> Subject: Re: Spark Improvement Proposals
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my point.
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> My point is evolve or die.  Spark's governance and organization is
>>>>>>>>>>> >> >>> >>> >> hampering its ability to evolve technologically, and it needs to change.
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>>>>>>>>>> >> >>> >>> >> <debasish.da...@gmail.com> wrote:
>>>>>>>>>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I picked up Spark in 2014 as
>>>>>>>>>>> >> >>> >>> >>> soon as I looked into it since compared to writing Java map-reduce and
>>>>>>>>>>> >> >>> >>> >>> Cascading code, Spark made writing distributed code fun...But now as we went
>>>>>>>>>>> >> >>> >>> >>> deeper with Spark and real-time streaming use-case gets more prominent, I
>>>>>>>>>>> >> >>> >>> >>> think it is time to bring a messaging model in conjunction with the
>>>>>>>>>>> >> >>> >>> >>> batch/micro-batch API that Spark is good at....akka-streams close
>>>>>>>>>>> >> >>> >>> >>> integration with spark micro-batching APIs looks like a great direction to
>>>>>>>>>>> >> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0 integrated streaming with
>>>>>>>>>>> >> >>> >>> >>> batch with the assumption that micro-batching is sufficient to run SQL
>>>>>>>>>>> >> >>> >>> >>> commands on stream but do we really have time to do SQL processing at
>>>>>>>>>>> >> >>> >>> >>> streaming data within 1-2 seconds ?
>>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>>> >> >>> >>> >>> After reading the email chain, I started to look into Flink documentation
>>>>>>>>>>> >> >>> >>> >>> and if you compare it with Spark documentation, I think we have major work
>>>>>>>>>>> >> >>> >>> >>> to do detailing out Spark internals so that more people from community start
>>>>>>>>>>> >> >>> >>> >>> to take active role in improving the issues so that Spark stays strong
>>>>>>>>>>> >> >>> >>> >>> compared to Flink.
>>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>>> >> >>> >>> >>> Spark is no longer an engine that works for micro-batch and batch...We (and
>>>>>>>>>>> >> >>> >>> >>> I am sure many others) are pushing spark as an engine for stream and query
>>>>>>>>>>> >> >>> >>> >>> processing.....we need to make it a state-of-the-art engine for high speed
>>>>>>>>>>> >> >>> >>> >>> streaming data and user queries as well !
>>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>>>>>>>>>>> >> >>> >>> >>> <tomasz.gaw...@outlook.com> wrote:
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> Hi everyone,
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions may help a
>>>>>>>>>>> >> >>> >>> >>>> little bit. :) Many technical and organizational topics were mentioned,
>>>>>>>>>>> >> >>> >>> >>>> but I want to focus on these negative posts about Spark and about "haters"
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> I really like Spark. Ease of use, speed, very good community - it's
>>>>>>>>>>> >> >>> >>> >>>> everything here. But every project has to "fight" on the "framework market"
>>>>>>>>>>> >> >>> >>> >>>> to be still no 1. I'm following many Spark and Big Data communities,
>>>>>>>>>>> >> >>> >>> >>>> maybe my mail will inspire someone :)
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have enough time to join
>>>>>>>>>>> >> >>> >>> >>>> contributing to Spark) have done an excellent job. So why are some people
>>>>>>>>>>> >> >>> >>> >>>> saying that Flink (or another framework) is better, like it was posted in
>>>>>>>>>>> >> >>> >>> >>>> this mailing list? No, not because that framework is better in all
>>>>>>>>>>> >> >>> >>> >>>> cases. In my opinion, many of these discussions were started after
>>>>>>>>>>> >> >>> >>> >>>> Flink marketing-like posts. Please look at StackOverflow "Flink vs ...."
>>>>>>>>>>> >> >>> >>> >>>> posts, almost every post is "won" by Flink. Answers are sometimes
>>>>>>>>>>> >> >>> >>> >>>> saying nothing about other frameworks, Flink's users (often PMC's) are
>>>>>>>>>>> >> >>> >>> >>>> just posting the same information about real-time streaming, about delta
>>>>>>>>>>> >> >>> >>> >>>> iterations, etc. It looks smart and very often it is marked as an answer,
>>>>>>>>>>> >> >>> >>> >>>> even if - in my opinion - the whole truth wasn't told.
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> My suggestion: I don't have enough money and knowledge
>>>>>>>>>>> >> >>> >>> >>>> to perform a huge performance test. Maybe some company
>>>>>>>>>>> >> >>> >>> >>>> that supports Spark (Databricks, Cloudera? - just
>>>>>>>>>>> >> >>> >>> >>>> saying you're the most visible in the community :) )
>>>>>>>>>>> >> >>> >>> >>>> could perform performance tests of:
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> - streaming engine - Spark will probably lose because
>>>>>>>>>>> >> >>> >>> >>>>   of the mini-batch model, however the difference
>>>>>>>>>>> >> >>> >>> >>>>   should now be much lower than in previous versions
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> - Machine Learning models
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> - batch jobs
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> - Graph jobs
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> - SQL queries
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> People will see that Spark is evolving and is also a
>>>>>>>>>>> >> >>> >>> >>>> modern framework, because after reading the posts
>>>>>>>>>>> >> >>> >>> >>>> mentioned above people may think "it is outdated, the
>>>>>>>>>>> >> >>> >>> >>>> future is in framework X".
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> Matei Zaharia posted an excellent blog post about how
>>>>>>>>>>> >> >>> >>> >>>> Spark Structured Streaming beats every other framework
>>>>>>>>>>> >> >>> >>> >>>> in terms of ease of use and reliability. Performance
>>>>>>>>>>> >> >>> >>> >>>> tests, done in various environments (for example:
>>>>>>>>>>> >> >>> >>> >>>> laptop, small 2-node cluster, 10-node cluster, 20-node
>>>>>>>>>>> >> >>> >>> >>>> cluster), could also be very good marketing material
>>>>>>>>>>> >> >>> >>> >>>> to say "hey, you're saying that you're better, but
>>>>>>>>>>> >> >>> >>> >>>> Spark is still faster and is still getting even
>>>>>>>>>>> >> >>> >>> >>>> faster!". This would be based on facts (just numbers),
>>>>>>>>>>> >> >>> >>> >>>> not opinions. It would be good for companies, for
>>>>>>>>>>> >> >>> >>> >>>> marketing purposes, and for every Spark developer.
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> Second: real-time streaming. I wrote some time ago
>>>>>>>>>>> >> >>> >>> >>>> about real-time streaming support in Spark Structured
>>>>>>>>>>> >> >>> >>> >>>> Streaming. Some work should be done to make SSS
>>>>>>>>>>> >> >>> >>> >>>> lower-latency, but I think it's possible. Maybe Spark
>>>>>>>>>>> >> >>> >>> >>>> could look at Gearpump, which is also built on top of
>>>>>>>>>>> >> >>> >>> >>>> Akka? I don't know yet; it is a good topic for a SIP.
>>>>>>>>>>> >> >>> >>> >>>> However, I think that Spark should have real-time
>>>>>>>>>>> >> >>> >>> >>>> streaming support. Currently I see many posts/comments
>>>>>>>>>>> >> >>> >>> >>>> saying that "Spark has too high latency". Spark
>>>>>>>>>>> >> >>> >>> >>>> Streaming is doing a very good job with micro-batches,
>>>>>>>>>>> >> >>> >>> >>>> however I think it is possible to also add more
>>>>>>>>>>> >> >>> >>> >>>> real-time processing.
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> Other people have said much more, and I agree with the
>>>>>>>>>>> >> >>> >>> >>>> SIP proposal. I'm also happy that the PMC members are
>>>>>>>>>>> >> >>> >>> >>>> not saying that they will not listen to users, but
>>>>>>>>>>> >> >>> >>> >>>> that they really want to make Spark better for every
>>>>>>>>>>> >> >>> >>> >>>> user.
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> What do you think about these two topics? I'm
>>>>>>>>>>> >> >>> >>> >>>> especially looking at Cody (who started this topic)
>>>>>>>>>>> >> >>> >>> >>>> and the PMC members :)
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> Pozdrawiam / Best regards,
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> Tomasz
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>>
>>>>>>>>>>> >> >>> >>
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >
>>>>>>>>>>> >> >
>>>>>>>>>>> >>
>>>>>>>>>>> >> ---------------------------------------------------------------------
>>>>>>>>>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>>>> >>
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > --
>>>>>>>>>>> > Ryan Blue
>>>>>>>>>>> > Software Engineer
>>>>>>>>>>> > Netflix
>>>>>>>>>>>
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> Joseph Bradley
>>>>>>>>>>
>>>>>>>>>> Software Engineer - Machine Learning
>>>>>>>>>>
>>>>>>>>>> Databricks, Inc.
>>>>>>>>>>
>>>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Ryan Blue
Software Engineer
Netflix
