I don't think it's just about what to target - if you could target 1 ms batches without harming 1-second or 1-minute batches... why wouldn't you? I think it's about having a clear strategy and dedicating resources to it. If scheduling batches at an order of magnitude or two lower latency is the strategy, and that's actually feasible, that's great. But I haven't seen that clear direction, and this is by no means a recent issue.
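To make the batch-interval trade-off concrete: with a fixed per-batch scheduling cost, short batch intervals are dominated by scheduling while long ones barely notice it. A back-of-the-envelope sketch (the 100 ms overhead figure is an assumed number for illustration, not a measured Spark figure):

```python
def overhead_fraction(scheduling_latency_s: float, batch_interval_s: float) -> float:
    """Fraction of each batch cycle consumed by fixed scheduling overhead."""
    return scheduling_latency_s / (scheduling_latency_s + batch_interval_s)

# Assume ~100 ms of fixed per-batch scheduling overhead:
# 1-minute batches barely notice it, 1-second batches lose ~9% of
# each cycle, and 1 ms batches are almost entirely scheduling.
for interval in (60.0, 1.0, 0.001):
    print(f"{interval:>8} s batches: {overhead_fraction(0.1, interval):.1%} overhead")
```

This is why 1 ms batches require scheduling latency an order of magnitude or two lower than today's, while leaving 1-second and 1-minute batches unaffected.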
On Oct 19, 2016 7:36 PM, "Matei Zaharia" <matei.zaha...@gmail.com> wrote:

> I'm also curious whether there are concerns other than latency with the
> way stuff executes in Structured Streaming (now that the time steps don't
> have to act as triggers), as well as what latency people want for various
> apps.
>
> The stateful operator designs for streaming systems aren't inherently
> "better" than micro-batching -- they lose a lot of stuff that is possible
> in Spark, such as load balancing work dynamically across nodes, speculative
> execution for stragglers, scaling clusters up and down elastically, etc.
> Moreover, Spark itself could execute the current model with much lower
> latency. The question is just what combinations of latency, throughput,
> fault recovery, etc. to target.
>
> Matei
>
> On Oct 19, 2016, at 2:18 PM, Amit Sela <amitsel...@gmail.com> wrote:
>
>> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman
>> <shiva...@eecs.berkeley.edu> wrote:
>>
>>> At the AMPLab we've been working on a research project that looks at
>>> just the scheduling latencies and on techniques to get lower
>>> scheduling latency. It moves away from the micro-batch model, but
>>> reuses the fault tolerance etc. in Spark. However, we haven't yet
>>> figured out all the parts of integrating this with the rest of
>>> Structured Streaming. I'll try to post a design doc / SIP about this
>>> soon.
>>>
>>> On a related note - are there other problems users face with
>>> micro-batches other than latency?
>>
>> I think the fact that they serve as an output trigger is a problem,
>> but Structured Streaming seems to resolve this now.
>>
>>> Thanks
>>> Shivaram
>>>
>>> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
>>> <mich...@databricks.com> wrote:
>>>
>>>> I know people are seriously thinking about latency. So far that has
>>>> not been the limiting factor for the users I've been working with.
>>>>
>>>> On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> Is anyone seriously thinking about alternatives to micro-batches?
>>>>>
>>>>> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>>>>> <mich...@databricks.com> wrote:
>>>>>
>>>>>> Anything that is actively being designed should be in JIRA, and it
>>>>>> seems like you found most of it. In general, release windows can be
>>>>>> found on the wiki.
>>>>>>
>>>>>> 2.1 has a lot of stability fixes as well as the Kafka support you
>>>>>> mentioned. It may also include some of the following.
>>>>>>
>>>>>> The items I'd like to start thinking about next are:
>>>>>> - Evicting state from the store based on event-time watermarks
>>>>>> - Sessionization (grouping together related events by key / event time)
>>>>>> - Improvements to the query planner (removing some of the
>>>>>>   restrictions on what queries can be run)
>>>>>>
>>>>>> This is roughly in order based on what I've been hearing users hit
>>>>>> the most. Would love more feedback on what is blocking real use cases.
>>>>>>
>>>>>> On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor <ofir.ma...@equalum.io>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I hope this is the right forum.
>>>>>>> I am looking for some information on what to expect from
>>>>>>> Structured Streaming in its next releases, to help me choose when /
>>>>>>> where to start using it more seriously (or where to invest in
>>>>>>> workarounds and where to wait). I couldn't find a good place where
>>>>>>> such planning is discussed for 2.1 (like, for example, ML and
>>>>>>> SPARK-15581).
>>>>>>> I'm aware of the 2.0 documented limits
>>>>>>> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
>>>>>>> like no support for multiple aggregation levels, joins strictly to
>>>>>>> a static dataset (no SCD or stream-stream joins), and limited
>>>>>>> sources / sinks (e.g. no sink for interactive queries).
>>>>>>> I'm also aware of some changes that have landed in master, like the
>>>>>>> new Kafka 0.10 source (and its on-going improvements) in
>>>>>>> SPARK-15406, the metrics in SPARK-17731, and some improvements for
>>>>>>> the file source.
>>>>>>> If I remember correctly, the discussion on the Spark release
>>>>>>> cadence concluded with a preference for four-month cycles, with a
>>>>>>> code freeze likely pretty soon (end of October). So I believe the
>>>>>>> scope for 2.1 should already be quite clear to some, and that 2.2
>>>>>>> planning should be starting about now.
>>>>>>> Any visibility / sharing will be highly appreciated!
>>>>>>> thanks in advance,
>>>>>>>
>>>>>>> Ofir Manor
>>>>>>> Co-Founder & CTO | Equalum
>>>>>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
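The sessionization item on the roadmap above (grouping related events by key / event time) can be made concrete with a toy sketch. This is plain Python with a hypothetical `sessionize` helper illustrating gap-based session semantics - it is not a Spark API (Structured Streaming did not support sessionization at the time of this thread):

```python
from collections import defaultdict

def sessionize(events, gap):
    """Group (key, event_time) pairs into sessions per key: consecutive
    events within `gap` seconds of each other share a session."""
    by_key = defaultdict(list)
    for key, t in events:
        by_key[key].append(t)

    sessions = {}
    for key, times in by_key.items():
        times.sort()
        current, out = [times[0]], []
        for t in times[1:]:
            if t - current[-1] <= gap:
                current.append(t)      # within the gap: same session
            else:
                out.append(current)    # gap exceeded: close the session
                current = [t]
        out.append(current)
        sessions[key] = out
    return sessions

events = [("u1", 1.0), ("u1", 2.0), ("u1", 40.0), ("u2", 5.0)]
# With a 30 s gap, u1 splits into two sessions (38 s > 30 s); u2 has one.
print(sessionize(events, 30.0))
```

Note that doing this over an unbounded stream is exactly where the first roadmap item comes in: event-time watermarks bound how long per-key session state must be kept before it can be evicted.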