RE: StructuredStreaming status

assaf.mendelson Wed, 19 Oct 2016 22:36:58 -0700

There is one issue I was thinking of.
If I understand correctly, structured streaming basically groups by a time 
bucket for each sliding window (one bucket per step). My problem is that in 
some cases (e.g. distinct count and any other case where the buffer is 
relatively large) this would mean copying the buffer for each step. This can 
have a very large memory overhead.
There are other solutions for this. For example, let's say we would have 
implemented distinct count by saving a map with the key being the distinct 
value and the value being the last time we saw this value. This would mean that 
we wouldn't really need to save all the steps in the middle and copy the data, 
we could only save the last portion.
This is just an idea for optimization though, certainly nothing of high 
priority.
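
To make the idea concrete, here is a minimal sketch in plain Python (not Spark 
code, and not how Structured Streaming actually stores its state). All names 
(window_starts, PerWindowDistinct, LastSeenDistinct) are made up for 
illustration. It assumes integer timestamps, and the last-seen map only answers 
the count for the most recent window, since a later sighting overwrites the 
earlier timestamp.

```python
from collections import defaultdict


def window_starts(ts, window, step):
    """Start times of all sliding windows [start, start + window) that contain ts."""
    start = (ts // step) * step  # latest window start at or before ts
    starts = []
    while start > ts - window:
        starts.append(start)
        start -= step
    return starts


class PerWindowDistinct:
    """One buffer (set) per window: a value is stored once in every window it falls in."""

    def __init__(self, window, step):
        self.window = window
        self.step = step
        self.buffers = defaultdict(set)

    def add(self, value, ts):
        for start in window_starts(ts, self.window, self.step):
            self.buffers[start].add(value)

    def count(self, start):
        return len(self.buffers.get(start, ()))

    def stored_entries(self):
        return sum(len(values) for values in self.buffers.values())


class LastSeenDistinct:
    """A single map shared by all windows: distinct value -> last time it was seen."""

    def __init__(self):
        self.last_seen = {}

    def add(self, value, ts):
        previous = self.last_seen.get(value)
        if previous is None or ts > previous:
            self.last_seen[value] = ts

    def count_since(self, cutoff):
        # Distinct values seen at or after cutoff (the start of the latest window).
        return sum(1 for ts in self.last_seen.values() if ts >= cutoff)

    def evict_before(self, cutoff):
        # Values last seen before cutoff can never count again for later windows.
        self.last_seen = {v: ts for v, ts in self.last_seen.items() if ts >= cutoff}


events = [("a", 1), ("b", 3), ("a", 7), ("c", 12)]
per_window = PerWindowDistinct(window=10, step=5)
last_seen = LastSeenDistinct()
for value, ts in events:
    per_window.add(value, ts)
    last_seen.add(value, ts)

# Both approaches agree on the window [5, 15), but hold different amounts of state.
print(per_window.count(5), per_window.stored_entries())
print(last_seen.count_since(5), len(last_seen.last_seen))
```

With a window of 10 and a step of 5, every value lands in two window buffers, 
so the per-window state grows with window / step, while the map holds one entry 
per distinct value.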

From: Matei Zaharia [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n19513...@n3.nabble.com]
Sent: Thursday, October 20, 2016 3:42 AM
To: Mendelson, Assaf
Subject: Re: StructuredStreaming status

I'm also curious whether there are concerns other than latency with the way 
stuff executes in Structured Streaming (now that the time steps don't have to 
act as triggers), as well as what latency people want for various apps.

The stateful operator designs for streaming systems aren't inherently "better" 
than micro-batching -- they lose a lot of stuff that is possible in Spark, such 
as load balancing work dynamically across nodes, speculative execution for 
stragglers, scaling clusters up and down elastically, etc. Moreover, Spark 
itself could execute the current model with much lower latency. The question is 
just what combinations of latency, throughput, fault recovery, etc to target.

Matei

On Oct 19, 2016, at 2:18 PM, Amit Sela <[hidden email]> wrote:

On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman <[hidden email]> wrote:
At the AMPLab we've been working on a research project that looks at
just the scheduling latencies and on techniques to get lower
scheduling latency. It moves away from the micro-batch model, but
reuses the fault tolerance etc. in Spark. However we haven't yet
figured out all the parts in integrating this with the rest of
structured streaming. I'll try to post a design doc / SIP about this
soon.

On a related note - are there other problems users face with
micro-batch other than latency ?
I think that the fact that they serve as an output trigger is a problem, but 
Structured Streaming seems to resolve this now.

Thanks
Shivaram

On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
<[hidden email]> wrote:
> I know people are seriously thinking about latency.  So far that has not
> been the limiting factor in the users I've been working with.
>
> On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger <[hidden email]> wrote:
>>
>> Is anyone seriously thinking about alternatives to microbatches?
>>
>> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>> <[hidden email]> wrote:
>> > Anything that is actively being designed should be in JIRA, and it seems
>> > like you found most of it.  In general, release windows can be found on
>> > the
>> > wiki.
>> >
>> > 2.1 has a lot of stability fixes as well as the kafka support you
>> > mentioned.
>> > It may also include some of the following.
>> >
>> > The items I'd like to start thinking about next are:
>> >  - Evicting state from the store based on event time watermarks
>> >  - Sessionization (grouping together related events by key / eventTime)
>> >  - Improvements to the query planner (remove some of the restrictions on
>> > what queries can be run).
>> >
>> > This is roughly in order based on what I've been hearing users hit the
>> > most.
>> > Would love more feedback on what is blocking real use cases.
>> >
>> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor <[hidden email]> wrote:
>> >>
>> >> Hi,
>> >> I hope it is the right forum.
>> >> I am looking for some information on what to expect from
>> >> StructuredStreaming in its next releases to help me choose when / where to
>> >> start using it more seriously (or where to invest in workarounds and where
>> >> to wait). I couldn't find a good place where such planning is discussed
>> >> for 2.1 (like, for example, ML and SPARK-15581).
>> >> I'm aware of the 2.0 documented limits
>> >>
>> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
>> >> like no support for multiple aggregation levels, joins are strictly to a
>> >> static dataset (no SCD or stream-stream) etc., limited sources / sinks
>> >> (like no sink for interactive queries) etc.
>> >> I'm also aware of some changes that have landed in master, like the new
>> >> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
>> >> metrics in SPARK-17731, and some improvements for the file source.
>> >> If I remember correctly, the discussion on Spark release cadence concluded
>> >> with a preference for a four-month cycle, with a likely code freeze pretty
>> >> soon (end of October). So I believe the scope for 2.1 should likely be
>> >> quite clear to some, and that 2.2 planning should likely be starting about
>> >> now.
>> >> Any visibility / sharing will be highly appreciated!
>> >> thanks in advance,
>> >>
>> >> Ofir Manor
>> >>
>> >> Co-Founder & CTO | Equalum
>> >>
>> >> Mobile: +972-54-7801286 | Email: [hidden email]
>> >
>> >
>
>


