I don't think it's just about what to target - if you could target 1 ms batches without harming 1-second or 1-minute batches... why wouldn't you? I think it's about having a clear strategy and dedicating resources to it. If scheduling batches at an order of magnitude or two lower latency is the strategy, and that's actually feasible, that's great. But I haven't seen that clear direction, and this is by no means a recent issue.
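To make the batch-interval trade-off concrete: with a fixed per-batch scheduling cost, short batch intervals are dominated by scheduling while long ones barely notice it. A back-of-the-envelope sketch (the 100 ms overhead figure is an assumed number for illustration, not a measured Spark figure):

```python
def overhead_fraction(scheduling_latency_s: float, batch_interval_s: float) -> float:
    """Fraction of each batch cycle consumed by fixed scheduling overhead."""
    return scheduling_latency_s / (scheduling_latency_s + batch_interval_s)

# Assume ~100 ms of fixed per-batch scheduling overhead:
# 1-minute batches barely notice it, 1-second batches lose ~9% of
# each cycle, and 1 ms batches are almost entirely scheduling.
for interval in (60.0, 1.0, 0.001):
    print(f"{interval:>8} s batches: {overhead_fraction(0.1, interval):.1%} overhead")
```

This is why 1 ms batches require scheduling latency an order of magnitude or two lower than today's, while leaving 1-second and 1-minute batches unaffected.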
On Oct 19, 2016 7:36 PM, "Matei Zaharia" <matei.zaha...@gmail.com> wrote:

> I'm also curious whether there are concerns other than latency with the
> way stuff executes in Structured Streaming (now that the time steps don't
> have to act as triggers), as well as what latency people want for various
> apps.
>
> The stateful operator designs for streaming systems aren't inherently
> "better" than micro-batching -- they lose a lot of stuff that is possible
> in Spark, such as load balancing work dynamically across nodes, speculative
> execution for stragglers, scaling clusters up and down elastically, etc.
> Moreover, Spark itself could execute the current model with much lower
> latency. The question is just what combinations of latency, throughput,
> fault recovery, etc. to target.
>
> Matei
>
> On Oct 19, 2016, at 2:18 PM, Amit Sela <amitsel...@gmail.com> wrote:
>
>> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman
>> <shiva...@eecs.berkeley.edu> wrote:
>>
>>> At the AMPLab we've been working on a research project that looks at
>>> just the scheduling latencies and on techniques to get lower
>>> scheduling latency. It moves away from the micro-batch model, but
>>> reuses the fault tolerance etc. in Spark. However, we haven't yet
>>> figured out all the parts of integrating this with the rest of
>>> Structured Streaming. I'll try to post a design doc / SIP about this
>>> soon.
>>>
>>> On a related note - are there other problems users face with
>>> micro-batches other than latency?
>>
>> I think the fact that they serve as an output trigger is a problem,
>> but Structured Streaming seems to resolve this now.
>>
>>> Thanks
>>> Shivaram
>>>
>>> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
>>> <mich...@databricks.com> wrote:
>>>
>>>> I know people are seriously thinking about latency. So far that has
>>>> not been the limiting factor for the users I've been working with.
>>>>
>>>> On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> Is anyone seriously thinking about alternatives to micro-batches?
>>>>>
>>>>> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>>>>> <mich...@databricks.com> wrote:
>>>>>
>>>>>> Anything that is actively being designed should be in JIRA, and it
>>>>>> seems like you found most of it. In general, release windows can be
>>>>>> found on the wiki.
>>>>>>
>>>>>> 2.1 has a lot of stability fixes as well as the Kafka support you
>>>>>> mentioned. It may also include some of the following.
>>>>>>
>>>>>> The items I'd like to start thinking about next are:
>>>>>> - Evicting state from the store based on event-time watermarks
>>>>>> - Sessionization (grouping together related events by key / event time)
>>>>>> - Improvements to the query planner (removing some of the
>>>>>>   restrictions on what queries can be run)
>>>>>>
>>>>>> This is roughly in order based on what I've been hearing users hit
>>>>>> the most. Would love more feedback on what is blocking real use cases.
>>>>>>
>>>>>> On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor <ofir.ma...@equalum.io>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I hope this is the right forum.
>>>>>>> I am looking for some information on what to expect from
>>>>>>> Structured Streaming in its next releases, to help me choose when /
>>>>>>> where to start using it more seriously (or where to invest in
>>>>>>> workarounds and where to wait). I couldn't find a good place where
>>>>>>> such planning is discussed for 2.1 (like, for example, ML and
>>>>>>> SPARK-15581).
>>>>>>> I'm aware of the 2.0 documented limits
>>>>>>> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
>>>>>>> like no support for multiple aggregation levels, joins strictly to
>>>>>>> a static dataset (no SCD or stream-stream joins), and limited
>>>>>>> sources / sinks (e.g. no sink for interactive queries).
>>>>>>> I'm also aware of some changes that have landed in master, like the
>>>>>>> new Kafka 0.10 source (and its on-going improvements) in
>>>>>>> SPARK-15406, the metrics in SPARK-17731, and some improvements for
>>>>>>> the file source.
>>>>>>> If I remember correctly, the discussion on the Spark release
>>>>>>> cadence concluded with a preference for four-month cycles, with a
>>>>>>> code freeze likely pretty soon (end of October). So I believe the
>>>>>>> scope for 2.1 should already be quite clear to some, and that 2.2
>>>>>>> planning should be starting about now.
>>>>>>> Any visibility / sharing will be highly appreciated!
>>>>>>> thanks in advance,
>>>>>>>
>>>>>>> Ofir Manor
>>>>>>> Co-Founder & CTO | Equalum
>>>>>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
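The sessionization item on the roadmap above (grouping related events by key / event time) can be made concrete with a toy sketch. This is plain Python with a hypothetical `sessionize` helper illustrating gap-based session semantics - it is not a Spark API (Structured Streaming did not support sessionization at the time of this thread):

```python
from collections import defaultdict

def sessionize(events, gap):
    """Group (key, event_time) pairs into sessions per key: consecutive
    events within `gap` seconds of each other share a session."""
    by_key = defaultdict(list)
    for key, t in events:
        by_key[key].append(t)

    sessions = {}
    for key, times in by_key.items():
        times.sort()
        current, out = [times[0]], []
        for t in times[1:]:
            if t - current[-1] <= gap:
                current.append(t)      # within the gap: same session
            else:
                out.append(current)    # gap exceeded: close the session
                current = [t]
        out.append(current)
        sessions[key] = out
    return sessions

events = [("u1", 1.0), ("u1", 2.0), ("u1", 40.0), ("u2", 5.0)]
# With a 30 s gap, u1 splits into two sessions (38 s > 30 s); u2 has one.
print(sessionize(events, 30.0))
```

Note that doing this over an unbounded stream is exactly where the first roadmap item comes in: event-time watermarks bound how long per-key session state must be kept before it can be evicted.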