On Thu, Oct 20, 2016 at 7:40 AM Matei Zaharia <matei.zaha...@gmail.com> wrote:
> Yeah, as Shivaram pointed out, there have been research projects that
> looked at it. Also, Structured Streaming was explicitly designed to not
> make microbatching part of the API or part of the output behavior (tying
> triggers to it).
>
But Streaming Query sources
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala#L41>
are still designed with microbatches in mind. Could this be removed, leaving
offset tracking to the executors?

> However, when people begin working on that is a function of demand
> relative to other features. I don't think we can commit to one plan before
> exploring more options, but basically there is Shivaram's project, which
> adds a few new concepts to the scheduler, and there's the option to reduce
> control plane latency in the current system, which hasn't been heavily
> optimized yet but should be doable (lots of systems can handle 10,000s of
> RPCs per second).
>
> Matei
>
> On Oct 19, 2016, at 9:20 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
> I don't think it's just about what to target - if you could target 1ms
> batches, without harming 1 second or 1 minute batches.... why wouldn't you?
> I think it's about having a clear strategy and dedicating resources to it.
> If scheduling batches at an order of magnitude or two lower latency is the
> strategy, and that's actually feasible, that's great. But I haven't seen
> that clear direction, and this is by no means a recent issue.
>
> On Oct 19, 2016 7:36 PM, "Matei Zaharia" <matei.zaha...@gmail.com> wrote:
>
> I'm also curious whether there are concerns other than latency with the
> way stuff executes in Structured Streaming (now that the time steps don't
> have to act as triggers), as well as what latency people want for various
> apps.
>
> The stateful operator designs for streaming systems aren't inherently
> "better" than micro-batching -- they lose a lot of stuff that is possible
> in Spark, such as load balancing work dynamically across nodes, speculative
> execution for stragglers, scaling clusters up and down elastically, etc.
> Moreover, Spark itself could execute the current model with much lower
> latency. The question is just what combinations of latency, throughput,
> fault recovery, etc to target.
>
> Matei
>
> On Oct 19, 2016, at 2:18 PM, Amit Sela <amitsel...@gmail.com> wrote:
>
> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
> At the AMPLab we've been working on a research project that looks at
> just the scheduling latencies and on techniques to get lower
> scheduling latency. It moves away from the micro-batch model, but
> reuses the fault tolerance etc. in Spark. However we haven't yet
> figured out all the parts in integrating this with the rest of
> structured streaming. I'll try to post a design doc / SIP about this
> soon.
>
> On a related note - are there other problems users face with
> micro-batch other than latency?
>
> I think that the fact that they serve as an output trigger is a problem,
> but Structured Streaming seems to resolve this now.
>
> Thanks
> Shivaram
>
> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
> <mich...@databricks.com> wrote:
> > I know people are seriously thinking about latency. So far that has not
> > been the limiting factor for the users I've been working with.
> >
> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger <c...@koeninger.org> wrote:
> >>
> >> Is anyone seriously thinking about alternatives to microbatches?
> >>
> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
> >> <mich...@databricks.com> wrote:
> >> > Anything that is actively being designed should be in JIRA, and it seems
> >> > like you found most of it. In general, release windows can be found on
> >> > the wiki.
> >> >
> >> > 2.1 has a lot of stability fixes as well as the Kafka support you
> >> > mentioned.
> >> > It may also include some of the following.
> >> >
> >> > The items I'd like to start thinking about next are:
> >> > - Evicting state from the store based on event time watermarks
> >> > - Sessionization (grouping together related events by key / eventTime)
> >> > - Improvements to the query planner (remove some of the restrictions on
> >> > what queries can be run).
> >> >
> >> > This is roughly in order based on what I've been hearing users hit the
> >> > most.
> >> > Would love more feedback on what is blocking real use cases.
> >> >
> >> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor <ofir.ma...@equalum.io>
> >> > wrote:
> >> >>
> >> >> Hi,
> >> >> I hope it is the right forum.
> >> >> I am looking for some information on what to expect from
> >> >> StructuredStreaming in its next releases to help me choose when / where
> >> >> to start using it more seriously (or where to invest in workarounds and
> >> >> where to wait). I couldn't find a good place where such planning is
> >> >> discussed for 2.1 (like, for example ML and SPARK-15581).
> >> >> I'm aware of the 2.0 documented limits
> >> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
> >> >> like no support for multiple aggregation levels, joins are strictly to a
> >> >> static dataset (no SCD or stream-stream) etc, limited sources / sinks
> >> >> (like no sink for interactive queries) etc etc
> >> >> I'm also aware of some changes that have landed in master, like the new
> >> >> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
> >> >> metrics in SPARK-17731, and some improvements for the file source.
> >> >> If I remember correctly, the discussion on Spark release cadence
> >> >> concluded with a preference for four-month cycles, with a likely code
> >> >> freeze pretty soon (end of October). So I believe the scope for 2.1
> >> >> should likely be quite clear to some, and that 2.2 planning should
> >> >> likely be starting about now.
> >> >> Any visibility / sharing will be highly appreciated!
> >> >> Thanks in advance,
> >> >>
> >> >> Ofir Manor
> >> >>
> >> >> Co-Founder & CTO | Equalum
> >> >>
> >> >> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io