Yuval,
Not sure what is in scope to land in 2.0, but there is another new infrastructure piece for managing state more efficiently, called State Store, whose initial version is already committed:
SPARK-13809 - State Store: A new framework for state management for computing Streaming Aggregates
https://issues.apache.org/jira/browse/SPARK-13809
The pull request eventually links to the design doc, which discusses the limits of updateStateByKey and mapWithState and how they will be handled...
At a quick glance at the code, it seems to be used already in streaming aggregations.
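For example, I'd expect a simple streaming aggregation like the following to go through it. A very rough sketch - I haven't run it, it follows the API as it currently stands in the branch (names may still change before the release), and the path and schema below are made up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("StreamingAggDemo").getOrCreate()
import spark.implicits._

// File-based streaming sources need an explicit schema.
val schema = new StructType()
  .add("userId", StringType)
  .add("eventTime", TimestampType)

// Treat a directory of JSON files as an unbounded table.
val events = spark.readStream.schema(schema).json("/data/events")

// A stateful streaming aggregation: the running per-key counts are
// exactly the kind of state the State Store keeps between micro-batches.
val counts = events.groupBy($"userId").count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()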
Just my two cents,

Ofir Manor
Co-Founder & CTO | Equalum
Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Mon, May 16, 2016 at 11:33 AM, Yuval Itzchakov <yuva...@gmail.com> wrote:
> Also, re-reading the relevant part from the Structured Streaming
> documentation (
> https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.335my4b18x6x
> ):
>
> Discretized streams (aka dstream)
>
> Unlike Storm, dstream exposes a higher-level API similar to RDDs. There
> are two main challenges with dstream:
>
> 1. Similar to Storm, it exposes a monotonic system (processing) time
>    metric, and makes support for event time difficult.
> 2. Its APIs are tied to the underlying micro-batch execution model, and
>    as a result lead to inflexibilities; for example, changing the
>    underlying batch interval would require changing the window size.
>
> RQ addresses the above:
>
> 1. RQ operations support both system time and event time.
> 2. RQ APIs are decoupled from the underlying execution model. As a
>    matter of fact, it is possible to implement an alternative engine
>    for RQ that is not micro-batch-based.
> 3. In addition, due to the declarative specification of operations, RQ
>    leverages a relational query optimizer and can often generate more
>    efficient query plans.
>
> This doesn't seem to address the actual underlying implementation of
> how things like "mapWithState" are going to be translated into RQ, and
> I think that's the gap that's causing my misunderstanding.
>
> On Mon, May 16, 2016 at 1:36 AM Yuval Itzchakov <yuva...@gmail.com> wrote:
>
>> Hi Ofir,
>> Thanks for the detailed answer. I have read both documents, but they
>> only touch lightly on infinite DataFrames/Datasets. They do not go
>> into depth on how existing transformations on DStreams, for example,
>> will be translated into the Dataset APIs. I've been browsing the 2.0
>> branch and have not yet been able to understand how they correlate.
>>
>> Also, placing SparkSession in the sql package seems like a peculiar
>> choice, since this is going to be the global abstraction over
>> SparkContext/StreamingContext from now on.
>>
>> On Sun, May 15, 2016, 23:42 Ofir Manor <ofir.ma...@equalum.io> wrote:
>>
>>> Hi Yuval,
>>> let me share my understanding, based on similar questions I had.
>>> First, Spark 2.x aims to replace a whole bunch of its APIs with just
>>> two main ones - SparkSession (replacing Hive/SQL/Spark Context) and
>>> Dataset (a merge of Dataset and DataFrame, which is why it inherits
>>> all the Spark SQL goodness) - while RDD remains a low-level API for
>>> special cases only. The new Dataset should also support both batch
>>> and streaming, (eventually) replacing DStream as well. See the design
>>> docs in SPARK-13485 (unified API) and SPARK-8360 (Structured
>>> Streaming) for a good intro.
>>> However, as you noted, not all of it will be fully delivered in 2.0.
>>> For example, it seems that streaming from / to Kafka using Structured
>>> Streaming didn't make it (so far?) into 2.0, which is a showstopper
>>> for me.
>>> Anyway, as far as I understand, you should be able to apply stateful
>>> operators (non-RDD) on Datasets - for example, the new event-time
>>> window processing from SPARK-8360. The gap I see is mostly the
>>> limited set of streaming sources / sinks migrated to the new (richer)
>>> API and semantics.
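>>> As a concrete sketch of what that might look like, based on my
>>> reading of the design doc - untested, assuming a streaming DataFrame
>>> "events" with an eventTime timestamp column, and keeping in mind the
>>> names may still change before the release:
>>>
>>> import org.apache.spark.sql.functions.{col, window}
>>>
>>> // Aggregate over 10-minute windows of event time (the timestamp
>>> // embedded in the data), not over processing/arrival time.
>>> val windowedCounts = events
>>>   .groupBy(window(col("eventTime"), "10 minutes"), col("userId"))
>>>   .count()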
>>> Anyway, I'm pretty sure that once 2.0 gets to RC, the documentation
>>> and examples will align with the current offering...
>>>
>>> Ofir Manor
>>> Co-Founder & CTO | Equalum
>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>
>>> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov <yuva...@gmail.com>
>>> wrote:
>>>
>>>> I've been reading/watching videos about the upcoming Spark 2.0
>>>> release, which brings us Structured Streaming. One thing I've yet to
>>>> understand is how this relates to the current way of working with
>>>> streaming in Spark, with the DStream abstraction.
>>>>
>>>> In all the examples I can find, in the Spark repository and in
>>>> various videos, someone is streaming local JSON files or reading
>>>> from HDFS/S3/SQL. Also, when browsing the source, SparkSession seems
>>>> to be defined inside org.apache.spark.sql, so this gives me a hunch
>>>> that this is somehow all related to SQL and the like, and not really
>>>> to DStreams.
>>>>
>>>> What I'm failing to understand is: will this feature impact how we
>>>> do streaming today? Will I be able to consume a Kafka source in a
>>>> streaming fashion (like we do today when we open a stream using
>>>> KafkaUtils)? Will we be able to do stateful operations on a
>>>> Dataset[T] like we do today using MapWithStateRDD? Or will there
>>>> only be a subset of operations that the Catalyst optimizer can
>>>> understand, such as aggregations and the like?
>>>>
>>>> I'd be happy if anyone could shed some light on this.
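>>>>
>>>> P.S. For concreteness, the kind of code I mean by "today" is roughly
>>>> the following (a simplified sketch; the Kafka host, topic and logic
>>>> are made up):
>>>>
>>>> import org.apache.spark.SparkConf
>>>> import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}
>>>> import org.apache.spark.streaming.kafka.KafkaUtils
>>>>
>>>> val ssc = new StreamingContext(new SparkConf(), Seconds(10))
>>>> ssc.checkpoint("/tmp/checkpoint") // required for mapWithState
>>>>
>>>> // Open a Kafka stream the way we do it today.
>>>> val words = KafkaUtils
>>>>   .createStream(ssc, "zkHost:2181", "myGroup", Map("events" -> 1))
>>>>   .map(_._2)
>>>>
>>>> // Keep a running count per key across batches.
>>>> val spec = StateSpec.function(
>>>>   (word: String, one: Option[Int], state: State[Long]) => {
>>>>     val sum = state.getOption.getOrElse(0L) + one.getOrElse(0)
>>>>     state.update(sum)
>>>>     (word, sum)
>>>>   })
>>>>
>>>> words.map((_, 1)).mapWithState(spec).print()
>>>>
>>>> ssc.start()
>>>> ssc.awaitTermination()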