Yuval,
I'm not sure what is in scope to land in 2.0, but there is another new
infrastructure piece for managing state more efficiently, called State
Store, whose initial version is already committed:
   SPARK-13809 - State Store: A new framework for state management for
computing Streaming Aggregates
https://issues.apache.org/jira/browse/SPARK-13809
The pull request eventually links to the design doc, which discusses the
limitations of updateStateByKey and mapWithState and how those will be
handled...

From a quick glance at the code, it seems it is already used in streaming
aggregations.
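
For example, a streaming aggregation like the rough sketch below would
keep its running per-key counts in the State Store between batches. This
is only my reading of the current master - the paths and names are
placeholders, and the API may still change before 2.0:

   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder()
     .appName("state-store-sketch")
     .getOrCreate()

   // An unbounded input: each new JSON file under the path is new data.
   val events = spark.readStream.json("/path/to/events")

   // A stateful aggregate - the running per-key counts are exactly the
   // kind of state the new State Store checkpoints between batches.
   val counts = events.groupBy("key").count()

   counts.writeStream
     .outputMode("complete")  // re-emit the full updated counts each batch
     .format("console")
     .start()
     .awaitTermination()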

Just my two cents,

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Mon, May 16, 2016 at 11:33 AM, Yuval Itzchakov <yuva...@gmail.com> wrote:

> Also, re-reading the relevant part from the Structured Streaming
> documentation (
> https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.335my4b18x6x
> ):
> Discretized streams (aka dstream)
>
> Unlike Storm, dstream exposes a higher-level API similar to RDDs. There
> are two main challenges with dstream:
>
>    1. Similar to Storm, it exposes a monotonic system (processing) time
>    metric, which makes support for event time difficult.
>    2. Its APIs are tied to the underlying micro-batch execution model; as
>    a result, inflexibilities arise (for example, changing the underlying
>    batch interval would require changing the window size).
>
> RQ addresses the above:
>
>    1. RQ operations support both system time and event time (see the
>    sketch after this list).
>    2. RQ APIs are decoupled from the underlying execution model. In fact,
>    it is possible to implement an alternative engine for RQ that is not
>    micro-batch-based.
>    3. In addition, due to the declarative specification of operations, RQ
>    leverages a relational query optimizer and can often generate more
>    efficient query plans.
>
>
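> As a concrete illustration of the first point, my understanding is that
> an event-time window in the new API would look roughly like this sketch
> (not a confirmed API - "events" is assumed to be a streaming Dataset
> with an event-time column named "timestamp" and a "key" column):
>
>    import org.apache.spark.sql.functions.{col, window}
>
>    // Counts per 10-minute event-time window. The window is defined on
>    // the data's own timestamps, not on processing time, so changing
>    // the batch interval would not change the window size.
>    val windowed = events
>      .groupBy(window(col("timestamp"), "10 minutes"), col("key"))
>      .count()
>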
> This doesn't seem to address the actual underlying implementation of how
> things like "mapWithState" are going to be translated into RQ, and I
> think that's the gap that's causing my misunderstanding.
>
> On Mon, May 16, 2016 at 1:36 AM Yuval Itzchakov <yuva...@gmail.com> wrote:
>
>> Hi Ofir,
>> Thanks for the detailed answer. I have read both documents, but they
>> only touch lightly on infinite DataFrames/Datasets. They do not go into
>> depth on how existing transformations on DStreams, for example, will be
>> translated into the Dataset APIs. I've been browsing the 2.0 branch and
>> have not yet been able to understand how they correlate.
>>
>> Also, placing SparkSession in the sql package seems like a peculiar
>> choice, since this is going to be the global abstraction over
>> SparkContext/StreamingContext from now on.
>>
>> On Sun, May 15, 2016, 23:42 Ofir Manor <ofir.ma...@equalum.io> wrote:
>>
>>> Hi Yuval,
>>> let me share my understanding based on similar questions I had.
>>> First, Spark 2.x aims to replace a whole bunch of its APIs with just
>>> two main ones - SparkSession (replacing Hive/SQL/Spark Context) and
>>> Dataset (a merge of Dataset and DataFrame - which is why it inherits
>>> all the Spark SQL goodness) - while RDD remains a low-level API for
>>> special cases only. The new Dataset should also support both batch and
>>> streaming, eventually replacing DStream as well. See the design docs in
>>> SPARK-13485 (unified API) and SPARK-8360 (Structured Streaming) for a
>>> good intro.
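>>>
>>> To make that concrete, here is a minimal sketch of the unified API as
>>> I read it from the current branch - the entry point usage, the paths
>>> and the Event schema are all made up for illustration:
>>>
>>>    import org.apache.spark.sql.SparkSession
>>>
>>>    case class Event(key: String, value: Double)
>>>
>>>    val spark = SparkSession.builder().appName("unified").getOrCreate()
>>>    import spark.implicits._
>>>
>>>    // One typed abstraction over bounded and unbounded data:
>>>    val batch  = spark.read.json("/data/history").as[Event]
>>>    val stream = spark.readStream.json("/data/incoming").as[Event]
>>>
>>>    // The same Dataset operations apply to either one.
>>>    val perKey = stream.groupBy($"key").avg("value")
>>>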
>>> However, as you noted, not all of this will be fully delivered in 2.0.
>>> For example, it seems that streaming from / to Kafka using Structured
>>> Streaming didn't make it (so far?) into 2.0 (which is a showstopper for
>>> me). Anyway, as far as I understand, you should be able to apply
>>> stateful operators (non-RDD) on Datasets (for example, the new
>>> event-time window processing from SPARK-8360). The gap I see is mostly
>>> the limited set of streaming sources / sinks migrated to the new
>>> (richer) API and semantics.
>>> Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and
>>> examples will align with the current offering...
>>>
>>>
>>> Ofir Manor
>>>
>>> Co-Founder & CTO | Equalum
>>>
>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>
>>> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov <yuva...@gmail.com>
>>> wrote:
>>>
>>>> I've been reading about, and watching videos on, the upcoming Spark
>>>> 2.0 release, which brings us Structured Streaming. One thing I have
>>>> yet to understand is how this relates to the current way of working
>>>> with streaming in Spark, i.e. the DStream abstraction.
>>>>
>>>> All the examples I can find, in the Spark repository and in various
>>>> videos, show someone streaming local JSON files or reading from
>>>> HDFS/S3/SQL. Also, when browsing the source, SparkSession seems to be
>>>> defined inside org.apache.spark.sql, which gives me a hunch that this
>>>> is all somehow related to SQL and the like, and not really to
>>>> DStreams.
>>>>
>>>> What I'm failing to understand is: Will this feature impact how we do
>>>> streaming today? Will I be able to consume a Kafka source in a
>>>> streaming fashion (like we do today when we open a stream using
>>>> KafkaUtils)? Will we be able to do stateful operations on a Dataset[T]
>>>> like we do today using MapWithStateRDD (see the sketch below for what
>>>> I mean)? Or will there be only a subset of operations that the
>>>> Catalyst optimizer can understand, such as aggregations?
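>>>>
>>>> For reference, this is roughly what "today" looks like for me: a
>>>> Kafka DStream with per-key running counts via mapWithState (the
>>>> broker, topic and checkpoint settings below are placeholders):
>>>>
>>>>    import org.apache.spark.SparkConf
>>>>    import org.apache.spark.streaming.kafka.KafkaUtils
>>>>    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}
>>>>
>>>>    val conf = new SparkConf().setAppName("map-with-state-today")
>>>>    val ssc = new StreamingContext(conf, Seconds(10))
>>>>    ssc.checkpoint("/tmp/checkpoints")  // mapWithState needs checkpointing
>>>>
>>>>    // (zkQuorum, groupId, topic -> #threads) are placeholder values.
>>>>    val words = KafkaUtils
>>>>      .createStream(ssc, "zk:2181", "my-group", Map("events" -> 1))
>>>>      .map(_._2)
>>>>
>>>>    // Keep a running count per key across batches.
>>>>    val updateCount = (key: String, value: Option[Int], state: State[Int]) => {
>>>>      val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
>>>>      state.update(sum)
>>>>      (key, sum)
>>>>    }
>>>>
>>>>    val counts = words.map(w => (w, 1))
>>>>      .reduceByKey(_ + _)
>>>>      .mapWithState(StateSpec.function(updateCount))
>>>>
>>>>    counts.print()
>>>>    ssc.start()
>>>>    ssc.awaitTermination()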
>>>>
>>>> I'd be happy if anyone could shed some light on this.
>>>>
