Ben, I'm just a Spark user - but at least at the Spark Summit in March, that was the main term used. Taking a step back from the details, maybe this new post from Reynold is a better intro to the Spark 2.0 highlights: https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
If you want to drill down, go to SPARK-8360, "Structured Streaming (aka Streaming DataFrames)". The design doc (written by Reynold in March) is very readable: https://issues.apache.org/jira/browse/SPARK-8360

Regarding directly querying (with SQL) the state managed by a streaming process - I don't know if that will land in 2.0 or only later.

Hope that helps,

Ofir Manor
Co-Founder & CTO | Equalum
Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Sun, May 15, 2016 at 11:58 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

> Hi Ofir,
>
> I just recently saw the webinar with Reynold Xin. He mentioned the
> SparkSession unification efforts, but I don't remember the Dataset for
> Structured Streaming, aka Continuous Applications as he put it. He did
> mention streaming, or unlimited, DataFrames for Structured Streaming, so
> one can directly query the data from them. Has something changed since
> then?
>
> Thanks,
> Ben
>
>
> On May 15, 2016, at 1:42 PM, Ofir Manor <ofir.ma...@equalum.io> wrote:
>
> Hi Yuval,
> let me share my understanding, based on similar questions I had.
> First, Spark 2.x aims to replace a whole bunch of its APIs with just two
> main ones - SparkSession (replacing the Hive / SQL / Spark contexts) and
> Dataset (a merge of Dataset and DataFrame - which is why it inherits all
> the Spark SQL goodness), while RDD remains a low-level API only for
> special cases. The new Dataset should also support both batch and
> streaming - eventually replacing DStream as well. See the design docs in
> SPARK-13485 (unified API) and SPARK-8360 (Structured Streaming) for a
> good intro.
> However, as you noted, not all of this will be fully delivered in 2.0.
> For example, it seems that streaming from / to Kafka using Structured
> Streaming didn't make it (so far?) into 2.0 (which is a showstopper for
> me).
> Anyway, as far as I understand, you should be able to apply stateful
> operators (non-RDD) on Datasets (for example, the new event-time window
> processing from SPARK-8360).
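To make that last point concrete, here is a rough sketch of what an event-time windowed aggregation looks like with the new API, based on the SPARK-8360 design doc and the 2.0 technical preview. The `Event` case class, field names, and the `/data/events` path are placeholders, and the API may still change before 2.0 GA:

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.window

// Hypothetical event type - the field names are placeholders
case class Event(word: String, eventTime: Timestamp)

val spark = SparkSession.builder
  .appName("StructuredStreamingSketch")
  .getOrCreate()
import spark.implicits._

// An unbounded Dataset, fed by new JSON files appearing in a directory
// (the file source did make it into 2.0; the path is a placeholder)
val events = spark.readStream
  .schema(Encoders.product[Event].schema)
  .json("/data/events")
  .as[Event]

// Event-time windowed aggregation: running counts per 10-minute window
val counts = events
  .groupBy(window($"eventTime", "10 minutes"), $"word")
  .count()

// Continuously write the updated aggregates to the console
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()
```

Note that the same `groupBy` / `count` code would work unchanged on a batch Dataset - that is the whole point of the unified API.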
> The gap I see is mostly that only a limited set of streaming sources /
> sinks has been migrated to the new (richer) API and semantics.
> Anyway, I'm pretty sure that once 2.0 gets to RC, the documentation and
> examples will be aligned with the actual offering...
>
> Ofir Manor
> Co-Founder & CTO | Equalum
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov <yuva...@gmail.com>
> wrote:
>
>> I've been reading/watching videos about the upcoming Spark 2.0 release,
>> which brings us Structured Streaming. One thing I've yet to understand
>> is how this relates to the current way of working with streaming in
>> Spark, via the DStream abstraction.
>>
>> All the examples I can find, in the Spark repository and in various
>> videos, show someone streaming local JSON files or reading from
>> HDFS/S3/SQL. Also, when browsing the source, SparkSession seems to be
>> defined inside org.apache.spark.sql, which gives me a hunch that this is
>> all somehow related to SQL and the like, and not really to DStreams.
>>
>> What I'm failing to understand is: will this feature change how we do
>> streaming today? Will I be able to consume a Kafka source in a streaming
>> fashion (like we do today when we open a stream using KafkaUtils)? Will
>> we be able to do stateful operations on a Dataset[T] like we do today
>> using MapWithStateRDD? Or will there only be a subset of operations that
>> the Catalyst optimizer can understand, such as aggregations?
>>
>> I'd be happy if anyone could shed some light on this.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
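For comparison, the "today" pipeline Yuval describes - a Kafka direct stream plus `mapWithState` on the Spark 1.6 DStream API - looks roughly like this. The broker address, topic name, and checkpoint path are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("DStreamKafkaSketch")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/checkpoint")   // stateful operators require a checkpoint dir

// Broker address and topic name are placeholders
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

// Keep a running count per word in Spark-managed state
// (backed by MapWithStateRDD, as mentioned in the question)
val spec = StateSpec.function((word: String, one: Option[Int], state: State[Long]) => {
  val total = state.getOption.getOrElse(0L) + one.getOrElse(0)
  state.update(total)
  (word, total)
})

stream.flatMap { case (_, line) => line.split(" ") }
  .map(word => (word, 1))
  .mapWithState(spec)
  .print()

ssc.start()
ssc.awaitTermination()
```

The open question in the thread is which parts of this (the Kafka source, the explicit key-value state) carry over to Structured Streaming in 2.0, and which only land later.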
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>