Hi Burak,

My responses are inline.

Thanks a lot!

On Mon, May 22, 2017 at 9:26 AM, Burak Yavuz <brk...@gmail.com> wrote:

> Hi Kant,
>
>>
>>
>> 1. Can we use Spark Structured Streaming for stateless transformations
>> just like we would do with DStreams, or is Spark Structured Streaming
>> only meant for stateful computations?
>>
>
> Of course you can do stateless transformations. Any map, filter, or select
> type of transformation is stateless. Aggregations are generally stateful.
> You could also perform arbitrary stateless aggregations with "
> flatMapGroups
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L145>"
> or make them stateful with "flatMapGroupsWithState
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L376>
> ".
>
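The stateless/stateful distinction above can be sketched without Spark at all. This is a simplified illustration only, not the Spark API; the helper functions below are hypothetical stand-ins for map/filter (stateless) and a streaming groupBy().count() (stateful):

```python
# Illustration only -- these helpers are hypothetical, not the Spark API.
# Stateless operations transform each record independently; stateful
# aggregations must carry running state across records.
from collections import Counter

def stateless(events):
    """Like map/filter/select: each element is handled on its own,
    with no memory of earlier input."""
    return [e.lower() for e in events if e]

def stateful_counts(events):
    """Like a streaming groupBy().count(): a running count per key
    has to be kept across the whole stream."""
    state = Counter()
    for e in events:
        state[e] += 1  # state survives from one record to the next
    return dict(state)
```

The point is that `stateless` could be restarted at any record with no loss, while `stateful_counts` is only correct if its accumulated state is preserved across the stream, which is why Spark has to checkpoint state for such operations.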

    *Got it, so Spark Structured Streaming does both stateful and stateless
transformations. In that case I am assuming the DStreams API will be
deprecated? How about groupBy? That is stateful, right?*

>
>
>
>> 2. When we use groupBy and Window operations for event time processing
>> and specify a watermark, does this mean the timestamp field in each message
>> is compared to the processing time of that machine/node, and events that
>> are later than the specified threshold are discarded? If we don't specify a
>> watermark, I am assuming the processing time won't come into the picture. Is
>> that right? Just trying to understand the interplay between processing time
>> and event time when we do event time processing.
>>
> Watermarks are tracked with respect to the event time of your data, not
> the processing time of the machine. Please take a look at the blog below
> for more details:
> https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
>

*Thanks for this article. I am not sure if I am interpreting the article
incorrectly, but it looks like the article shows there is indeed a
relationship between processing time and event time. For example,*
*say I set a watermark of 10 minutes and*

*1. I send one message which has an event timestamp of May 22 2017 1 PM and
a processing time of May 22 2017 1:02 PM.*


*2. I send another message which has an event time of May 22 2017 12:55 PM
and a processing time of May 23 2017 1 PM.*

*Simply put, say I am just faking my event timestamps to meet the cutoff
specified by the watermark, but I am actually sending the messages a day or
a week later. How does Spark Structured Streaming handle this case?*
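The rule described in the article can be modeled in a few lines. This is a simplified, Spark-free illustration of the watermark rule, not Spark's implementation (Spark recomputes the watermark per trigger rather than per record): the watermark is the maximum *event time* seen so far minus the allowed lateness, and arrival (processing) time never enters the comparison.

```python
# Simplified model of watermarking (illustration only, not Spark's code).
# Watermark = max event time seen so far minus the allowed lateness.
# A record is dropped only when its event time falls behind the watermark,
# no matter how late it physically arrives.

def process(event_times_min, lateness_min):
    """Return the event times (in minutes) that survive the watermark,
    given events in arrival order."""
    max_event_time = None
    kept = []
    for t in event_times_min:
        watermark = None if max_event_time is None else max_event_time - lateness_min
        if watermark is None or t >= watermark:
            kept.append(t)  # event time is at or past the watermark: accepted
        max_event_time = t if max_event_time is None else max(max_event_time, t)
    return kept
```

Applied to the scenario above with a 10-minute watermark: after the 1 PM event, the watermark sits at 12:50 PM, so the 12:55 PM event is still accepted even if it arrives a day later, because only newer *event times*, not the passage of wall-clock time, advance the watermark. It would be dropped only once later event times had pushed the watermark past 12:55 PM.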

>
>
> Best,
> Burak
>
