Thanks Michał! That is very good feedback.
-Matthias

On 6/16/17 2:38 AM, Michal Borowiecki wrote:
> I wonder if it's a frequent enough use case that Kafka Streams should
> consider providing this out of the box - this was asked for multiple
> times, right?
>
> Personally, I agree totally with the philosophy of "no final
> aggregation", as expressed in Eno's post, but IMO that is predicated
> entirely on event-time semantics.
>
> If users want processing-time semantics then, as the docs already point
> out, there is no such thing as a late-arriving record - every record
> just falls into the currently open window(s), hence the notion of a
> final aggregation makes perfect sense from the usability point of view.
>
> The single abstraction of "stream time" proves leaky in some cases
> (e.g. for the punctuate method - being addressed in KIP-138). Perhaps
> this is another case where processing-time semantics warrant explicit
> handling in the API - but of course, only if there's sufficient user
> demand for it.
>
> What I could imagine is a new type of time window
> (ProcessingTimeWindow?) such that, if it were used in an aggregation,
> the underlying processor would force the WallclockTimestampExtractor
> (KAFKA-4144 enables that) and would use the system-time punctuation
> (KIP-138) to send the final aggregation value once the window has
> expired, and it could be configured not to send intermediate updates
> while the window was open.
>
> Of course this is just a helper for users, since they can implement it
> all themselves using the low-level API, as Matthias pointed out
> already. It just seems there's recurring interest in this.
>
> Again, this only makes sense for processing-time semantics. For
> event-time semantics I find the arguments for "no final aggregation"
> totally convincing.
>
>
> Cheers,
>
> Michał
>
>
> On 16/06/17 00:08, Matthias J. Sax wrote:
>> Hi Paolo,
>>
>> This SO question might help, too:
>> https://stackoverflow.com/questions/38935904/how-to-send-final-kafka-streams-aggregation-result-of-a-time-windowed-ktable
>>
>> For Streams, the basic model is based on "change", and we report
>> updates to the "current" result immediately, reducing latency to a
>> minimum.
>>
>> Last, if you say it's going to fall into the next window, you won't
>> get event-time semantics but fall back to processing-time semantics,
>> which cannot provide exact results...
>>
>> If you really want to trade off correctness versus getting (late)
>> updates and want to use processing-time semantics, you should
>> configure WallclockTimestampExtractor and implement an "update
>> deduplication" operator using table.toStream().transform(). You can
>> attach a state store to your transformer and store all updates there
>> (i.e., newer updates overwrite older ones). Punctuations then allow
>> you to emit "final" results for windows whose end time has passed.
>>
>>
>> -Matthias
>>
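For illustration, a rough sketch of the "update deduplication" operator described above. None of this comes from the thread itself: the topic names, the "pending-windows" store name, and the Duration/PunctuationType API shapes of later Kafka Streams releases are assumptions, and it presumes WallclockTimestampExtractor is configured so that window boundaries follow the wall clock.

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class WindowedMaxFinalResult {

    static final String STORE = "pending-windows";  // hypothetical store name

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Buffer holding the latest result of each still-open window.
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.inMemoryKeyValueStore(STORE), Serdes.String(), Serdes.Long()));

        // Windowed max over 5 seconds (Paolo's example further down the thread).
        KTable<Windowed<String>, Long> maxPerWindow = builder
                .stream("numbers", Consumed.with(Serdes.String(), Serdes.Long()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
                .windowedBy(TimeWindows.of(Duration.ofSeconds(5)))
                .reduce(Math::max);

        // Swallow intermediate updates; emit one record per window once it expires.
        maxPerWindow.toStream()
                .transform(DedupTransformer::new, STORE)
                .to("max-per-window", Produced.with(Serdes.String(), Serdes.Long()));

        // Creating and starting the KafkaStreams instance is omitted here.
    }

    static class DedupTransformer
            implements Transformer<Windowed<String>, Long, KeyValue<String, Long>> {

        private ProcessorContext context;
        private KeyValueStore<String, Long> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            this.context = context;
            this.store = (KeyValueStore<String, Long>) context.getStateStore(STORE);
            // System-time punctuation (KIP-138): once per second, flush windows
            // whose end time has passed on the wall clock.
            context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME,
                    this::emitExpired);
        }

        @Override
        public KeyValue<String, Long> transform(Windowed<String> key, Long max) {
            // Newer updates overwrite older ones; nothing is forwarded downstream yet.
            store.put(key.key() + "@" + key.window().end(), max);
            return null;
        }

        private void emitExpired(long now) {
            List<KeyValue<String, Long>> expired = new ArrayList<>();
            try (KeyValueIterator<String, Long> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, Long> entry = it.next();
                    long windowEnd = Long.parseLong(
                            entry.key.substring(entry.key.lastIndexOf('@') + 1));
                    if (windowEnd <= now) {
                        expired.add(entry);
                    }
                }
            }
            for (KeyValue<String, Long> entry : expired) {  // emit outside the iterator
                context.forward(entry.key, entry.value);
                store.delete(entry.key);
            }
        }

        @Override
        public void close() {}
    }
}

Michał's ProcessingTimeWindow idea would essentially fold this boilerplate into the DSL.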
>> On 6/15/17 9:21 AM, Paolo Patierno wrote:
>>> Hi Eno,
>>>
>>> regarding closing windows, I think that it's up to the streaming
>>> application. I mean ...
>>>
>>> If I want something like I described, I know that a value outside my
>>> 5-second window will be taken into account for the next processing (in
>>> the next 5 seconds). I don't think I'm losing a record; I am aware that
>>> this record will fall into the next "processing" window. Btw I'll take
>>> a look at your article! Thanks!
>>>
>>>
>>> Paolo
>>>
>>> Paolo Patierno
>>> Senior Software Engineer (IoT) @ Red Hat
>>> Microsoft MVP on Windows Embedded & IoT
>>> Microsoft Azure Advisor
>>>
>>> Twitter : @ppatierno<http://twitter.com/ppatierno>
>>> Linkedin : paolopatierno<http://it.linkedin.com/in/paolopatierno>
>>> Blog : DevExperience<http://paolopatierno.wordpress.com/>
>>>
>>>
>>> ________________________________
>>> From: Eno Thereska <eno.there...@gmail.com>
>>> Sent: Thursday, June 15, 2017 3:57 PM
>>> To: users@kafka.apache.org
>>> Subject: Re: Kafka Streams vs Spark Streaming : reduce by window
>>>
>>> Hi Paolo,
>>>
>>> Yeah, so if you want fewer records, you should actually "not" disable
>>> the cache. If you disable the cache you'll get all the records, as you
>>> described.
>>>
>>> About closing windows: if you close a window and a late record arrives
>>> that should have been in that window, you basically lose the ability to
>>> process that record. In Kafka Streams we are robust to that, in that we
>>> handle late-arriving records. There is a comparison here, for example,
>>> with other methods that depend on watermarks or triggers:
>>> https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
>>>
>>> Eno
>>>
>>>
>>>> On 15 Jun 2017, at 14:57, Paolo Patierno <ppatie...@live.com> wrote:
>>>>
>>>> Hi Eno,
>>>>
>>>> thanks for the reply!
>>>>
>>>> Regarding the cache, I'm already using
>>>> CACHE_MAX_BYTES_BUFFERING_CONFIG = 0 (so disabling the cache).
>>>>
>>>> Regarding the Interactive Query API (I'll take a look), it means that
>>>> it's up to the application to do something like what we have out of
>>>> the box with Spark.
>>>>
>>>> May I ask what you mean by "We don't believe in closing windows"?
>>>> Isn't it much more code the user has to write to get the same result?
>>>>
>>>> I'm exploring Kafka Streams and it's very powerful IMHO, also because
>>>> the usage is pretty simple, but this scenario could be a gap compared
>>>> to Spark.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Paolo
>>>>
>>>>
>>>> Paolo Patierno
>>>> Senior Software Engineer (IoT) @ Red Hat
>>>> Microsoft MVP on Windows Embedded & IoT
>>>> Microsoft Azure Advisor
>>>>
>>>> Twitter : @ppatierno<http://twitter.com/ppatierno>
>>>> Linkedin : paolopatierno<http://it.linkedin.com/in/paolopatierno>
>>>> Blog : DevExperience<http://paolopatierno.wordpress.com/>
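For reference, the cache setting Paolo mentions, together with the related knobs discussed in this thread, would look roughly like the sketch below. The application id and bootstrap servers are placeholders, and the default-extractor config name is the one used in more recent Kafka Streams releases.

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.processor.WallclockTimestampExtractor;

public class WindowedMaxConfig {

    static Properties config() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-max");       // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder

        // 0 disables the record cache, so every input record produces a downstream
        // update (Paolo's setting). A non-zero cache plus a longer commit interval
        // reduces the number of intermediate updates instead.
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 5_000);

        // For processing-time semantics, as discussed earlier in the thread.
        props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
                WallclockTimestampExtractor.class);
        return props;
    }
}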
>>>> ________________________________
>>>> From: Eno Thereska <eno.there...@gmail.com>
>>>> Sent: Thursday, June 15, 2017 1:45 PM
>>>> To: users@kafka.apache.org
>>>> Subject: Re: Kafka Streams vs Spark Streaming : reduce by window
>>>>
>>>> Hi Paolo,
>>>>
>>>> That is indeed correct. We don't believe in closing windows in Kafka
>>>> Streams. You could reduce the number of downstream records by using
>>>> record caches:
>>>> http://docs.confluent.io/current/streams/developer-guide.html#record-caches-in-the-dsl
>>>>
>>>> Alternatively you can just query the KTable whenever you want using
>>>> the Interactive Query APIs (so *when* you query dictates what data you
>>>> receive); see
>>>> https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
>>>>
>>>> Thanks
>>>> Eno
>>>>
>>>>> On Jun 15, 2017, at 2:38 PM, Paolo Patierno <ppatie...@live.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> using the Streams library I noticed a difference (or there is a lack
>>>>> of knowledge on my side) compared with Apache Spark.
>>>>>
>>>>> Imagine the following scenario ...
>>>>>
>>>>> I have a source topic where numeric values come in and I want to
>>>>> check the maximum value in the latest 5 seconds but ... putting the
>>>>> max value into a destination topic every 5 seconds.
>>>>>
>>>>> This is what happens with the reduceByWindow method in Spark.
>>>>>
>>>>> I'm using reduce on a KStream here that processes the max value,
>>>>> taking into account previous values in the latest 5 seconds, but the
>>>>> final value is put into the destination topic for each incoming value.
>>>>>
>>>>> For example ...
>>>>>
>>>>> An application sends numeric values every 1 second.
>>>>>
>>>>> With Spark ... the source gets values every 1 second, processes the
>>>>> max in a window of 5 seconds, and puts the max into the destination
>>>>> every 5 seconds (so when the window ends). If the sequence is
>>>>> 21, 25, 22, 20, 26 the output will be just 26.
>>>>>
>>>>> With Kafka Streams ... the source gets values every 1 second,
>>>>> processes the max in a window of 5 seconds, and puts the max into the
>>>>> destination every 1 second (so every time an incoming value arrives).
>>>>> Of course, if for example the sequence is 21, 25, 22, 20, 26 ... the
>>>>> output will be 21, 25, 25, 25, 26.
>>>>>
>>>>> Is it possible with Kafka Streams? Or is it something to do at the
>>>>> application level?
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Paolo
>>>>>
>>>>>
>>>>> Paolo Patierno
>>>>> Senior Software Engineer (IoT) @ Red Hat
>>>>> Microsoft MVP on Windows Embedded & IoT
>>>>> Microsoft Azure Advisor
>>>>>
>>>>> Twitter : @ppatierno<http://twitter.com/ppatierno>
>>>>> Linkedin : paolopatierno<http://it.linkedin.com/in/paolopatierno>
>>>>> Blog : DevExperience<http://paolopatierno.wordpress.com/>
>
> --
> Michal Borowiecki
> Senior Software Engineer L4
> OpenBet Ltd
> E: michal.borowie...@openbet.com
> W: www.openbet.com <http://www.openbet.com/>
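To make the interactive-query alternative Eno mentions concrete, here is a minimal, hypothetical sketch. The store name "max-store", the key, and the Instant-based fetch signature of newer Kafka Streams releases are assumptions; it presumes the windowed reduce was materialized, e.g. with .reduce(Math::max, Materialized.as("max-store")).

import java.time.Duration;
import java.time.Instant;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyWindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

public class QueryWindowedMax {

    // Reads the max of the last 5 seconds for one key, whenever the caller decides.
    static void printRecentMax(KafkaStreams streams, String key) {
        ReadOnlyWindowStore<String, Long> store =
                streams.store("max-store", QueryableStoreTypes.windowStore());

        Instant to = Instant.now();
        Instant from = to.minus(Duration.ofSeconds(5));
        try (WindowStoreIterator<Long> it = store.fetch(key, from, to)) {
            while (it.hasNext()) {
                KeyValue<Long, Long> next = it.next();  // key = window start (ms), value = max
                System.out.println("window starting at " + next.key + " -> max " + next.value);
            }
        }
    }
}

With this approach the application chooses when to read the result, e.g. once per window length, instead of consuming every intermediate update from the output topic.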