Thanks Michał! That is very good feedback.
-Matthias

On 6/16/17 2:38 AM, Michal Borowiecki wrote:
> I wonder if it's a frequent enough use case that Kafka Streams should
> consider providing this out of the box - this was asked for multiple
> times, right?
>
> Personally, I agree totally with the philosophy of "no final
> aggregation", as expressed in Eno's post, but IMO that is predicated
> entirely on event-time semantics.
>
> If users want processing-time semantics then, as the docs already point
> out, there is no such thing as a late-arriving record - every record
> just falls into the currently open window(s), hence the notion of a
> final aggregation makes perfect sense from the usability point of view.
>
> The single abstraction of "stream time" proves leaky in some cases
> (e.g. for the punctuate method - being addressed in KIP-138). Perhaps
> this is another case where processing-time semantics warrant explicit
> handling in the API - but of course, only if there's sufficient user
> demand for it.
>
> What I could imagine is a new type of time window
> (ProcessingTimeWindow?) such that, if it were used in an aggregation,
> the underlying processor would force the WallclockTimestampExtractor
> (KAFKA-4144 enables that) and would use the system-time punctuation
> (KIP-138) to send the final aggregation value once the window has
> expired, and it could be configured not to send intermediate updates
> while the window was open.
>
> Of course this is just a helper for users, since they can implement it
> all themselves using the low-level API, as Matthias pointed out
> already. It just seems there's recurring interest in this.
>
> Again, this only makes sense for processing-time semantics. For
> event-time semantics I find the arguments for "no final aggregation"
> totally convincing.
>
>
> Cheers,
>
> Michał
>
>
> On 16/06/17 00:08, Matthias J. Sax wrote:
>> Hi Paolo,
>>
>> This SO question might help, too:
>> https://stackoverflow.com/questions/38935904/how-to-send-final-kafka-streams-aggregation-result-of-a-time-windowed-ktable
>>
>> For Streams, the basic model is based on "change", and we report
>> updates to the "current" result immediately, reducing latency to a
>> minimum.
>>
>> Last, if you say it's going to fall into the next window, you won't
>> get event-time semantics but fall back to processing-time semantics,
>> which cannot provide exact results...
>>
>> If you really want to trade off correctness versus getting (late)
>> updates and want to use processing-time semantics, you should
>> configure WallclockTimestampExtractor and implement an "update
>> deduplication" operator using table.toStream().transform(). You can
>> attach a state store to your transformer and store all updates there
>> (i.e., newer updates overwrite older ones). Punctuations then allow
>> you to emit "final" results for windows whose end time has passed.
>>
>>
>> -Matthias
>>
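For illustration, a rough sketch of the "update deduplication" operator described above. None of this comes from the thread itself: the topic names, the "pending-windows" store name, and the Duration/PunctuationType API shapes of later Kafka Streams releases are assumptions, and it presumes WallclockTimestampExtractor is configured so that window boundaries follow the wall clock.

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class WindowedMaxFinalResult {

    static final String STORE = "pending-windows";  // hypothetical store name

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Buffer holding the latest result of each still-open window.
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.inMemoryKeyValueStore(STORE), Serdes.String(), Serdes.Long()));

        // Windowed max over 5 seconds (Paolo's example further down the thread).
        KTable<Windowed<String>, Long> maxPerWindow = builder
                .stream("numbers", Consumed.with(Serdes.String(), Serdes.Long()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
                .windowedBy(TimeWindows.of(Duration.ofSeconds(5)))
                .reduce(Math::max);

        // Swallow intermediate updates; emit one record per window once it expires.
        maxPerWindow.toStream()
                .transform(DedupTransformer::new, STORE)
                .to("max-per-window", Produced.with(Serdes.String(), Serdes.Long()));

        // Creating and starting the KafkaStreams instance is omitted here.
    }

    static class DedupTransformer
            implements Transformer<Windowed<String>, Long, KeyValue<String, Long>> {

        private ProcessorContext context;
        private KeyValueStore<String, Long> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            this.context = context;
            this.store = (KeyValueStore<String, Long>) context.getStateStore(STORE);
            // System-time punctuation (KIP-138): once per second, flush windows
            // whose end time has passed on the wall clock.
            context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME,
                    this::emitExpired);
        }

        @Override
        public KeyValue<String, Long> transform(Windowed<String> key, Long max) {
            // Newer updates overwrite older ones; nothing is forwarded downstream yet.
            store.put(key.key() + "@" + key.window().end(), max);
            return null;
        }

        private void emitExpired(long now) {
            List<KeyValue<String, Long>> expired = new ArrayList<>();
            try (KeyValueIterator<String, Long> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, Long> entry = it.next();
                    long windowEnd = Long.parseLong(
                            entry.key.substring(entry.key.lastIndexOf('@') + 1));
                    if (windowEnd <= now) {
                        expired.add(entry);
                    }
                }
            }
            for (KeyValue<String, Long> entry : expired) {  // emit outside the iterator
                context.forward(entry.key, entry.value);
                store.delete(entry.key);
            }
        }

        @Override
        public void close() {}
    }
}

Michał's ProcessingTimeWindow idea would essentially fold this boilerplate into the DSL.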
>> On 6/15/17 9:21 AM, Paolo Patierno wrote:
>>> Hi Eno,
>>>
>>> regarding closing windows, I think that it's up to the streaming
>>> application. I mean ...
>>>
>>> If I want something like I described, I know that a value outside my
>>> 5-second window will be taken into account for the next processing (in
>>> the next 5 seconds). I don't think I'm losing a record; I am aware that
>>> this record will fall into the next "processing" window. Btw I'll take
>>> a look at your article! Thanks!
>>>
>>>
>>> Paolo
>>>
>>> Paolo Patierno
>>> Senior Software Engineer (IoT) @ Red Hat
>>> Microsoft MVP on Windows Embedded & IoT
>>> Microsoft Azure Advisor
>>>
>>> Twitter : @ppatierno<http://twitter.com/ppatierno>
>>> Linkedin : paolopatierno<http://it.linkedin.com/in/paolopatierno>
>>> Blog : DevExperience<http://paolopatierno.wordpress.com/>
>>>
>>>
>>> ________________________________
>>> From: Eno Thereska <eno.there...@gmail.com>
>>> Sent: Thursday, June 15, 2017 3:57 PM
>>> To: users@kafka.apache.org
>>> Subject: Re: Kafka Streams vs Spark Streaming : reduce by window
>>>
>>> Hi Paolo,
>>>
>>> Yeah, so if you want fewer records, you should actually "not" disable
>>> the cache. If you disable the cache you'll get all the records, as you
>>> described.
>>>
>>> About closing windows: if you close a window and a late record arrives
>>> that should have been in that window, you basically lose the ability to
>>> process that record. In Kafka Streams we are robust to that, in that we
>>> handle late-arriving records. There is a comparison here, for example,
>>> with other methods that depend on watermarks or triggers:
>>> https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
>>>
>>> Eno
>>>
>>>
>>>> On 15 Jun 2017, at 14:57, Paolo Patierno <ppatie...@live.com> wrote:
>>>>
>>>> Hi Eno,
>>>>
>>>> thanks for the reply!
>>>>
>>>> Regarding the cache, I'm already using
>>>> CACHE_MAX_BYTES_BUFFERING_CONFIG = 0 (so disabling the cache).
>>>>
>>>> Regarding the Interactive Query API (I'll take a look), it means that
>>>> it's up to the application to do something like what we have out of
>>>> the box with Spark.
>>>>
>>>> May I ask what you mean by "We don't believe in closing windows"?
>>>> Isn't it much more code the user has to write to get the same result?
>>>>
>>>> I'm exploring Kafka Streams and it's very powerful IMHO, also because
>>>> the usage is pretty simple, but this scenario could be a gap compared
>>>> to Spark.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Paolo
>>>>
>>>>
>>>> Paolo Patierno
>>>> Senior Software Engineer (IoT) @ Red Hat
>>>> Microsoft MVP on Windows Embedded & IoT
>>>> Microsoft Azure Advisor
>>>>
>>>> Twitter : @ppatierno<http://twitter.com/ppatierno>
>>>> Linkedin : paolopatierno<http://it.linkedin.com/in/paolopatierno>
>>>> Blog : DevExperience<http://paolopatierno.wordpress.com/>
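For reference, the cache setting Paolo mentions, together with the related knobs discussed in this thread, would look roughly like the sketch below. The application id and bootstrap servers are placeholders, and the default-extractor config name is the one used in more recent Kafka Streams releases.

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.processor.WallclockTimestampExtractor;

public class WindowedMaxConfig {

    static Properties config() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-max");       // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder

        // 0 disables the record cache, so every input record produces a downstream
        // update (Paolo's setting). A non-zero cache plus a longer commit interval
        // reduces the number of intermediate updates instead.
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 5_000);

        // For processing-time semantics, as discussed earlier in the thread.
        props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
                WallclockTimestampExtractor.class);
        return props;
    }
}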
>>>> ________________________________
>>>> From: Eno Thereska <eno.there...@gmail.com>
>>>> Sent: Thursday, June 15, 2017 1:45 PM
>>>> To: users@kafka.apache.org
>>>> Subject: Re: Kafka Streams vs Spark Streaming : reduce by window
>>>>
>>>> Hi Paolo,
>>>>
>>>> That is indeed correct. We don't believe in closing windows in Kafka
>>>> Streams. You could reduce the number of downstream records by using
>>>> record caches:
>>>> http://docs.confluent.io/current/streams/developer-guide.html#record-caches-in-the-dsl
>>>>
>>>> Alternatively you can just query the KTable whenever you want using
>>>> the Interactive Query APIs (so *when* you query dictates what data you
>>>> receive); see
>>>> https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
>>>>
>>>> Thanks
>>>> Eno
>>>>
>>>>> On Jun 15, 2017, at 2:38 PM, Paolo Patierno <ppatie...@live.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> using the Streams library I noticed a difference (or there is a lack
>>>>> of knowledge on my side) compared with Apache Spark.
>>>>>
>>>>> Imagine the following scenario ...
>>>>>
>>>>> I have a source topic where numeric values come in and I want to
>>>>> check the maximum value in the latest 5 seconds but ... putting the
>>>>> max value into a destination topic every 5 seconds.
>>>>>
>>>>> This is what happens with the reduceByWindow method in Spark.
>>>>>
>>>>> I'm using reduce on a KStream here that processes the max value,
>>>>> taking into account previous values in the latest 5 seconds, but the
>>>>> final value is put into the destination topic for each incoming value.
>>>>>
>>>>> For example ...
>>>>>
>>>>> An application sends numeric values every 1 second.
>>>>>
>>>>> With Spark ... the source gets values every 1 second, processes the
>>>>> max in a window of 5 seconds, and puts the max into the destination
>>>>> every 5 seconds (so when the window ends). If the sequence is
>>>>> 21, 25, 22, 20, 26 the output will be just 26.
>>>>>
>>>>> With Kafka Streams ... the source gets values every 1 second,
>>>>> processes the max in a window of 5 seconds, and puts the max into the
>>>>> destination every 1 second (so every time an incoming value arrives).
>>>>> Of course, if for example the sequence is 21, 25, 22, 20, 26 ... the
>>>>> output will be 21, 25, 25, 25, 26.
>>>>>
>>>>> Is it possible with Kafka Streams? Or is it something to do at the
>>>>> application level?
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Paolo
>>>>>
>>>>>
>>>>> Paolo Patierno
>>>>> Senior Software Engineer (IoT) @ Red Hat
>>>>> Microsoft MVP on Windows Embedded & IoT
>>>>> Microsoft Azure Advisor
>>>>>
>>>>> Twitter : @ppatierno<http://twitter.com/ppatierno>
>>>>> Linkedin : paolopatierno<http://it.linkedin.com/in/paolopatierno>
>>>>> Blog : DevExperience<http://paolopatierno.wordpress.com/>
>
> --
> Michal Borowiecki
> Senior Software Engineer L4
> OpenBet Ltd
> E: michal.borowie...@openbet.com
> W: www.openbet.com <http://www.openbet.com/>
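To make the interactive-query alternative Eno mentions concrete, here is a minimal, hypothetical sketch. The store name "max-store", the key, and the Instant-based fetch signature of newer Kafka Streams releases are assumptions; it presumes the windowed reduce was materialized, e.g. with .reduce(Math::max, Materialized.as("max-store")).

import java.time.Duration;
import java.time.Instant;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyWindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

public class QueryWindowedMax {

    // Reads the max of the last 5 seconds for one key, whenever the caller decides.
    static void printRecentMax(KafkaStreams streams, String key) {
        ReadOnlyWindowStore<String, Long> store =
                streams.store("max-store", QueryableStoreTypes.windowStore());

        Instant to = Instant.now();
        Instant from = to.minus(Duration.ofSeconds(5));
        try (WindowStoreIterator<Long> it = store.fetch(key, from, to)) {
            while (it.hasNext()) {
                KeyValue<Long, Long> next = it.next();  // key = window start (ms), value = max
                System.out.println("window starting at " + next.key + " -> max " + next.value);
            }
        }
    }
}

With this approach the application chooses when to read the result, e.g. once per window length, instead of consuming every intermediate update from the output topic.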