Or as an example we have a 5 minutes window and lateness of 5 minutes. We have the following events in the logs 10:00:01 PM ----> Already pushed to Kafka 10:00:30 PM ----> Already pushed to Kafka 10:01:00 PM ----> Already pushed to Kafka 10:03:45 PM ----> Already pushed to Kafka 10:04:00 PM ----> Log agent crashed for 30 minutes not delivered to Kafla yet 10:05:10 PM ----> Pushed to Kafka cause I came from a log agent that isn't dead.
Flink window of 10:00:00 10:00:01 PM ----> Received 10:00:30 PM ----> Received 10:01:00 PM ----> Received 10:03:45 PM ----> Received 10:04:00 PM ----> Still nothing.... Flink window of 10:00:00 5 lateness minutes are up. 10:00:01 PM ----> Counted 10:00:30 PM ----> Counted 10:01:00 PM ----> Counted 10:03:45 PM ----> Counted 10:04:00 PM ----> Still nothing.... Flink window of 10:05:00 started.... 10:05:10 PM.----> I'm new cause I came from a log agent that isn't dead. 10:04:00 PM ----> Still nothing.... Flink window of 10:05:00 5 lateness minutes are up. 10:05:10 PM.----> I have been counted, I'm happy! 10:04:00 PM ----> Still nothing.... And so on... Flink window of 10:30:00 started.... 10:04:00 PM ----> Hi guys, sorry I'm late 30 minutes, I ran into log agent problems. Sorry you are late, you missed the Flink bus. On Fri, 26 Nov 2021 at 10:53, John Smith <java.dev....@gmail.com> wrote: > Ok, > > So processing time we get 100% accuracy because we don't care when the > event comes, we just count and move along. > > As for event time processing, what I meant to say is if for example if the > log shipper is late at pushing events into Kafka, Flink will not notice > this, the watermarks will keep watermarking. So given that, let's say we > have a window of 5 minutes and a lateness of 5 minutes, it means we will > see counts on the "dashboard" every 10 minutes. But say the log shipper > fails/falls behind for 30 minutes or more, the Flink Kafka consumer will > simply not see any events and it will continue chugging along, after 30 > minutes a late event comes in at 2 windows already too late, that event is > discarded. > > Or did I miss the point on the last part? > > > > On Fri, 26 Nov 2021 at 09:38, Schwalbe Matthias < > matthias.schwa...@viseca.ch> wrote: > >> Actually not, because processing-time does not matter at all. >> >> Event-time timers are always compared to watermark-time progress. >> >> If system happens to be compromised for (say) 4 hours, also watermarks >> won’t progress, hence the windows get not evicted and wait for watermarks >> to pick up from when the system crashed. >> >> >> >> Your watermark strategy can decide how strict you handle time progress: >> >> - Super strict: the watermark time indicates that there will be no >> events with an older timestamp >> - Semi strict: you accept late events and give a time-range when this >> can happen (still processing time put aside) >> - You need to configure acceptable lateness in your windowing >> operator >> - Accepted lateness implies higher overall latency >> - Custom strategy >> - Use a combination of accepted lateness and a custom trigger in >> your windowing operator >> - The trigger decide when and how often window results are emitted >> - The following operator would the probably implement some >> idempotence/updating scheme for the window values >> - This way you get immediate low latency results and allow for >> later corrections if late events arrive >> >> >> >> My favorite source on this is Tyler Akidau’s book [1] and the excerpt >> blog: [2] [3] >> >> I believe his code uses Beam, but the same ideas can be implemented >> directly in Flink API >> >> >> >> [1] https://www.oreilly.com/library/view/streaming-systems/9781491983867/ >> >> [2] https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/ >> >> [3] https://www.oreilly.com/radar/the-world-beyond-batch-streaming-102/ >> >> >> >> … happy to discuss further 😊 >> >> >> >> Thias >> >> >> >> >> >> >> >> *From:* John Smith <java.dev....@gmail.com> >> *Sent:* Freitag, 26. November 2021 14:09 >> *To:* Schwalbe Matthias <matthias.schwa...@viseca.ch> >> *Cc:* Caizhi Weng <tsreape...@gmail.com>; user <user@flink.apache.org> >> *Subject:* Re: Windows and data loss. >> >> >> >> But if we use event time, if a failure happens potentially those events >> can't be delivered in their windo they will be dropped if they come after >> the lateness and watermark settings no? >> >> >> >> >> >> On Fri, 26 Nov 2021 at 02:35, Schwalbe Matthias < >> matthias.schwa...@viseca.ch> wrote: >> >> Hi John, >> >> >> >> Going with processing time is perfectly sound if the results meet your >> requirements and you can easily live with events misplaced into the wrong >> time window. >> >> This is also quite a bit cheaper resource-wise. >> >> However you might want to keep in mind situations when things break down >> (network interrupt, datacenter flooded etc. 😊). With processing time >> events count into the time window when processed, with event time they >> count into the time window when originally created a the source … even if >> processed much later … >> >> >> >> Thias >> >> >> >> >> >> >> >> *From:* John Smith <java.dev....@gmail.com> >> *Sent:* Freitag, 26. November 2021 02:55 >> *To:* Schwalbe Matthias <matthias.schwa...@viseca.ch> >> *Cc:* Caizhi Weng <tsreape...@gmail.com>; user <user@flink.apache.org> >> *Subject:* Re: Windows and data loss. >> >> >> >> Well what I'm thinking for 100% accuracy no data loss just to base the >> count on processing time. So whatever arrives in that window is counted. If >> I get some events of the "current" window late and they go into another >> window it's ok. >> >> My pipeline is like so.... >> >> browser(user)----->REST API------>log file------>Filebeat------>Kafka (18 >> partitions)----->flink----->destination >> >> Filebeat inserts into Kafka it's kindof a big bucket of "logs" which I >> use flink to filter the specific app and do the counts. The logs are round >> robin into the topic/partitions. Where I FORSEE a delay is Filebeat can't >> push fast enough into Kafka AND/OR the flink consumer has not read all >> events for that window from all partitions. >> >> >> >> On Thu, 25 Nov 2021 at 11:28, Schwalbe Matthias < >> matthias.schwa...@viseca.ch> wrote: >> >> Hi John, >> >> >> >> … just a short hint: >> >> With datastream API you can >> >> - hand-craft a trigger that decides when an how often emit >> intermediate, punctual and late window results, and when to evict the >> window and stop processing late events >> - in order to process late event you also need to specify for how >> long you will extend the window processing (or is that done in the trigger >> … I don’t remember right know) >> - overall window state grows, if you extend window processing to >> after it is finished … >> >> >> >> Hope this helps 😊 >> >> >> >> Thias >> >> >> >> *From:* Caizhi Weng <tsreape...@gmail.com> >> *Sent:* Donnerstag, 25. November 2021 02:56 >> *To:* John Smith <java.dev....@gmail.com> >> *Cc:* user <user@flink.apache.org> >> *Subject:* Re: Windows and data loss. >> >> >> >> Hi! >> >> >> >> Are you using the datastream API or the table / SQL API? I don't know if >> datastream API has this functionality, but in table / SQL API we have the >> following configurations [1]. >> >> - table.exec.emit.late-fire.enabled: Emit window results for late >> records; >> - table.exec.emit.late-fire.delay: How often shall we emit results >> for late records (for example, once per 10 minutes or for every record). >> >> >> >> [1] >> https://github.com/apache/flink/blob/601ef3b3bce040264daa3aedcb9d98ead8303485/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/plan/utils/WindowEmitStrategy.scala#L214 >> >> >> >> John Smith <java.dev....@gmail.com> 于2021年11月25日周四 上午12:45写道: >> >> Hi I understand that when using windows and having set the watermarks and >> lateness configs. That if an event comes late it is lost and we can >> output it to side output. >> >> But wondering is there a way to do it without the loss? >> >> I'm guessing an "all" window with a custom trigger that just fires X >> period and whatever is on that bucket is in that bucket? >> >> Diese Nachricht ist ausschliesslich für den Adressaten bestimmt und >> beinhaltet unter Umständen vertrauliche Mitteilungen. Da die >> Vertraulichkeit von e-Mail-Nachrichten nicht gewährleistet werden kann, >> übernehmen wir keine Haftung für die Gewährung der Vertraulichkeit und >> Unversehrtheit dieser Mitteilung. Bei irrtümlicher Zustellung bitten wir >> Sie um Benachrichtigung per e-Mail und um Löschung dieser Nachricht sowie >> eventueller Anhänge. Jegliche unberechtigte Verwendung oder Verbreitung >> dieser Informationen ist streng verboten. >> >> This message is intended only for the named recipient and may contain >> confidential or privileged information. As the confidentiality of email >> communication cannot be guaranteed, we do not accept any responsibility for >> the confidentiality and the intactness of this message. If you have >> received it in error, please advise the sender by return e-mail and delete >> this message and any attachments. Any unauthorised use or dissemination of >> this information is strictly prohibited. >> >> Diese Nachricht ist ausschliesslich für den Adressaten bestimmt und >> beinhaltet unter Umständen vertrauliche Mitteilungen. Da die >> Vertraulichkeit von e-Mail-Nachrichten nicht gewährleistet werden kann, >> übernehmen wir keine Haftung für die Gewährung der Vertraulichkeit und >> Unversehrtheit dieser Mitteilung. Bei irrtümlicher Zustellung bitten wir >> Sie um Benachrichtigung per e-Mail und um Löschung dieser Nachricht sowie >> eventueller Anhänge. Jegliche unberechtigte Verwendung oder Verbreitung >> dieser Informationen ist streng verboten. >> >> This message is intended only for the named recipient and may contain >> confidential or privileged information. As the confidentiality of email >> communication cannot be guaranteed, we do not accept any responsibility for >> the confidentiality and the intactness of this message. If you have >> received it in error, please advise the sender by return e-mail and delete >> this message and any attachments. Any unauthorised use or dissemination of >> this information is strictly prohibited. >> >> Diese Nachricht ist ausschliesslich für den Adressaten bestimmt und >> beinhaltet unter Umständen vertrauliche Mitteilungen. Da die >> Vertraulichkeit von e-Mail-Nachrichten nicht gewährleistet werden kann, >> übernehmen wir keine Haftung für die Gewährung der Vertraulichkeit und >> Unversehrtheit dieser Mitteilung. Bei irrtümlicher Zustellung bitten wir >> Sie um Benachrichtigung per e-Mail und um Löschung dieser Nachricht sowie >> eventueller Anhänge. Jegliche unberechtigte Verwendung oder Verbreitung >> dieser Informationen ist streng verboten. >> >> This message is intended only for the named recipient and may contain >> confidential or privileged information. As the confidentiality of email >> communication cannot be guaranteed, we do not accept any responsibility for >> the confidentiality and the intactness of this message. If you have >> received it in error, please advise the sender by return e-mail and delete >> this message and any attachments. Any unauthorised use or dissemination of >> this information is strictly prohibited. >> >