As an example, say we have a 5-minute window and an allowed lateness of 5 minutes.

We have the following events in the logs:
10:00:01 PM ----> Already pushed to Kafka
10:00:30 PM ----> Already pushed to Kafka
10:01:00 PM ----> Already pushed to Kafka
10:03:45 PM ----> Already pushed to Kafka
10:04:00 PM ----> Log agent crashed for 30 minutes; not delivered to Kafka
yet
10:05:10 PM ----> Pushed to Kafka because I came from a log agent that
isn't dead.

Flink window of 10:00:00
10:00:01 PM ----> Received
10:00:30 PM ----> Received
10:01:00 PM ----> Received
10:03:45 PM ----> Received
10:04:00 PM ----> Still nothing....

Flink window of 10:00:00: the 5 minutes of lateness are up.
10:00:01 PM ----> Counted
10:00:30 PM ----> Counted
10:01:00 PM ----> Counted
10:03:45 PM ----> Counted
10:04:00 PM ----> Still nothing....

Flink window of 10:05:00 started....
10:05:10 PM ----> I'm new because I came from a log agent that isn't dead.
10:04:00 PM ----> Still nothing....

Flink window of 10:05:00: the 5 minutes of lateness are up.
10:05:10 PM ----> I have been counted, I'm happy!
10:04:00 PM ----> Still nothing....

And so on...

Flink window of 10:30:00 started....
10:04:00 PM ----> Hi guys, sorry I'm 30 minutes late, I ran into log agent
problems. Sorry, you're too late, you missed the Flink bus.
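The timeline above can be sketched as a toy simulation (plain Java, not Flink code): a 5-minute event-time window with 5 minutes of allowed lateness, assuming a simplistic watermark equal to the highest event timestamp seen so far.

```java
// Toy model of event-time windowing with allowed lateness.
// Times are seconds after 10:00:00 PM; events are fed in arrival order.
public class LatenessDemo {
    static final long WINDOW_SEC = 5 * 60;   // 5-minute windows
    static final long LATENESS_SEC = 5 * 60; // 5 minutes of allowed lateness

    long watermark = Long.MIN_VALUE;

    /** Returns true if the event still makes it into its window, false if dropped. */
    boolean accept(long eventTimeSec) {
        watermark = Math.max(watermark, eventTimeSec);                  // watermark = max timestamp seen
        long windowEnd = (eventTimeSec / WINDOW_SEC + 1) * WINDOW_SEC;  // end of this event's window
        return windowEnd + LATENESS_SEC > watermark;                    // still within allowed lateness?
    }

    public static void main(String[] args) {
        LatenessDemo demo = new LatenessDemo();
        System.out.println(demo.accept(1));    // 10:00:01 -> true (on time)
        System.out.println(demo.accept(225));  // 10:03:45 -> true (on time)
        System.out.println(demo.accept(310));  // 10:05:10 -> true (next window)
        System.out.println(demo.accept(1800)); // 10:30:00 -> true; watermark is now 10:30
        System.out.println(demo.accept(240));  // 10:04:00, 30 min late -> false: missed the Flink bus
    }
}
```

The 10:04:00 event is rejected because by the time it arrives, the watermark has advanced past its window's end plus the allowed lateness.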


On Fri, 26 Nov 2021 at 10:53, John Smith <java.dev....@gmail.com> wrote:

> Ok,
>
> So with processing time we get 100% accuracy because we don't care when the
> event comes; we just count and move along.
>
> As for event-time processing, what I meant to say is that if, for example,
> the log shipper is late at pushing events into Kafka, Flink will not notice
> this; the watermarks will keep watermarking. So given that, let's say we
> have a window of 5 minutes and a lateness of 5 minutes; it means we will
> see counts on the "dashboard" every 10 minutes. But say the log shipper
> fails/falls behind for 30 minutes or more: the Flink Kafka consumer will
> simply not see any events and will continue chugging along. After 30
> minutes a late event comes in, already 2 windows too late, and that event
> is discarded.
>
> Or did I miss the point on the last part?
>
>
>
> On Fri, 26 Nov 2021 at 09:38, Schwalbe Matthias <
> matthias.schwa...@viseca.ch> wrote:
>
>> Actually not, because processing-time does not matter at all.
>>
>> Event-time timers are always compared to watermark-time progress.
>>
>> If the system happens to be down for (say) 4 hours, the watermarks won't
>> progress either; hence the windows are not evicted and wait for the
>> watermarks to pick up from where the system crashed.
>>
>>
>>
>> Your watermark strategy can decide how strict you handle time progress:
>>
>>    - Super strict: the watermark time indicates that there will be no
>>    events with an older timestamp
>>    - Semi strict: you accept late events and give a time range within
>>    which this can happen (processing time still put aside)
>>       - You need to configure acceptable lateness in your windowing
>>       operator
>>       - Accepted lateness implies higher overall latency
>>    - Custom strategy
>>       - Use a combination of accepted lateness and a custom trigger in
>>       your windowing operator
>>       - The trigger decides when and how often window results are emitted
>>       - The following operator would then probably implement some
>>       idempotence/updating scheme for the window values
>>       - This way you get immediate low latency results and allow for
>>       later corrections if late events arrive
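The "semi strict" option above can be sketched with Flink's bounded-out-of-orderness watermark strategy. This is only a sketch: the event type `MyEvent` and its `getEventTime()` accessor are hypothetical placeholders.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// Semi-strict watermarking: tolerate events up to 5 minutes out of order.
// Watermarks trail the highest seen timestamp by 5 minutes, so windows wait
// that long before being considered complete.
WatermarkStrategy<MyEvent> strategy =
        WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                .withTimestampAssigner((event, previousTs) -> event.getEventTime());
```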
>>
>>
>>
>> My favorite source on this is Tyler Akidau’s book [1] and the excerpt
>> blog posts: [2] [3]
>>
>> I believe his code uses Beam, but the same ideas can be implemented
>> directly in the Flink API.
>>
>>
>>
>> [1] https://www.oreilly.com/library/view/streaming-systems/9781491983867/
>>
>> [2] https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/
>>
>> [3] https://www.oreilly.com/radar/the-world-beyond-batch-streaming-102/
>>
>>
>>
>> … happy to discuss further 😊
>>
>>
>>
>> Thias
>>
>>
>>
>>
>>
>>
>>
>> *From:* John Smith <java.dev....@gmail.com>
>> *Sent:* Freitag, 26. November 2021 14:09
>> *To:* Schwalbe Matthias <matthias.schwa...@viseca.ch>
>> *Cc:* Caizhi Weng <tsreape...@gmail.com>; user <user@flink.apache.org>
>> *Subject:* Re: Windows and data loss.
>>
>>
>>
>> But if we use event time and a failure happens, potentially those events
>> can't be delivered in their window; they will be dropped if they come in
>> after the lateness and watermark settings, no?
>>
>>
>>
>>
>>
>> On Fri, 26 Nov 2021 at 02:35, Schwalbe Matthias <
>> matthias.schwa...@viseca.ch> wrote:
>>
>> Hi John,
>>
>>
>>
>> Going with processing time is perfectly sound if the results meet your
>> requirements and you can easily live with events misplaced into the wrong
>> time window.
>>
>> This is also quite a bit cheaper resource-wise.
>>
>> However, you might want to keep in mind situations where things break down
>> (network interrupt, datacenter flooded etc. 😊). With processing time,
>> events count into the time window in which they are processed; with event
>> time, they count into the time window in which they were originally
>> created at the source … even if processed much later …
>>
>>
>>
>> Thias
>>
>>
>>
>>
>>
>>
>>
>> *From:* John Smith <java.dev....@gmail.com>
>> *Sent:* Freitag, 26. November 2021 02:55
>> *To:* Schwalbe Matthias <matthias.schwa...@viseca.ch>
>> *Cc:* Caizhi Weng <tsreape...@gmail.com>; user <user@flink.apache.org>
>> *Subject:* Re: Windows and data loss.
>>
>>
>>
>> Well, what I'm thinking for 100% accuracy and no data loss is just to
>> base the count on processing time. So whatever arrives in that window is
>> counted. If I get some events of the "current" window late and they go
>> into another window, it's OK.
>>
>> My pipeline is like so....
>>
>> browser(user)----->REST API------>log file------>Filebeat------>Kafka (18
>> partitions)----->flink----->destination
>>
>> Filebeat inserts into Kafka; it's kind of a big bucket of "logs", which I
>> use Flink to filter for the specific app and do the counts. The logs are
>> round-robined into the topic/partitions. Where I foresee a delay is if
>> Filebeat can't push fast enough into Kafka AND/OR the Flink consumer has
>> not read all events for that window from all partitions.
>>
>>
>>
>> On Thu, 25 Nov 2021 at 11:28, Schwalbe Matthias <
>> matthias.schwa...@viseca.ch> wrote:
>>
>> Hi John,
>>
>>
>>
>> … just a short hint:
>>
>> With datastream API you can
>>
>>    - hand-craft a trigger that decides when and how often to emit
>>    intermediate, punctual and late window results, and when to evict the
>>    window and stop processing late events
>>    - in order to process late events you also need to specify for how
>>    long you will extend the window processing (or is that done in the
>>    trigger … I don’t remember right now)
>>    - overall window state grows if you extend window processing beyond
>>    when the window would otherwise be finished …
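The points above can be sketched in the DataStream API as follows. This is a sketch only: `events`, `MyEvent`, `getAppId()` and `CountAggregate` are hypothetical placeholders, while `allowedLateness` and `sideOutputLateData` are the real Flink calls for extending window processing and capturing events that are still too late.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

// Keep each 5-minute window's state around for 5 extra minutes of lateness;
// anything later than that is diverted to a side output instead of vanishing.
final OutputTag<MyEvent> lateTag = new OutputTag<MyEvent>("too-late") {};

SingleOutputStreamOperator<Long> counts = events
        .keyBy(e -> e.getAppId())                             // hypothetical key selector
        .window(TumblingEventTimeWindows.of(Time.minutes(5)))
        .allowedLateness(Time.minutes(5))                     // window state retained longer
        .sideOutputLateData(lateTag)                          // capture instead of drop
        .aggregate(new CountAggregate());                     // hypothetical AggregateFunction

DataStream<MyEvent> tooLate = counts.getSideOutput(lateTag);  // the events that missed the bus
```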
>>
>>
>>
>> Hope this helps 😊
>>
>>
>>
>> Thias
>>
>>
>>
>> *From:* Caizhi Weng <tsreape...@gmail.com>
>> *Sent:* Donnerstag, 25. November 2021 02:56
>> *To:* John Smith <java.dev....@gmail.com>
>> *Cc:* user <user@flink.apache.org>
>> *Subject:* Re: Windows and data loss.
>>
>>
>>
>> Hi!
>>
>>
>>
>> Are you using the datastream API or the table / SQL API? I don't know if
>> datastream API has this functionality, but in table / SQL API we have the
>> following configurations [1].
>>
>>    - table.exec.emit.late-fire.enabled: Emit window results for late
>>    records;
>>    - table.exec.emit.late-fire.delay: How often shall we emit results
>>    for late records (for example, once per 10 minutes or for every record).
>>
>>
>>
>> [1]
>> https://github.com/apache/flink/blob/601ef3b3bce040264daa3aedcb9d98ead8303485/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/plan/utils/WindowEmitStrategy.scala#L214
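For reference, these options can also be set programmatically; a sketch, with the option names as referenced in [1]:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Emit updated window results for late records, at most once per 10 minutes.
TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
tEnv.getConfig().getConfiguration().setString("table.exec.emit.late-fire.enabled", "true");
tEnv.getConfig().getConfiguration().setString("table.exec.emit.late-fire.delay", "10 min");
```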
>>
>>
>>
>> John Smith <java.dev....@gmail.com> 于2021年11月25日周四 上午12:45写道:
>>
>> Hi, I understand that when using windows, with the watermark and lateness
>> configs set, if an event comes late it is lost, though we can output it
>> to a side output.
>>
>> But wondering is there a way to do it without the loss?
>>
>> I'm guessing an "all" window with a custom trigger that just fires every
>> X period, and whatever is in that bucket is in that bucket?
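The processing-time variant of that idea would be a sketch like the following (hypothetical names: `events`, `getAppId()`, `CountAggregate`); nothing is ever "late", but events delayed in transit simply land in a later window.

```java
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Count whatever arrives during each 5-minute wall-clock window.
events
        .keyBy(e -> e.getAppId())
        .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
        .aggregate(new CountAggregate());
```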
>>
>> Diese Nachricht ist ausschliesslich für den Adressaten bestimmt und
>> beinhaltet unter Umständen vertrauliche Mitteilungen. Da die
>> Vertraulichkeit von e-Mail-Nachrichten nicht gewährleistet werden kann,
>> übernehmen wir keine Haftung für die Gewährung der Vertraulichkeit und
>> Unversehrtheit dieser Mitteilung. Bei irrtümlicher Zustellung bitten wir
>> Sie um Benachrichtigung per e-Mail und um Löschung dieser Nachricht sowie
>> eventueller Anhänge. Jegliche unberechtigte Verwendung oder Verbreitung
>> dieser Informationen ist streng verboten.
>>
>> This message is intended only for the named recipient and may contain
>> confidential or privileged information. As the confidentiality of email
>> communication cannot be guaranteed, we do not accept any responsibility for
>> the confidentiality and the intactness of this message. If you have
>> received it in error, please advise the sender by return e-mail and delete
>> this message and any attachments. Any unauthorised use or dissemination of
>> this information is strictly prohibited.
>>
>>
>
