Thanks, Andrew. We considered this solution too. Unfortunately, we do not have permission to generate artificial Kafka events in our ecosystem.
Dario,

Thanks for your inputs. We will give your design a try. Due to the number of events being processed per window, we are using an incremental aggregate function:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#processwindowfunction-with-incremental-aggregation

Do you think we can use KeyedCoProcessFunction in this design?

Thanks,
Shilpa

On Mon, May 9, 2022 at 9:31 AM Dario Heinisch <dario.heini...@gmail.com> wrote:

> It depends on the use case; in Shilpa's case it is about users, so the
> user ids are probably known beforehand.
>
> https://dpaste.org/cRe3G <= This is an example without a window, but
> essentially, Shilpa, you would be re-registering the timers every time
> they fire. You would also have to ingest the user ids beforehand into
> your pipeline, so that a user who never has any event still gets a
> notification. So on startup, probably ingest the user ids with a single
> source from the DB.
>
> My example is pretty minimal, but the idea in your case stays the same:
>
> - key by user
> - have a co-process function to init the state with the user ids
> - re-register the timers every time they fire
> - use `env.getConfig().setAutoWatermarkInterval(1000)` to move the event
>   time forward even if there is no data coming in (this is what you are
>   probably looking for!!)
> - then collect an Optional/custom struct/null or so, depending on whether
>   data is present or not
> - then you can check whether the timer fired because there was data or
>   because there wasn't
>
> Best regards,
>
> Dario
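A minimal sketch of the design Dario describes above, using the KeyedCoProcessFunction Shilpa asks about. The class name, the Tuple2<String, Double> event shape (f0 = user id, f1 = value), the String alert output, and the 20-minute interval are illustrative assumptions, not code from the thread:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Keyed by user id. Input 1 is the one-off stream of user ids from the DB;
// input 2 is the Kafka event stream (f0 = user id, f1 = measured value).
public class ZeroEventAlerter
        extends KeyedCoProcessFunction<String, String, Tuple2<String, Double>, String> {

    private static final long INTERVAL_MS = 20 * 60 * 1000L; // 20-minute interval (assumed)

    private transient ValueState<Long> eventCount; // events seen in the current interval
    private transient ValueState<Long> nextTimer;  // timestamp of the registered timer

    @Override
    public void open(Configuration parameters) {
        eventCount = getRuntimeContext().getState(
                new ValueStateDescriptor<>("eventCount", Long.class));
        nextTimer = getRuntimeContext().getState(
                new ValueStateDescriptor<>("nextTimer", Long.class));
    }

    // Ingesting the user ids up front creates state and a first timer per key,
    // so a user who never produces an event still gets a notification.
    @Override
    public void processElement1(String userId, Context ctx, Collector<String> out) throws Exception {
        registerTimerIfAbsent(ctx);
    }

    @Override
    public void processElement2(Tuple2<String, Double> event, Context ctx, Collector<String> out) throws Exception {
        Long count = eventCount.value();
        eventCount.update(count == null ? 1L : count + 1L);
        registerTimerIfAbsent(ctx);
    }

    // Fires once per interval per key; re-register every time, as Dario suggests.
    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        Long count = eventCount.value();
        if (count == null || count == 0L) {
            out.collect("0 events for user " + ctx.getCurrentKey()
                    + " in interval ending at " + timestamp);
        }
        eventCount.update(0L);
        nextTimer.clear();
        registerTimerIfAbsent(ctx);
    }

    private void registerTimerIfAbsent(Context ctx) throws Exception {
        if (nextTimer.value() == null) {
            // Before the first watermark, currentWatermark() is Long.MIN_VALUE,
            // so the first timer fires with the first real watermark; a
            // production job would seed this more carefully.
            long ts = ctx.timerService().currentWatermark() + INTERVAL_MS;
            ctx.timerService().registerEventTimeTimer(ts);
            nextTimer.update(ts);
        }
    }
}

Wiring would look roughly like userIds.connect(events).keyBy(id -> id, e -> e.f0).process(new ZeroEventAlerter()). For the event-time timers to fire while the topic is quiet, the watermarks still have to advance without data, which is the point of Dario's setAutoWatermarkInterval remark (WatermarkStrategy#withIdleness is another knob worth looking at here).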
> On 09.05.22 15:19, Andrew Otto wrote:
>
>> This sounds similar to a non-streaming problem we had at WMF. We ingest
>> all event data from Kafka into HDFS/Hive and partition the Hive tables
>> into hourly directories. If there are no events in a Kafka topic for a
>> given hour, we have no way of knowing whether the hour has been ingested
>> successfully. For all we know, the upstream producer pipeline might be
>> broken.
>>
>> We solved this by emitting artificial 'canary' events into each topic
>> multiple times an hour. The canary events producer uses the same code
>> pathways and services that (most of) our normal event producers do. Then,
>> when ingesting into Hive, we filter out the canary events. The ingestion
>> code then always has work to do and can mark an hour as complete, even
>> if it ends up writing no events to it.
>>
>> Perhaps you could do the same? Always emit artificial events, and filter
>> them out in your windowing code? The window should still fire, since it
>> will always have events, even if you don't use them.
>>
>> On Mon, May 9, 2022 at 8:55 AM Shilpa Shankar <sshan...@bandwidth.com>
>> wrote:
>>
>>> Hello,
>>>
>>> We are building a Flink use case where we consume from a Kafka topic
>>> and perform aggregations, generating alerts based on average, max, and
>>> min thresholds. We also need to notify the users when there are 0 events
>>> in a tumbling event-time window. We are having trouble coming up with a
>>> solution for this. The options we considered are below; please let us
>>> know if there are other ideas we haven't looked into.
>>>
>>> [1] Queryable State: Save the keys in each of the ProcessWindowFunctions,
>>> query the state from an external application, and alert when a key is
>>> missing after the 20 min time interval has expired. We see that the
>>> queryable state feature is being deprecated; we do not want to go down
>>> this path when we already know there is an EOL for it.
>>>
>>> [2] Use processing-time windows: Using processing time instead of event
>>> time would have been an option if our downstream applications sent out
>>> events in real time. Maintenance of the downstream applications, delays,
>>> etc. would result in a lot of data loss, which is undesirable.
>>>
>>> Flink version: 1.14.3
>>>
>>> Thanks,
>>> Shilpa
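For reference, the incremental-aggregation pattern Shilpa links at the top of the thread (an AggregateFunction chained with a ProcessWindowFunction) looks roughly like the sketch below, adapted from that docs page; the per-user average and the String output are illustrative assumptions:

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Incrementally maintains (sum, count) per window, so only the accumulator
// is kept in state rather than every event; this is what matters at high
// per-window event volumes.
class AverageAggregate
        implements AggregateFunction<Tuple2<String, Double>, Tuple2<Double, Long>, Double> {

    @Override
    public Tuple2<Double, Long> createAccumulator() {
        return Tuple2.of(0.0, 0L);
    }

    @Override
    public Tuple2<Double, Long> add(Tuple2<String, Double> event, Tuple2<Double, Long> acc) {
        return Tuple2.of(acc.f0 + event.f1, acc.f1 + 1L);
    }

    @Override
    public Double getResult(Tuple2<Double, Long> acc) {
        return acc.f0 / acc.f1;
    }

    @Override
    public Tuple2<Double, Long> merge(Tuple2<Double, Long> a, Tuple2<Double, Long> b) {
        return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
    }
}

// Receives only the single pre-aggregated average plus window metadata.
class AlertOnAverage
        extends ProcessWindowFunction<Double, String, String, TimeWindow> {

    @Override
    public void process(String userId, Context ctx, Iterable<Double> averages, Collector<String> out) {
        double avg = averages.iterator().next(); // exactly one value per window
        out.collect("user=" + userId + ", window=" + ctx.window() + ", avg=" + avg);
    }
}

// Usage, assuming a DataStream<Tuple2<String, Double>> named events:
//
//   events.keyBy(e -> e.f0)
//         .window(TumblingEventTimeWindows.of(Time.minutes(20)))
//         .aggregate(new AverageAggregate(), new AlertOnAverage());

Note that a window that receives zero events is never created and never fires, so this path alone cannot produce the 0-event notification; the timer-based co-process design is what covers the empty case. If the two are combined, the same (sum, count) accumulator could in principle be kept in the KeyedCoProcessFunction's keyed state instead of a window.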