Hello,

in the last few days I have been trying to build my first real-time
analytics job in Flink. The approach is kappa-architecture-like: I have my
raw data on Kafka, where we receive a message for every change of state of
any entity.

So the messages are of the form

(id, newStatus, timestamp)
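
In code terms, I think of each message roughly as this simple POJO (class
and field names are just illustrative):

    // illustrative event type; the real class/field names in my job may differ
    public class StatusEvent {
        public String id;         // entity id
        public String newStatus;  // status the entity moved into
        public long timestamp;    // event time (epoch millis)

        public StatusEvent() {}   // no-arg constructor for Flink's POJO serializer
    }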

We want to compute, for every time window, the count of items in a given
status. So the output should be of the form

(outputTimestamp, state1:count1, state2:count2, ...)

or equivalent. At any given time, these rows should contain the count of
the items in each status, where the status associated with an id is the one
carried by the most recent message observed for that id. The status of an
id should be counted in any case, even if its latest event is much older
than the ones currently being processed, so the sum of all the counts
should equal the number of distinct ids observed in the system. A later
step could be forgetting about items in a final state after a while, but
this is not a strict requirement right now.
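
To make the counting rule concrete, here is a small made-up example (ids,
statuses and timestamps are invented):

    input events:
      (A, CREATED, 10)
      (B, CREATED, 12)
      (A, SHIPPED, 95)

    expected row for the window ending at t = 100:
      (100, CREATED:1, SHIPPED:1)

    i.e. A is counted once, under its latest status (SHIPPED), and B still
    counts as CREATED even though its last event is far in the past.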

This will be written to Elasticsearch and then queried.

I tried many different paths and none of them completely satisfied the
requirement. Using a sliding window I could easily achieve the expected
behaviour, except that once the beginning of the sliding window moved past
the timestamp of an event, that event was lost for the count, as you may
expect. Other approaches failed to be consistent when working with a
backlog, because I did some tricks with keys and timestamps that broke
when the data was processed all at once.
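
For reference, the sliding-window attempt looked roughly like this (heavily
simplified; StatusEvent is the type sketched above, and the Kafka source
and the Elasticsearch sink are omitted):

    // simplified sketch of the sliding-window attempt
    DataStream<StatusEvent> events = ...; // Kafka source, with event-time timestamps/watermarks assigned

    events
        .keyBy(e -> e.newStatus)  // one count per status
        .window(SlidingEventTimeWindows.of(Time.hours(1), Time.minutes(5)))
        .apply(new WindowFunction<StatusEvent, Tuple3<Long, String, Long>, String, TimeWindow>() {
            @Override
            public void apply(String status, TimeWindow window,
                              Iterable<StatusEvent> elements,
                              Collector<Tuple3<Long, String, Long>> out) {
                // count the distinct ids seen with this status inside the window
                Set<String> ids = new HashSet<>();
                for (StatusEvent e : elements) {
                    ids.add(e.id);
                }
                out.collect(Tuple3.of(window.getEnd(), status, (long) ids.size()));
            }
        });
    // problem: as soon as the window start moves past an id's last event,
    // that id silently disappears from the count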

So I would like to know, even at a high level, how I should approach this
problem. It looks like a relatively common use case, but the fact that the
relevant information for a given id must be retained indefinitely to count
the entities correctly creates a lot of problems.

Thank you in advance,

Simone
