Hi,

I have a use case where we are reading events from a Kinesis stream. An
event looks like this:
Event {
  event_id,
  transaction_id,
  key1,
  key2,
  value,
  timestamp,
  some other fields...
}

We want to aggregate the values per key over all events seen so far (as
simple as "select key1, key2, sum(value) from events group by key1,
key2"). For this I have created a simple Flink job which uses the
flink-kinesis connector and applies keyBy() and sum() on the incoming
events. I am facing two challenges here:

1) The incoming events can contain duplicates. How can I maintain
exactly-once processing here, given that processing duplicate events
would give erroneous results? The transaction_id field is unique for
each event; if two events have the same transaction_id, we can say they
are duplicates. (By duplicates I don't just mean retried events: the
same message can be present in Kinesis with a different sequence number.
I am not sure whether the flink-kinesis connector can handle that, as it
uses the KCL underneath, which I assume doesn't take care of it.)
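
One approach I was considering for this is to keyBy(transaction_id) and
keep a seen-flag in keyed state (e.g. a ValueState inside a
KeyedProcessFunction), dropping repeats before the aggregation. A
plain-Python sketch of just that de-duplication logic (illustrative
only, the Flink wiring and state backend are omitted):

```python
def dedupe(events):
    """Drop any event whose transaction_id was already processed,
    mimicking a seen-flag kept in Flink keyed state."""
    seen = set()  # stands in for per-key state keyed by transaction_id
    for e in events:
        tid = e["transaction_id"]
        if tid in seen:
            continue  # duplicate: retried or re-published under a new sequence number
        seen.add(tid)
        yield e
```

Would this be a reasonable way to do it, keeping in mind that the state
holding the seen transaction_ids grows without bound unless a TTL is set?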

2) There can be cases where the value for a key has been updated after
the event was processed, and we may want to reprocess those events with
the new value. Since this is just a value change, the transaction_id
will be the same, so the idempotency logic will not allow the event to
be reprocessed. What are the ways to handle such scenarios in Flink?
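
One idea I had for combining (1) and (2): instead of dropping repeats
outright, remember the last processed value per transaction_id and
forward only the delta (new value minus old) into the sum. An unchanged
resend then contributes 0 and an updated value corrects the aggregate.
A plain-Python sketch of that logic (illustrative only, not actual
Flink code):

```python
def deltas(events):
    """For each event, emit the increment to apply to the (key1, key2)
    sum: the full value the first time a transaction_id is seen, and
    only the difference when it arrives again with an updated value."""
    last = {}  # transaction_id -> last processed value (keyed state in Flink)
    for e in events:
        tid = e["transaction_id"]
        delta = e["value"] - last.get(tid, 0)
        last[tid] = e["value"]
        if delta != 0:  # exact duplicates produce delta 0 and are skipped
            yield (e["key1"], e["key2"], delta)
```

Feeding these deltas into the keyed sum should keep it correct for both
exact duplicates and updated values, but I am not sure if this is the
idiomatic way to do it in Flink.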

Thanks
Pooja


-- 
Warm Regards,
Pooja Agrawal
