Hi Flink Community,

I have a few questions regarding the new KafkaSource and event time, which I 
wasn't able to answer myself via checking the docs, but please point me to the 
right pages in case I missed something. I'm not entirely whether my knowledge 
entirely holds for the new KafkaSource, because it is rather new to me. Please 
correct me, when I'm making false assumptions.

I make the following assumptions, which you can hopefully confirm.

a1) The KafkaSource by default uses the Kafka timestamp to extract watermarks
a2) There is a mechanism in place to synchronise the consumption from all Kafka 
partitions based on the event time / watermarks (important in case we are 
facing a noticeable consumer lag)

Now lets assume the following scenario

s1) The Kafka timestamp doesn't contain the event time, but only the timestamp 
when the record was written to the topic
s2) The event timestamp is buried deep in the message payload

My question is now what is best practice to extract the event time timestamp? I 
see the following options to handle this problem.

o1) Implement the parsing of the message in the Deserializer and directly 
extract the event time timestamp. This comes with the drawback putting a 
complex parsing logic into the Deserializer.

o2) Let the producer set the Kafka timestamp to the actual event time. This 
comes with the following drawbacks:
  - The Kafka write timestamp is lost
  - The producer (which is in our case only forwarding messages) needs to 
extract the timestamp on its own from the message payload

o3) Leave the Kafka timestamp as it is and create Watermarks downstream in the 
pipeline after the parsing. This comes with the following drawbacks:
  - Scenarios where event time and Kafka timestamp are drifting apart are 
possible and therefor the synchronisation mechanism described in a2 isn't 
really guaranteed to work properly.

Currently we are using o3 and I'm investigating, whether o1 and o2 are better 
alternatives, which doesn't involve sacrificing a2. Can you share any best 
practices here? Is it recommended to always use the Kafka timestamp to hold the 
event time? Are there any other points I missed, which makes one of the options 
preferable?

Thank you for taking the time.

Kind Regards,
Niklas

Attachment: smime.p7s
Description: S/MIME cryptographic signature

Reply via email to