Flink checkpointing - exactly once guaranteed understanding

Kartik Kushwaha Fri, 09 Feb 2024 05:33:47 -0800

I am using Flink checkpointing to restore the state of my job, with
unaligned checkpoints and a 100 ms checkpointing interval. I see a few
events getting dropped that were either successfully processed by the
operators or still in flight, and had not yet been captured by a checkpoint.
That is, these were new events that entered the pipeline between the last
captured checkpoint and the failure.
<https://stackoverflow.com/posts/77955164/timeline>


My project acknowledges (commits) back to the topic after the event is read
and ingested into Mongo, but the pipeline has transformation, enrichment and
sink operators after that. The missing events were read, ack'd back to the
topic and transformed successfully before the failure, but had not yet been
checkpointed (they fell within the 100 ms interval between checkpoints), so
they were dropped.

Pipeline: Source (solace topic, queue reader) --> [MongoWrite +
sourcecommit] --> transform --> enrich --> sink (solace topic)

Checkpoint: Unaligned, Exactly-once, 100 ms interval, 10 ms min pause between
checkpoints

   1. I see that whenever a runtime exception is thrown, it triggers the close
      method in each of the functions one by one. Do we have to store the
      state that was not yet captured by a checkpoint before the failure? What
      happens on network failures, a task manager crash or any other abrupt
      failure?

   2. Or do we have to move the source topic acknowledgment to the end? (We
      would then have to chain all these operators to run in a single thread
      and carry the bytearray message object from the Solace queue through
      the pipeline in order to ack it at the end.)
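
To make the difference between the two ack placements concrete, here is a
minimal plain-Java simulation. It is only a sketch under stated assumptions:
it uses no Flink or Solace APIs, every name in it (AckTimingSketch,
lostEvents, and the parameters) is hypothetical, and it assumes the broker
redelivers every event that was never ack'd and that a restart rolls the job
back to the last completed checkpoint.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation only: models a broker, a checkpoint every N events,
// and one crash. It does not use Flink or Solace classes.
class AckTimingSketch {

    // Returns the events that are never covered by a completed checkpoint.
    // ackOnRead = true  : ack as soon as the event is read (current setup).
    // ackOnRead = false : hold the ack until the next checkpoint completes.
    static List<Integer> lostEvents(boolean ackOnRead, int total,
                                    int checkpointEvery, int crashAfter) {
        Set<Integer> acked = new HashSet<>();    // broker will not redeliver these
        Set<Integer> durable = new HashSet<>();  // covered by a completed checkpoint
        List<Integer> sinceCheckpoint = new ArrayList<>();

        // First run: read events until the crash.
        for (int e = 1; e <= crashAfter; e++) {
            sinceCheckpoint.add(e);
            if (ackOnRead) {
                acked.add(e);
            }
            if (e % checkpointEvery == 0) {
                // Checkpoint completes: everything read so far is now durable.
                durable.addAll(sinceCheckpoint);
                if (!ackOnRead) {
                    acked.addAll(sinceCheckpoint);
                }
                sinceCheckpoint.clear();
            }
        }

        // Crash: work in sinceCheckpoint is rolled back and simply discarded.
        // Restart: the broker redelivers every un-ack'd event and the job
        // is assumed to run to the end without another failure.
        for (int e = 1; e <= total; e++) {
            if (!acked.contains(e)) {
                durable.add(e);
            }
        }

        List<Integer> lost = new ArrayList<>();
        for (int e = 1; e <= total; e++) {
            if (!durable.contains(e)) {
                lost.add(e);
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        // 10 events, checkpoint every 4 events, crash after reading 6.
        System.out.println("ack on read:       lost " + lostEvents(true, 10, 4, 6));
        System.out.println("ack on checkpoint: lost " + lostEvents(false, 10, 4, 6));
    }
}
```

With these numbers the first line reports events 5 and 6 as lost (ack'd but
rolled back), and the second reports none. In a real job the deferred ack
would presumably be issued from a checkpoint-completion callback such as
Flink's CheckpointListener.notifyCheckpointComplete; whether the Solace
source in use supports that is something I have not verified.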

Is there anything else I am missing here?


Note: Sources and sinks are fully Solace-based in all the Flink pipelines
(queues and topics).
