I am using flink checkpointing to restore states of my job. I am using unaligned checkpointing with 100 ms as the checkpointing interval. I see few events getting dropped that were sucessfully processed by the operators or were in-flight that were yet to be captured by checkpoint. That is these were new events which came into the pipeline between the previously captured checkpoint state and the failure. <https://stackoverflow.com/posts/77955164/timeline>
My project acknowledges(commits) back to the topic after the event read and mongo ingestion. But the pipeline has transformation, enrichment and sink operators after that. These missing events were read, ack'd back to the topic and transformed successfully before failure and were not yet checkpointed (withing the 100 ms interval between checkpoints) were dropped. Pipeline: Source (solace topic, queue reader) --> [MongoWrite + sourcecommit] --> transform --> enrich --> sink (solace topic) Checkpoint: Unaligned, Exactly-once, 100ms interval, 10ms Min pause between checkpoint 1. I see that whenever runtime exception is thrown it triggers the close method in each of the functions one by one. Do we have to store the states which were not yet captured by the checkpoint before failure? What happens Network failures or task manager crash or any other abrupt failure? 2. Or do we have to shift the source topic acknowledgment to the last ( but we will have to chain all this operators to run in a single thread and carry the bytearray message object from solace queue to do ack at the end). Is there anything else Iam missing here? Note: Sources and Sinks are fully Solace based in all the Flink pipelines ( queues and topics)