Hi Robert,

Uncaught exceptions that cause the job to fall into a fail-and-restart loop are similar to the corrupt record case I mentioned.
With exactly-once guarantees, the job will roll back to the last complete checkpoint, which "resets" the Flink consumer to some earlier Kafka partition offset. Eventually, that failing record will be processed again.

Currently there is no way to manipulate the "reset" offset on restore from failure. It is strictly reset to the offsets stored in the last complete checkpoint; otherwise, exactly-once would be violated.

Rob wrote:
> Or maybe the recipe is to manually retrieve the record at
> partitionX/offsetY for the group and then restart?

This would not work, as exactly-once is achieved with the offsets that Flink stores in its checkpoints, not the offsets that are committed back to Kafka.
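Just to illustrate how the offsets relate to checkpointing, here is a minimal sketch (not your exact setup; the topic name, group id, and broker address are placeholders, and the exact consumer class name depends on the Kafka connector version you are using):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class ExactlyOnceKafkaSource {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoints are what make the Kafka offsets recoverable: on failure the
        // job rolls back to the offsets stored in the last completed checkpoint.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.setProperty("group.id", "my-consumer-group");       // placeholder group id

        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props);

        // Start positions like this one only apply to a fresh start without any
        // checkpoint/savepoint state. On restore, the consumer always resumes from
        // the checkpointed offsets, not from the offsets committed back to Kafka
        // for the consumer group.
        consumer.setStartFromGroupOffsets();

        env.addSource(consumer)
           .print();

        env.execute("Kafka exactly-once offset sketch");
    }
}

The point being: any start-position configuration on the consumer only takes effect on a fresh start; once the job restores from a checkpoint or savepoint, the checkpointed offsets win.

Cheers,
Gordon

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/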