Hello,

Seeking some expertise on exception handling with checkpointing and streaming. Say some bad data flows into your Flink application and causes an exception you are not expecting. That exception bubbles up and kills the respective task, and the app cannot make progress. Eventually the topology restarts (if configured to do so) from the previous successful checkpoint/savepoint, hits the same broken message again, and ends up in a loop.
If we don't know how to process a given message, we would like our topology to keep progressing and sink that message into some sort of dead-letter Kafka topic. We have seen recommendations to use Side Outputs<https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/datastream/side_output/> for that, but it looks like things can easily get messy: we would need to wrap all our operators in try-catch blocks and emit to a side output from within the catch, then aggregate all those side outputs and decide what to do with them. If we want to output exactly the inbound message that originated the exception, that requires some extra logic as well, since our operators have different output types. On top of that, it looks like only a limited set of operators supports side outputs. See also: https://stackoverflow.com/questions/52411705/flink-whats-the-best-way-to-handle-exceptions-inside-flink-jobs

Is there a better way to do this? We would like to avoid our topology getting stuck because one message triggers some unpredicted exception, and we would also like the possibility to replay that message once we put a fix in place, hence the dead-letter topic idea.

Regards,
José Brandão
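P.S. For concreteness, here is a minimal, Flink-agnostic sketch of the try-catch routing we have in mind. The names `Routed`, `Step` and `process` are made up for illustration; in an actual Flink job this logic would live inside a `ProcessFunction`, with the catch branch calling `ctx.output(...)` on an `OutputTag` and `getSideOutput(...)` feeding a Kafka dead-letter sink. The key point is that the catch branch keeps the original inbound message, so it can be replayed later regardless of the operator's output type.

```java
import java.util.ArrayList;
import java.util.List;

public class DeadLetterSketch {

    // Result of a processing step: good records out, raw failures aside.
    static final class Routed<OUT> {
        final List<OUT> output = new ArrayList<>();
        final List<String> deadLetters = new ArrayList<>();
    }

    // A per-record transformation that may throw on bad data.
    @FunctionalInterface
    interface Step<OUT> {
        OUT apply(String raw) throws Exception;
    }

    // Wrap any per-record step so an unexpected exception diverts the
    // raw message to the dead-letter list instead of failing the pipeline.
    static <OUT> Routed<OUT> process(List<String> inbound, Step<OUT> step) {
        Routed<OUT> routed = new Routed<>();
        for (String raw : inbound) {
            try {
                routed.output.add(step.apply(raw));
            } catch (Exception e) {
                routed.deadLetters.add(raw); // keep the original message
            }
        }
        return routed;
    }

    public static void main(String[] args) {
        Routed<Integer> r = process(List.of("1", "oops", "3"), Integer::parseInt);
        System.out.println(r.output);      // [1, 3]
        System.out.println(r.deadLetters); // [oops]
    }
}
```

The downside, as described above, is that every operator needs this wrapping and its own tag, which is exactly the boilerplate we are hoping to avoid.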