Hi Morgan, as Timo pointed out, there is no general solution, but in your setting, you could look at the consumer lag of the input topic after a crash. Lag would spike until all tasks restarted and reprocessing begins. Offsets are only committed on checkpoints though by default.
Best, Arvid On Tue, Feb 4, 2020 at 12:32 PM Timo Walther <twal...@apache.org> wrote: > Hi Morgan, > > as far as I know this is not possible mostly because measuring "till the > point when the system catches up to the last message" is very > pipeline/connector dependent. Some sources might need to read from the > very beginning, some just continue from the latest checkpointed offset. > > Measure things like that (e.g. for experiments) might require collecting > own metrics as part of your pipeline definition. > > Regards, > Timo > > > On 03.02.20 12:20, Morgan Geldenhuys wrote: > > Community, > > > > I am interested in determining the total time to recover for a Flink > > application after experiencing a partial failure. Let's assume a > > pipeline consisting of Kafka -> Flink -> Kafka with Exactly-Once > > guarantees enabled. > > > > Taking a look at the documentation > > ( > https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/metrics.html), > > > one of the metrics which can be gathered is /recoveryTime/. However, as > > far as I can tell this is only the time taken for the system to go from > > an inconsistent state back into a consistent state, i.e. restarting the > > job. Is there any way of measuring the amount of time taken from the > > point when the failure occurred till the point when the system catches > > up to the last message that was processed before the outage? > > > > Thank you very much in advance! > > > > Regards, > > Morgan. > >