Re: Question: Determining Total Recovery Time

Timo Walther Tue, 04 Feb 2020 03:32:17 -0800

Hi Morgan,

As far as I know, this is not possible, mostly because measuring "till the point when the system catches up to the last message" is very pipeline- and connector-dependent. Some sources might need to read from the very beginning; others just continue from the latest checkpointed offset.

Measuring things like that (e.g. for experiments) might require collecting your own metrics as part of your pipeline definition.
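
A minimal sketch of what such a self-collected metric could look like, in plain Java with no Flink dependency. Everything here is an assumption for illustration: the class name CatchUpTracker is hypothetical, the failure timestamp and the last offset processed before the outage are assumed to be recorded externally by the experiment harness, and a single partition with monotonically increasing offsets is assumed. In a real job the onRecord call would sit inside one of your operators, and the result could be exposed through a custom metric.

```java
import java.util.OptionalLong;

/**
 * Hypothetical helper (not part of Flink): measures the time from a known
 * failure until the pipeline has re-processed the last offset that was
 * handled before the outage.
 */
final class CatchUpTracker {
    private final long failureTimeMillis; // assumed: recorded by the experiment harness
    private final long targetOffset;      // assumed: last offset processed before the failure
    private long caughtUpAtMillis = -1L;  // -1 means "not caught up yet"

    CatchUpTracker(long failureTimeMillis, long targetOffset) {
        this.failureTimeMillis = failureTimeMillis;
        this.targetOffset = targetOffset;
    }

    /** Call once per processed record with its offset and the current wall-clock time. */
    void onRecord(long offset, long nowMillis) {
        // Only the first record at or past the target marks the catch-up point.
        if (caughtUpAtMillis < 0 && offset >= targetOffset) {
            caughtUpAtMillis = nowMillis;
        }
    }

    /** Empty until the target offset has been seen; afterwards the total recovery time. */
    OptionalLong totalRecoveryMillis() {
        return caughtUpAtMillis < 0
                ? OptionalLong.empty()
                : OptionalLong.of(caughtUpAtMillis - failureTimeMillis);
    }
}
```

With several partitions you would likely need one tracker per partition and take the maximum, which is part of why this is so connector-dependent.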


Regards,
Timo


On 03.02.20 12:20, Morgan Geldenhuys wrote:

Community,
I am interested in determining the total time to recover for a Flink application after experiencing a partial failure. Let's assume a pipeline consisting of Kafka -> Flink -> Kafka with Exactly-Once guarantees enabled.
Taking a look at the documentation (https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/metrics.html), one of the metrics which can be gathered is /recoveryTime/. However, as far as I can tell, this is only the time taken for the system to go from an inconsistent state back into a consistent state, i.e. restarting the job. Is there any way of measuring the amount of time taken from the point when the failure occurred till the point when the system catches up to the last message that was processed before the outage?
Thank you very much in advance!

Regards,
Morgan.
