Hi Morgan,
as far as I know, this is not possible, mostly because measuring "till
the point when the system catches up to the last message" is very
pipeline/connector dependent. Some sources might need to read from the
very beginning, while others just continue from the latest checkpointed
offset.
Measuring things like that (e.g. for experiments) might require
collecting your own metrics as part of your pipeline definition.
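
As a rough illustration of what such a custom measurement could look like,
here is a minimal sketch in plain Python, not Flink API code. It assumes
your pipeline emits (wall-clock time, source offset) samples for a single
partition and that you know the failure time from elsewhere; the names
Observation and time_to_catch_up are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Observation:
    """One sample from a hypothetical custom metric in the pipeline."""
    wall_clock_ms: int  # when the record was processed
    offset: int         # source offset of that record (single partition assumed)


def time_to_catch_up(observations: Iterable[Observation],
                     failure_ms: int) -> Optional[int]:
    """Milliseconds from the failure until the pipeline again reaches the
    highest offset it had processed before the failure, or None if the
    samples never show it catching up."""
    last_before_failure: Optional[int] = None
    for obs in sorted(observations, key=lambda o: o.wall_clock_ms):
        if obs.wall_clock_ms < failure_ms:
            # Track the highest offset processed before the outage.
            if last_before_failure is None or obs.offset > last_before_failure:
                last_before_failure = obs.offset
        elif last_before_failure is not None and obs.offset >= last_before_failure:
            # First sample after the failure that is back at (or past) that offset.
            return obs.wall_clock_ms - failure_ms
    return None


samples = [
    Observation(1000, 10), Observation(2000, 20), Observation(3000, 30),
    # Failure at 3500 ms; the job restarts from an older checkpointed offset.
    Observation(6000, 15), Observation(7000, 25), Observation(8000, 30),
    Observation(9000, 40),
]
print(time_to_catch_up(samples, failure_ms=3500))
```

How the samples are collected (a custom metric, a side output, log lines)
depends on the pipeline, which is why this has to be defined per job.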
Regards,
Timo
On 03.02.20 12:20, Morgan Geldenhuys wrote:
Community,
I am interested in determining the total time to recover for a Flink
application after experiencing a partial failure. Let's assume a
pipeline consisting of Kafka -> Flink -> Kafka with Exactly-Once
guarantees enabled.
Taking a look at the documentation
(https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/metrics.html),
one of the metrics which can be gathered is /recoveryTime/. However, as
far as I can tell this is only the time taken for the system to go from
an inconsistent state back into a consistent state, i.e. restarting the
job. Is there any way of measuring the amount of time taken from the
point when the failure occurred till the point when the system catches
up to the last message that was processed before the outage?
Thank you very much in advance!
Regards,
Morgan.