Re: Question: Determining Total Recovery Time

Arvid Heise Mon, 10 Feb 2020 23:58:13 -0800

Hi Morgan,

as Timo pointed out, there is no general solution, but in your setting, you
could look at the consumer lag of the input topic after a crash. Lag would
spike until all tasks restarted and reprocessing begins. Offsets are only
committed on checkpoints though by default.


Best,

Arvid

On Tue, Feb 4, 2020 at 12:32 PM Timo Walther <twal...@apache.org> wrote:

> Hi Morgan,
>
> as far as I know this is not possible mostly because measuring "till the
> point when the system catches up to the last message" is very
> pipeline/connector dependent. Some sources might need to read from the
> very beginning, some just continue from the latest checkpointed offset.
>
> Measure things like that (e.g. for experiments) might require collecting
> own metrics as part of your pipeline definition.
>
> Regards,
> Timo
>
>
> On 03.02.20 12:20, Morgan Geldenhuys wrote:
> > Community,
> >
> > I am interested in determining the total time to recover for a Flink
> > application after experiencing a partial failure. Let's assume a
> > pipeline consisting of Kafka -> Flink -> Kafka with Exactly-Once
> > guarantees enabled.
> >
> > Taking a look at the documentation
> > (
> https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/metrics.html),
>
> > one of the metrics which can be gathered is /recoveryTime/. However, as
> > far as I can tell this is only the time taken for the system to go from
> > an inconsistent state back into a consistent state, i.e. restarting the
> > job. Is there any way of measuring the amount of time taken from the
> > point when the failure occurred till the point when the system catches
> > up to the last message that was processed before the outage?
> >
> > Thank you very much in advance!
> >
> > Regards,
> > Morgan.
>
>

Re: Question: Determining Total Recovery Time

Reply via email to