Re: Question: Determining Total Recovery Time

Arvid Heise Fri, 21 Feb 2020 05:44:21 -0800

Hi Morgan,

sorry for the late reply. In general, that should work. You need to ensure
that the same task is processing the same record though.


The local copy needs to be kept in state, or else the last message would be
lost upon restart. Performance will take a hit, but whether that is
significant depends on the remaining pipeline.

Btw, at-least-once should be enough for that, since you are implicitly
deduplicating.
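
In case it helps, here is a minimal sketch of the bookkeeping described in
your mail, in plain Python with no Flink dependencies. All names
(RecoveryTimer, on_failure, on_record) are made up for illustration and are
not Flink APIs. In a real job the three fields would have to live in operator
state, and on_failure stands in for whatever rollback/restore hook you end up
using. It also assumes records are distinguishable by their digest.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryTimer:
    """Hypothetical per-sink helper; an illustration, not a Flink API."""

    last_digest: Optional[str] = None   # digest of the last record seen before the failure
    last_seen_ms: Optional[int] = None  # time at which that record was seen
    in_recovery: bool = False           # True between the failure and the first match

    @staticmethod
    def digest(record: bytes) -> str:
        # Store a digest instead of the full record to keep the state small.
        return hashlib.sha256(record).hexdigest()

    def on_failure(self) -> None:
        # Assumed hook: called when the task is told to roll back.
        # Without a recorded marker there is nothing to wait for.
        if self.last_digest is not None:
            self.in_recovery = True

    def on_record(self, record: bytes, now_ms: int) -> Optional[int]:
        """Return the recovery time in ms when the marker record reappears, else None."""
        current = self.digest(record)
        if self.in_recovery:
            if current != self.last_digest:
                return None  # replayed record from before the marker; keep waiting
            elapsed = now_ms - self.last_seen_ms
            self.in_recovery = False
            self.last_seen_ms = now_ms
            return elapsed
        # Normal operation: remember the newest record and when it was seen.
        self.last_digest = current
        self.last_seen_ms = now_ms
        return None
```

Only the first match after a failure produces a value, so duplicates replayed
under at-least-once do not distort the measurement. The value returned by
on_record is what you would report as the custom metric.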

Best,

Arvid

On Tue, Feb 11, 2020 at 11:24 AM Morgan Geldenhuys <
[email protected]> wrote:

> Thanks for the advice, I will look into it.
>
> Had a quick think about another simple solution but we would need a hook
> into the checkpoint process from the task/operator perspective, which I
> haven't looked into yet. It would work like this:
>
> - The sink operators (?) would keep a local copy of the last message
> processed (or digest?), the current timestamp, and a boolean value
> indicating whether the system is in recovery.
> - While not in recovery, update the local copy and timestamp with each new
> event processed.
> - When a failure is detected and the taskmanagers are notified to
> rollback, we use the hook into this process to switch the boolean value to
> true.
> - While true, it compares each new message with the last one processed
> before the recovery process was initiated.
> - When a match is found, the difference between the previous and current
> timestamp is calculated and emitted as a custom metric and the boolean is
> reset to false.
>
> From here, the mean total recovery time could be calculated across the
> operators. Not sure how it would impact performance, but I doubt it
> would be significant. We would need to ensure exactly-once so that the
> message would be guaranteed to be seen again. Thoughts?
>
> On 11.02.20 08:57, Arvid Heise wrote:
>
> Hi Morgan,
>
> as Timo pointed out, there is no general solution, but in your setting,
> you could look at the consumer lag of the input topic after a crash. Lag
> would spike until all tasks have restarted and reprocessing begins. By
> default, though, offsets are only committed on checkpoints.
>
> Best,
>
> Arvid
>
> On Tue, Feb 4, 2020 at 12:32 PM Timo Walther <[email protected]> wrote:
>
>> Hi Morgan,
>>
>> as far as I know this is not possible mostly because measuring "till the
>> point when the system catches up to the last message" is very
>> pipeline/connector dependent. Some sources might need to read from the
>> very beginning, some just continue from the latest checkpointed offset.
>>
>> Measuring things like that (e.g. for experiments) might require collecting
>> your own metrics as part of your pipeline definition.
>>
>> Regards,
>> Timo
>>
>>
>> On 03.02.20 12:20, Morgan Geldenhuys wrote:
>> > Community,
>> >
>> > I am interested in determining the total time to recover for a Flink
>> > application after experiencing a partial failure. Let's assume a
>> > pipeline consisting of Kafka -> Flink -> Kafka with Exactly-Once
>> > guarantees enabled.
>> >
>> > Taking a look at the documentation
>> > (https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/metrics.html),
>> > one of the metrics which can be gathered is /recoveryTime/. However, as
>> > far as I can tell this is only the time taken for the system to go from
>> > an inconsistent state back into a consistent state, i.e. restarting the
>> > job. Is there any way of measuring the amount of time taken from the
>> > point when the failure occurred till the point when the system catches
>> > up to the last message that was processed before the outage?
>> >
>> > Thank you very much in advance!
>> >
>> > Regards,
>> > Morgan.
>>
>>
>
