Re: Question: Determining Total Recovery Time

2020-02-26 Thread Zhijiang
Hi Morgan, Your idea is great and i am also interested in it. I think it is valuable for some users to estimate the maximum throughput capacity based on certain metrics or models. But I am not quite sure whether it is feasible to do that based on existing metrics, at-least exist some limitation

Re: Question: Determining Total Recovery Time

2020-02-26 Thread Arvid Heise
Hi Morgan, doing it in a very general way sure is challenging. I'd assume that your idea of using the buffer usage has some shortcomings (which I don't know), but I also think it's a good starting point. Have you checked the PoolUsage metrics? [1] You could use them to detect the bottleneck and

Re: Question: Determining Total Recovery Time

2020-02-21 Thread Arvid Heise
Hi Morgan, sorry for the late reply. In general, that should work. You need to ensure that the same task is processing the same record though. Local copy needs to be state or else the last message would be lost upon restart. Performance will take a hit but if that is significant depends on the re

Re: Question: Determining Total Recovery Time

2020-02-11 Thread Morgan Geldenhuys
Thanks for the advice, i will look into it. Had a quick think about another simple solution but we would need a hook into the checkpoint process from the task/operator perspective, which I haven't looked into yet. It would work like this: - The sink operators (?) would keep a local copy of th

Re: Question: Determining Total Recovery Time

2020-02-10 Thread Arvid Heise
Hi Morgan, as Timo pointed out, there is no general solution, but in your setting, you could look at the consumer lag of the input topic after a crash. Lag would spike until all tasks restarted and reprocessing begins. Offsets are only committed on checkpoints though by default. Best, Arvid On

Re: Question: Determining Total Recovery Time

2020-02-04 Thread Timo Walther
Hi Morgan, as far as I know this is not possible mostly because measuring "till the point when the system catches up to the last message" is very pipeline/connector dependent. Some sources might need to read from the very beginning, some just continue from the latest checkpointed offset. Mea