Can you try starting from the savepoint, but telling Kafka to start from the latest offset?
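A minimal sketch of what that could look like with the Kafka 0.10 connector (the broker address, group id, and topic name are placeholders, not from this thread; whether this setting actually overrides the offsets stored in the savepoint on 1.3.1 is exactly the open question below):

    import java.util.Properties;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class StartFromLatestSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka:9092");   // placeholder broker
            props.setProperty("group.id", "my-consumer-group");     // placeholder group id

            FlinkKafkaConsumer010<String> consumer =
                    new FlinkKafkaConsumer010<>("events", new SimpleStringSchema(), props);
            // Ask the consumer to ignore committed offsets and read from the latest offset.
            consumer.setStartFromLatest();

            env.addSource(consumer).print();
            env.execute("start-from-latest sketch");
        }
    }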
(@gordon: Is that possible in Flink 1.3.1 or only in 1.4-SNAPSHOT?)

On Fri, Jul 14, 2017 at 11:18 AM, Kien Truong <duckientru...@gmail.com> wrote:
> Hi,
>
> Sorry for the version typo, I'm running 1.3.1. I did not test with 1.2.x.
>
> The job runs fine with almost zero back-pressure if it's started from
> scratch or if I reuse the Kafka consumer group id without specifying the
> savepoint.
>
> Best regards,
> Kien
>
> On Jul 14, 2017, at 15:59, Stephan Ewen <se...@apache.org> wrote:
>>
>> Hi!
>>
>> Flink 1.3.2 does not yet exist. Do you mean 1.3.1 or the latest master?
>>
>> Can you tell us whether this occurs only in 1.3.x and worked well in 1.2.x?
>> If you just keep the job running without savepoint/restore, do you not
>> get into backpressure situations?
>>
>> Thanks,
>> Stephan
>>
>> On Fri, Jul 14, 2017 at 1:15 AM, Kien Truong <duckientru...@gmail.com> wrote:
>>
>>> Hi Fabian,
>>>
>>> This happens to me even when the restore is immediate, so there's not
>>> much data in Kafka to catch up on (5 minutes at most).
>>>
>>> Regards,
>>> Kien
>>>
>>> On Jul 13, 2017, at 23:40, Fabian Hueske <fhue...@gmail.com> wrote:
>>>>
>>>> I would guess that this is quite usual because the job has to
>>>> "catch up" on work.
>>>> For example, if you took a savepoint two days ago and restore the job
>>>> today, the input data of the last two days has been written to Kafka
>>>> (assuming Kafka as the source) and needs to be processed.
>>>> The job will now read as fast as possible from Kafka to catch up to the
>>>> present. This means the data is ingested much faster (as fast as Kafka
>>>> can read and ship it) than during regular processing (as fast as your
>>>> sources produce it).
>>>> The processing speed is bound by your Flink job, which means there will
>>>> be backpressure.
>>>>
>>>> Once the job has caught up, the backpressure should disappear.
>>>>
>>>> Best, Fabian
>>>>
>>>> 2017-07-13 15:48 GMT+02:00 Kien Truong <duckientru...@gmail.com>:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have one job where back-pressure is significantly higher after
>>>>> resuming from a savepoint.
>>>>>
>>>>> Because that job makes heavy use of stateful functions with
>>>>> RocksDBStateBackend, I suspect that this is the cause of the
>>>>> performance degradation.
>>>>>
>>>>> Has anyone encountered similar issues, or does anyone have tips
>>>>> for debugging?
>>>>>
>>>>> I'm using Flink 1.3.2 with YARN in detached mode.
>>>>>
>>>>> Regards,
>>>>> Kien
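Since the job makes heavy use of stateful functions with RocksDBStateBackend, the backend configuration itself is also worth double-checking. Below is a minimal sketch, not taken from the thread, of enabling incremental checkpoints (available since 1.3) and one of the predefined RocksDB option profiles; the checkpoint URI and the chosen profile are assumptions and should be adapted to the actual setup:

    import org.apache.flink.contrib.streaming.state.PredefinedOptions;
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RocksDbBackendSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Placeholder checkpoint URI; the second argument enables incremental checkpoints.
            RocksDBStateBackend backend =
                    new RocksDBStateBackend("hdfs:///flink/checkpoints", true);
            // Tune RocksDB for spinning disks with plenty of memory; pick the profile
            // that matches the actual hardware.
            backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);
            env.setStateBackend(backend);

            // Placeholder pipeline so the sketch runs; the real job builds its stateful operators here.
            env.fromElements(1, 2, 3).print();
            env.execute("rocksdb backend sketch");
        }
    }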