Hi Chengzhi,
if you emit a watermark even though there is still data with a lower
timestamp, you generate "late data" that either needs to be processed in
a separate branch of your pipeline (see sideOutputLateData() [1]) or
should force your existing operators to update their previously emitted
results. The latter means holding state, i.e. the contents of your
windows, for longer (see allowedLateness() [1]).
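A rough sketch of both options (MyEvent, its fields, and the window
sizes are placeholders, not from your pipeline):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class MyEvent(key: String, timestamp: Long, value: Long)

// tag for routing records that arrive after the allowed lateness
val lateTag = OutputTag[MyEvent]("late-data")

// assumes `stream` already has timestamps and watermarks assigned
def buildWindows(stream: DataStream[MyEvent])
    : (DataStream[MyEvent], DataStream[MyEvent]) = {
  val summed = stream
    .keyBy(_.key)
    .window(TumblingEventTimeWindows.of(Time.minutes(10)))
    // keep window contents around so late records can update
    // previously emitted results
    .allowedLateness(Time.hours(1))
    // records later than even that go to a side output
    .sideOutputLateData(lateTag)
    .reduce((a, b) => a.copy(value = a.value + b.value))
  // second element is the late branch, to be handled separately
  (summed, summed.getSideOutput(lateTag))
}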
I think that, in general, a processing-time watermark strategy might
not be suitable for reprocessing. Either parameterize your watermark
generator so that you can pass information (such as a fixed start
timestamp) through job parameters, or use another strategy such as
BoundedOutOfOrdernessTimestampExtractor [2] together with sinks that
allow idempotent updates.
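A sketch of the second option, reusing MyEvent from above. The
extractor emits watermarks that trail the highest event timestamp seen
so far by a fixed bound, independent of processing time, so downtime
does not cause records to be dropped as late:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

class MyExtractor
    extends BoundedOutOfOrdernessTimestampExtractor[MyEvent](Time.hours(3)) {
  // event time comes from the record itself
  override def extractTimestamp(element: MyEvent): Long = element.timestamp
}

def withEventTime(stream: DataStream[MyEvent]): DataStream[MyEvent] =
  stream.assignTimestampsAndWatermarks(new MyExtractor)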
I hope this helps.
Regards,
Timo
[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/operators/windows.html#windows
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/event_timestamp_extractors.html
On 02.04.18 at 23:51, Chengzhi Zhao wrote:
Hello, Flink community,
I am using a periodic watermark and extracting the event time from
each record in files from S3. I am using the `TimeLagWatermarkGenerator`
as mentioned in the Flink documentation.
Currently, a new watermark is generated from processing time minus a
fixed lag:

override def getCurrentWatermark: Watermark = {
  // watermark = current processing time minus a fixed maximum lag
  new Watermark(System.currentTimeMillis() - maxTimeLag)
}
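For reference, the surrounding assigner follows the
`TimeLagWatermarkGenerator` example from the docs and looks roughly
like this (MyEvent stands in for my actual record type):

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

case class MyEvent(key: String, timestamp: Long)

class TimeLagWatermarkGenerator extends AssignerWithPeriodicWatermarks[MyEvent] {
  // fixed lag behind processing time
  val maxTimeLag = 3 * 60 * 60 * 1000L // 3 hours in milliseconds

  // event time is taken from a field of the record
  override def extractTimestamp(element: MyEvent, previousElementTimestamp: Long): Long =
    element.timestamp

  override def getCurrentWatermark: Watermark =
    new Watermark(System.currentTimeMillis() - maxTimeLag)
}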
This works fine as long as the process is running. However, in case of
failures, say some bad data arrives or an out-of-memory error occurs,
I need to stop the process, and it can take me a while to get it back
up. For example, maxTimeLag = 3 hours, but it took me 12 hours to
notice and fix the problem.
My question is: since I am using processing time as part of the
watermark, when Flink resumes from a failure, might some records be
ignored (treated as late) by the watermark? And what is the best
practice to catch up and continue without losing data? Thanks!
Best,
Chengzhi