Ty,

The usual approach is to run a separate instance of the app to re-ingest the
historical data while another instance keeps processing live data. That way
the backfill job won't be confused by events with recent timestamps -- it
only ever sees the historical data. But you will need either to provide the
data roughly sorted by timestamp, or to execute the backfill job in batch
execution mode.
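
For example, the batch-mode variant could look roughly like the sketch below
(assuming Flink 1.14+ with the KafkaSource connector; the broker address,
topic and group id are placeholders, and the parsing / windowing / aggregation
would be whatever your live job already does):

    import org.apache.flink.api.common.RuntimeExecutionMode;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class BackfillJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // In BATCH mode Flink ignores per-record watermarks and emits a single
            // final watermark at the end of input, so every event-time window fires
            // once all historic records have been read, regardless of their order.
            env.setRuntimeMode(RuntimeExecutionMode.BATCH);

            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("kafka:9092")      // placeholder broker address
                    .setTopics("events-backfill")           // placeholder topic
                    .setGroupId("backfill-job")             // placeholder group id
                    .setStartingOffsets(OffsetsInitializer.earliest())
                    // BATCH mode requires bounded sources: stop at the offsets
                    // that are current when the job starts.
                    .setBounded(OffsetsInitializer.latest())
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            env.fromSource(source, WatermarkStrategy.noWatermarks(), "historic-events")
                    // ... same parsing / keying / windowing / aggregation as the live job ...
                    .print();   // placeholder sink

            env.execute("backfill");
        }
    }

The live instance keeps running unchanged in streaming mode; only the backfill
instance uses this bounded, batch-mode configuration.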

Regards,
David

On Tue, Apr 12, 2022 at 5:28 PM Ty Brooks <tbro...@sovrn.com> wrote:

> Hi devs,
>
> I’ve got a question about how to design a Flink app that can handle the
> reprocessing of historical data. As a quick background, I’m building an ETL
> / data aggregator application, and one of the requirements is that if in
> the future we want to extend the system by adding some new metrics or
> dimensions, we’ll have the option to reprocess historical data in order to
> generate backfill (assuming the historical data already has the fields
> we’ll be adding to our aggregations, of course).
>
> The data source for my application is Kafka, so my naive approach here
> would just be to re-ingest the older events into Kafka and let the stream
> processor handle the rest. But if I’m understanding Flink’s windowing and
> watermarks correctly for an event-time based app, there doesn’t really seem
> to be a way of revisiting older data. If you’re using event-time windowing,
> then any window for which (window end + allowed lateness + watermark
> out-of-orderness) <= max_timestamp is permanently closed, since
> max_timestamp is only ever increasing.
>
> And that all does make sense, but I’m not sure where that leaves me for my
> requirement to support backfill. Is the alternative to just recreate the
> app every time and try to ensure that all the historical data gets
> processed before any new, live data?
>
> Best,
> Ty
>
