Hi Jared,

I think both approaches should work. A source that integrates the finite batch input and the stream would probably be more convenient to use. As you said, the challenge is identifying the exact point at which to switch from one input to the other.
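Just to sketch the composite-source idea (this is untested and makes a few assumptions: readHistoricalRecords() is a hypothetical stand-in for however the bounded historical range is loaded, and the plain Kafka client is used to keep the example small, which gives up the offset checkpointing you would get from Flink's Kafka connector):

import java.time.Duration;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Emits a bounded historical data set first, then keeps consuming from a
// live Kafka topic. Records are emitted under the checkpoint lock so that
// checkpoints see a consistent position.
public class HistoricalThenKafkaSource implements SourceFunction<String> {

    private final String topic;
    private final Properties kafkaProps;
    private volatile boolean running = true;

    public HistoricalThenKafkaSource(String topic, Properties kafkaProps) {
        this.topic = topic;
        this.kafkaProps = kafkaProps;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        // Phase 1: replay the finite historical input.
        for (String record : readHistoricalRecords()) {
            if (!running) {
                return;
            }
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(record);
            }
        }

        // Phase 2: the bounded input is exhausted; switch to the live topic.
        // poll(Duration) requires kafka-clients 2.0+.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(kafkaProps)) {
            consumer.subscribe(Collections.singletonList(topic));
            while (running) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> r : records) {
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect(r.value());
                    }
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    // Hypothetical helper: load the bounded historical range, ideally
    // already sorted by event time.
    private List<String> readHistoricalRecords() {
        return Collections.emptyList(); // placeholder
    }
}

Once the historical input is exhausted, run() simply falls through to the Kafka loop, so the switch-over point is wherever the bounded input ends.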
One thing to consider when reading finite batch data into a streaming job is to ensure that the data is read in event-time order and that all parallel sources make approximately the same progress. Otherwise, data might be dropped as late or Flink has to buffer a lot of data. AFAIK, there is not much tooling available for these use cases yet (see the watermarking sketch after the quoted message below).

Best,
Fabian

2017-01-18 18:35 GMT+01:00 Jared Stehler <jared.steh...@intellifylearning.com>:

> I have a use case where I need to start a stream replaying historical
> data and then have it continue processing on a live Kafka source, and am
> looking for guidance / best practices for implementation.
>
> Basically, I want to start up a new “version” of the stream job and have
> it process each element from a specified range of historical data before
> continuing to process records from the live source. I'd then let the job
> catch up to current time, atomically switch to it as the “live/current”
> job, and then shut down the previously running one. The issue I'm
> struggling with is the switch-over from the historical source to the live
> source. Is there a way to build a composite stream source which would emit
> records from a bounded data set before consuming from a Kafka topic, for
> example? Or would I be better off stopping the job once it's read through
> the historical set, switching its source to the live topic, and
> re-starting it? Some of our jobs rely on rolling fold state, so I think I
> need to resume from the savepoint of the historical processing.
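P.S. On the event-time point above: one way to keep out-of-order historical records from being dropped as late is to assign timestamps and watermarks with a bounded out-of-orderness extractor. A rough sketch (MyEvent and getEventTime() are hypothetical placeholders for your record type, and the 10-second bound is just an example to tune):

import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

// Hypothetical record type; substitute the job's actual event class.
class MyEvent {
    long eventTime; // epoch millis at which the event occurred

    long getEventTime() {
        return eventTime;
    }
}

// Assigns event-time timestamps and emits watermarks that trail the
// highest timestamp seen so far by a fixed bound, so moderately
// out-of-order records are not dropped as late.
class ReplayTimestampExtractor extends BoundedOutOfOrdernessTimestampExtractor<MyEvent> {

    ReplayTimestampExtractor() {
        super(Time.seconds(10)); // tolerated out-of-orderness; tune to the data
    }

    @Override
    public long extractTimestamp(MyEvent event) {
        return event.getEventTime();
    }
}

You would wire it in with stream.assignTimestampsAndWatermarks(new ReplayTimestampExtractor()) on both the historical and the live input.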