Hi,

I see this coming up more and more often these days. For now, the solution of 
taking a savepoint and switching sources should work (there is a sketch below), 
but I've had it in my head for a while to add functionality for bootstrapping 
inputs to the API. An operator would first read from the bootstrap stream 
(which is finite) before switching over to reading from the other streams. The 
current blocker is the network stack: this behaviour can lead to distributed 
deadlocks, because you back-pressure the streams that you are not yet reading 
from.
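
Until that exists, the savepoint hand-off can look roughly like the sketch 
below. This is only a sketch, not tested code: the bucket, topic, and class 
names are made up, and the counting aggregate just stands in for whatever 
state you actually want to bootstrap. The important part is that both runs 
build the same pipeline and give the stateful operator the same uid, so the 
savepoint taken after the bootstrap run maps onto the streaming run.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

import java.util.Properties;

public class BootstrapThenStream {

    public static void main(String[] args) throws Exception {
        ParameterTool params = ParameterTool.fromArgs(args);
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> events;
        if (params.has("bootstrap")) {
            // Finite source for the first run: the historic data in S3.
            events = env.readTextFile("s3://my-bucket/historic/");
        } else {
            // Live source for the second run, after restoring the savepoint.
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka:9092");
            events = env.addSource(new FlinkKafkaConsumer011<>(
                    "events", new SimpleStringSchema(), props));
        }

        events.keyBy(e -> e)
              .map(new CountingAggregate())
              // The stable uid is what lets the savepoint taken in the
              // bootstrap run map onto the same operator in the streaming run.
              .uid("aggregate")
              .print();

        env.execute("bootstrap-then-stream");
    }

    /** Toy stateful aggregate: counts events per key in keyed state. */
    public static class CountingAggregate extends RichMapFunction<String, Long> {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public Long map(String event) throws Exception {
            Long current = count.value();
            long next = (current == null ? 0L : current) + 1;
            count.update(next);
            return next;
        }
    }
}

The hand-off between the two runs would then be (job ID and savepoint path 
are placeholders):

  # 1. first run, reading the finite historic source
  flink run bootstrap-then-stream.jar --bootstrap
  # 2. once the historic data has been processed, take a savepoint and cancel;
  #    note the job must still be running, a finished job cannot be savepointed
  flink cancel -s s3://my-bucket/savepoints/ <jobId>
  # 3. resubmit against the live source, restoring from the savepoint;
  #    -n (--allowNonRestoredState) skips the state of the removed file source
  flink run -s <savepointPath> -n bootstrap-then-stream.jar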

Best,
Aljoscha

> On 25. Jan 2018, at 23:58, Chen Qin <qinnc...@gmail.com> wrote:
> 
> Hi Gregory,
> 
> I ran into a similar issue when dealing with historical data. We chose a 
> Lambda architecture and worked out a use-case-specific hand-off protocol. 
> Unless the storage side can replay logs within a time range, streaming 
> application authors still need to do extra work to implement a batch layer 
> <https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/batch/examples.html>.
> 
> What we learned is that backfilling historical log streams might be too 
> expensive or inefficient for a streaming framework to handle, since streaming 
> frameworks are optimized for unbounded streams whose data is not known in 
> advance.
> 
> Hope it helps.
> 
> Chen 
> 
> On Thu, Jan 25, 2018 at 12:49 PM, Gregory Fee <g...@lyft.com> wrote:
> Hi group, I want to bootstrap some aggregates based on historic data in S3
> and then keep them updated from a stream. To do this I was thinking of
> processing all of the historic data, taking a savepoint, and then restoring
> my program from that savepoint but with a stream source instead. Does this
> seem like a reasonable approach, or is there a better way to achieve this?
> There does not appear to be a straightforward way of doing it the way I was
> thinking, so any advice would be appreciated.
> 
> -- 
> Gregory Fee
> Engineer
> 425.830.4734 <tel:+14258304734>
>  <http://www.lyft.com/>
