Hi,

This might be useful to you:
https://www.mapr.com/blog/savepoints-apache-flink-stream-processing-whiteboard-walkthrough

Thanks,
Jagat Singh

On 29 July 2016 at 20:59, Aljoscha Krettek <aljos...@apache.org> wrote:
> Hi,
> I think the exact thing you're trying to do is not possible right now, but
> I know of a workaround that some people have used.
>
> For "warming up" the state from the historical data, you would run your
> regular Flink job but replace the normal Kafka source with a source that
> reads from the historical data. Then, when all the data has been read, you
> perform a savepoint and stop the job. Then you start the same job again,
> but with a Kafka source that reads from your regular input stream. This
> way you restore with the warmed-up state.
>
> Now, for the proper way of doing this, I actually have a design doc and a
> prototype that could be used to implement it:
> https://docs.google.com/document/d/1ZFzL_0xGuUEnBsFyEiHwWcmCcjhd9ArWsmhrgnt05RI/edit.
> It suggests adding an N-input operator. The interesting thing about my
> prototype implementation is that the operator first reads from the inputs
> that are known to be bounded, in series, and then proceeds to read the
> streaming inputs. I think that would be the basis for a solution to your
> problem, and I'm sure other people are facing it as well.
>
> Cheers,
> Aljoscha
>
> On Fri, 29 Jul 2016 at 11:27 Josh <jof...@gmail.com> wrote:
>
>> Hi Jason,
>>
>> Thanks for the reply - I didn't quite understand all of it though!
>>
>> > it sends a request to the historical flink job for the old data
>> How do you send a request from one Flink job to another?
>>
>> > It continues storing the live events until all the events from the
>> > historical job have been processed, then it processes the stored
>> > events, and finally starts processing the live stream again.
>> How do you handle the switchover between the live stream and the
>> historical stream? Do you somehow block the live stream source and detect
>> when the historical data source is no longer emitting new elements?
>>
>> > So in your case it looks like what you could do is send a request to
>> > the "historical" job whenever you get an item that you don't yet have
>> > the current state of.
>> In my case I would want the Flink state to always contain the latest
>> state of every item (except when the job first starts and there's a
>> period of time where it's rebuilding its state from the Kafka log). Since
>> I would have everything needed to rebuild the state persisted in a Kafka
>> topic, I don't think I would need a second Flink job for this?
>>
>> Thanks,
>> Josh
>>
>> On Thu, Jul 28, 2016 at 6:57 PM, Jason Brelloch <jb.bc....@gmail.com> wrote:
>>
>>> Hey Josh,
>>>
>>> The way we replay historical data is we have a second Flink job that
>>> listens to the same live stream and stores every single event in Google
>>> Cloud Storage.
>>>
>>> When the main Flink job that is processing the live stream gets a
>>> request for a specific data set that it has not been processing yet, it
>>> sends a request to the historical Flink job for the old data. The live
>>> job then starts storing relevant events from the live stream in state.
>>> It continues storing the live events until all the events from the
>>> historical job have been processed, then it processes the stored events,
>>> and finally starts processing the live stream again.
>>>
>>> As long as it's properly keyed (we key on the specific data set), it
>>> doesn't block anything, keeps everything ordered, and eventually catches
>>> up.
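For illustration, a minimal sketch of that buffering step, assuming the replayed
historical stream and the live stream are connected and keyed by the data-set id.
The Event type and its isEndOfHistory() marker are assumptions made up for this
sketch, not code from Jason's job:

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

// Buffers live events per key until the historical replay for that key is done,
// then flushes the buffer and passes live events straight through.
// "Event" stands in for whatever per-data-set event type the job uses.
public class ReplayThenLiveFunction
        extends RichCoFlatMapFunction<Event, Event, Event> {

    private transient ValueState<Boolean> historyDone; // true once replay has finished
    private transient ListState<Event> liveBuffer;     // live events held back during replay

    @Override
    public void open(Configuration parameters) {
        historyDone = getRuntimeContext().getState(
                new ValueStateDescriptor<>("historyDone", Boolean.class));
        liveBuffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("liveBuffer", Event.class));
    }

    // Input 1: replayed historical events for this key.
    @Override
    public void flatMap1(Event historical, Collector<Event> out) throws Exception {
        if (historical.isEndOfHistory()) {           // assumed end-of-replay marker
            for (Event buffered : liveBuffer.get()) {
                out.collect(buffered);               // flush what arrived during the replay
            }
            liveBuffer.clear();                      // free the buffer once caught up
            historyDone.update(true);
        } else {
            out.collect(historical);
        }
    }

    // Input 2: the live stream.
    @Override
    public void flatMap2(Event live, Collector<Event> out) throws Exception {
        if (Boolean.TRUE.equals(historyDone.value())) {
            out.collect(live);                       // replay finished: process directly
        } else {
            liveBuffer.add(live);                    // still replaying: hold the event back
        }
    }
}

Wired up with something like
live.connect(historical).keyBy(keySelector, keySelector).flatMap(new ReplayThenLiveFunction()),
where both key selectors extract the data-set id, the keyed state keeps the
per-data-set buffers independent, so one data set catching up does not block the
others.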
>>> It also allows us to completely blow away state and rebuild it from
>>> scratch.
>>>
>>> So in your case it looks like what you could do is send a request to the
>>> "historical" job whenever you get an item that you don't yet have the
>>> current state of.
>>>
>>> The potential problems you may have are that it may not be possible to
>>> store every single historical event, and that you need to make sure
>>> there is enough memory to handle the ever-increasing state size while
>>> the historical events are being replayed (and make sure to clear the
>>> state when it is done).
>>>
>>> It's a little complicated, and pretty expensive, but it works. Let me
>>> know if something doesn't make sense.
>>>
>>> On Thu, Jul 28, 2016 at 1:14 PM, Josh <jof...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I was wondering what approaches people usually take to reprocessing
>>>> data with Flink - specifically the case where you want to upgrade a
>>>> Flink job and make it reprocess historical data before continuing to
>>>> process the live stream.
>>>>
>>>> I'm wondering if we can do something similar to the 'simple rewind' or
>>>> 'parallel rewind' which Samza uses to solve this problem, discussed here:
>>>> https://samza.apache.org/learn/documentation/0.10/jobs/reprocessing.html
>>>>
>>>> Having used Flink over the past couple of months, the main issue I've
>>>> had involves Flink's internal state - from my experience it seems easy
>>>> to break the state when upgrading a job or changing the parallelism of
>>>> operators, and there's no easy way to view or access an internal
>>>> key-value state from outside Flink.
>>>>
>>>> For an example of what I mean, consider a Flink job which consumes a
>>>> stream of 'updates' to items and maintains a key-value store of items
>>>> within Flink's internal state (e.g. in RocksDB). The job also writes
>>>> the updated items to a Kafka topic:
>>>>
>>>> http://oi64.tinypic.com/34q5opf.jpg
>>>>
>>>> My worry is that the state in RocksDB could be lost or become
>>>> incompatible with an updated version of the job. If that happens, we
>>>> need to be able to rebuild Flink's internal key-value store in RocksDB.
>>>> So I'd like to be able to do something like this (which I believe is
>>>> the Samza solution):
>>>>
>>>> http://oi67.tinypic.com/219ri95.jpg
>>>>
>>>> Has anyone done something like this already with Flink? If so, are
>>>> there any examples of how to do this replay and switchover (rebuild
>>>> state by consuming from a historical log, then switch over to
>>>> processing the live stream)?
>>>>
>>>> Thanks for any insights,
>>>> Josh
>>>
>>> --
>>> Jason Brelloch | Product Developer
>>> 3405 Piedmont Rd. NE, Suite 325, Atlanta, GA 30305
>>> http://www.bettercloud.com/
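For reference, a rough sketch of the "warm up, savepoint, switch source"
workaround Aljoscha describes, applied to the kind of job Josh outlines (latest
item state kept in keyed state, updated items written back to Kafka). The topic
names, the --mode flag, and the class names are assumptions made up for this
sketch; the Kafka 0.9 connector classes are used because they match the Flink
releases current at the time of the thread:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

import java.util.Properties;

// The same job graph is submitted twice: once in "historical" mode against an
// archive topic to warm up the state, then (after a savepoint) in "live" mode
// against the regular topic.
public class ItemStateJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Assumed flag: --mode historical for the warm-up run, --mode live afterwards.
        boolean historical = "historical".equals(ParameterTool.fromArgs(args).get("mode"));
        String topic = historical ? "item-updates-archive" : "item-updates"; // assumed topics

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder address
        props.setProperty("group.id", "item-state-job");

        DataStream<String> updates = env.addSource(
                new FlinkKafkaConsumer09<>(topic, new SimpleStringSchema(), props));

        updates
                .keyBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String update) {
                        return update; // in a real job: extract the item id from the update
                    }
                })
                .map(new StatefulItemUpdater())
                .addSink(new FlinkKafkaProducer09<>("kafka:9092", "items", new SimpleStringSchema()));

        env.execute(historical ? "item-state-warmup" : "item-state-live");
    }

    // Keeps the latest state of each item in keyed state (RocksDB-backed if the
    // job is configured with the RocksDB state backend).
    public static class StatefulItemUpdater extends RichMapFunction<String, String> {

        private transient ValueState<String> latest;

        @Override
        public void open(Configuration parameters) {
            latest = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("latest-item", String.class));
        }

        @Override
        public String map(String update) throws Exception {
            latest.update(update); // in a real job: merge the update into the stored item
            return update;
        }
    }
}

The switchover itself then happens outside the job: run it with --mode
historical, take a savepoint once the archive topic is exhausted (bin/flink
savepoint <jobId>), cancel the job, and resubmit it from that savepoint with
--mode live (bin/flink run -s <savepointPath> ... --mode live), so the live run
starts with the warmed-up state.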