Hi,

This might be useful to you:
https://www.mapr.com/blog/savepoints-apache-flink-stream-processing-whiteboard-walkthrough

Thanks,
Jagat Singh

On 29 July 2016 at 20:59, Aljoscha Krettek <aljos...@apache.org> wrote:
> Hi,
> I think the exact thing you're trying to do is not possible right now, but
> I know of a workaround that some people have used.
>
> For "warming up" the state from the historical data, you would run your
> regular Flink job but replace the normal Kafka source with a source that
> reads from the historical data. Then, when all the data has been read, you
> perform a savepoint and stop the job. Then you start the same job again,
> but with a Kafka source that reads from your regular input stream. This
> way you restore with the warmed-up state.
>
> Now, for the proper way of doing this, I actually have a design doc and a
> prototype that could be used to implement it:
> https://docs.google.com/document/d/1ZFzL_0xGuUEnBsFyEiHwWcmCcjhd9ArWsmhrgnt05RI/edit.
> It suggests adding an N-input operator. The interesting thing about my
> prototype implementation is that the operator first reads from the inputs
> that are known to be bounded, in series, and then proceeds to read the
> streaming inputs. I think that would be the basis for a solution to your
> problem, and I'm sure other people are facing it as well.
>
> Cheers,
> Aljoscha
>
> On Fri, 29 Jul 2016 at 11:27 Josh <jof...@gmail.com> wrote:
>
>> Hi Jason,
>>
>> Thanks for the reply - I didn't quite understand all of it though!
>>
>> > it sends a request to the historical flink job for the old data
>> How do you send a request from one Flink job to another?
>>
>> > It continues storing the live events until all the events from the
>> > historical job have been processed, then it processes the stored
>> > events, and finally starts processing the live stream again.
>> How do you handle the switchover between the live stream and the
>> historical stream? Do you somehow block the live stream source and detect
>> when the historical data source is no longer emitting new elements?
>>
>> > So in your case it looks like what you could do is send a request to
>> > the "historical" job whenever you get an item that you don't yet have
>> > the current state of.
>> In my case I would want the Flink state to always contain the latest
>> state of every item (except when the job first starts and there's a
>> period of time where it's rebuilding its state from the Kafka log). Since
>> I would have everything needed to rebuild the state persisted in a Kafka
>> topic, I don't think I would need a second Flink job for this?
>>
>> Thanks,
>> Josh
>>
>> On Thu, Jul 28, 2016 at 6:57 PM, Jason Brelloch <jb.bc....@gmail.com> wrote:
>>
>>> Hey Josh,
>>>
>>> The way we replay historical data is we have a second Flink job that
>>> listens to the same live stream and stores every single event in Google
>>> Cloud Storage.
>>>
>>> When the main Flink job that is processing the live stream gets a
>>> request for a specific data set that it has not been processing yet, it
>>> sends a request to the historical Flink job for the old data. The live
>>> job then starts storing relevant events from the live stream in state.
>>> It continues storing the live events until all the events from the
>>> historical job have been processed, then it processes the stored events,
>>> and finally starts processing the live stream again.
>>>
>>> As long as it's properly keyed (we key on the specific data set), it
>>> doesn't block anything, keeps everything ordered, and eventually catches
>>> up.
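For illustration, a minimal sketch of that buffering step, assuming the replayed
historical stream and the live stream are connected and keyed by the data-set id.
The Event type and its isEndOfHistory() marker are assumptions made up for this
sketch, not code from Jason's job:

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

// Buffers live events per key until the historical replay for that key is done,
// then flushes the buffer and passes live events straight through.
// "Event" stands in for whatever per-data-set event type the job uses.
public class ReplayThenLiveFunction
        extends RichCoFlatMapFunction<Event, Event, Event> {

    private transient ValueState<Boolean> historyDone; // true once replay has finished
    private transient ListState<Event> liveBuffer;     // live events held back during replay

    @Override
    public void open(Configuration parameters) {
        historyDone = getRuntimeContext().getState(
                new ValueStateDescriptor<>("historyDone", Boolean.class));
        liveBuffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("liveBuffer", Event.class));
    }

    // Input 1: replayed historical events for this key.
    @Override
    public void flatMap1(Event historical, Collector<Event> out) throws Exception {
        if (historical.isEndOfHistory()) {           // assumed end-of-replay marker
            for (Event buffered : liveBuffer.get()) {
                out.collect(buffered);               // flush what arrived during the replay
            }
            liveBuffer.clear();                      // free the buffer once caught up
            historyDone.update(true);
        } else {
            out.collect(historical);
        }
    }

    // Input 2: the live stream.
    @Override
    public void flatMap2(Event live, Collector<Event> out) throws Exception {
        if (Boolean.TRUE.equals(historyDone.value())) {
            out.collect(live);                       // replay finished: process directly
        } else {
            liveBuffer.add(live);                    // still replaying: hold the event back
        }
    }
}

Wired up with something like
live.connect(historical).keyBy(keySelector, keySelector).flatMap(new ReplayThenLiveFunction()),
where both key selectors extract the data-set id, the keyed state keeps the
per-data-set buffers independent, so one data set catching up does not block the
others.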
>>> It also allows us to completely blow away state and rebuild it from
>>> scratch.
>>>
>>> So in your case it looks like what you could do is send a request to the
>>> "historical" job whenever you get an item that you don't yet have the
>>> current state of.
>>>
>>> The potential problems you may have are that it may not be possible to
>>> store every single historical event, and that you need to make sure
>>> there is enough memory to handle the ever-increasing state size while
>>> the historical events are being replayed (and make sure to clear the
>>> state when it is done).
>>>
>>> It's a little complicated, and pretty expensive, but it works. Let me
>>> know if something doesn't make sense.
>>>
>>> On Thu, Jul 28, 2016 at 1:14 PM, Josh <jof...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I was wondering what approaches people usually take to reprocessing
>>>> data with Flink - specifically the case where you want to upgrade a
>>>> Flink job and make it reprocess historical data before continuing to
>>>> process the live stream.
>>>>
>>>> I'm wondering if we can do something similar to the 'simple rewind' or
>>>> 'parallel rewind' which Samza uses to solve this problem, discussed here:
>>>> https://samza.apache.org/learn/documentation/0.10/jobs/reprocessing.html
>>>>
>>>> Having used Flink over the past couple of months, the main issue I've
>>>> had involves Flink's internal state - from my experience it seems easy
>>>> to break the state when upgrading a job or changing the parallelism of
>>>> operators, and there's no easy way to view or access an internal
>>>> key-value state from outside Flink.
>>>>
>>>> For an example of what I mean, consider a Flink job which consumes a
>>>> stream of 'updates' to items and maintains a key-value store of items
>>>> within Flink's internal state (e.g. in RocksDB). The job also writes
>>>> the updated items to a Kafka topic:
>>>>
>>>> http://oi64.tinypic.com/34q5opf.jpg
>>>>
>>>> My worry is that the state in RocksDB could be lost or become
>>>> incompatible with an updated version of the job. If that happens, we
>>>> need to be able to rebuild Flink's internal key-value store in RocksDB.
>>>> So I'd like to be able to do something like this (which I believe is
>>>> the Samza solution):
>>>>
>>>> http://oi67.tinypic.com/219ri95.jpg
>>>>
>>>> Has anyone done something like this already with Flink? If so, are
>>>> there any examples of how to do this replay and switchover (rebuild
>>>> state by consuming from a historical log, then switch over to
>>>> processing the live stream)?
>>>>
>>>> Thanks for any insights,
>>>> Josh
>>>
>>> --
>>> Jason Brelloch | Product Developer
>>> 3405 Piedmont Rd. NE, Suite 325, Atlanta, GA 30305
>>> http://www.bettercloud.com/
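For reference, a rough sketch of the "warm up, savepoint, switch source"
workaround Aljoscha describes, applied to the kind of job Josh outlines (latest
item state kept in keyed state, updated items written back to Kafka). The topic
names, the --mode flag, and the class names are assumptions made up for this
sketch; the Kafka 0.9 connector classes are used because they match the Flink
releases current at the time of the thread:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

import java.util.Properties;

// The same job graph is submitted twice: once in "historical" mode against an
// archive topic to warm up the state, then (after a savepoint) in "live" mode
// against the regular topic.
public class ItemStateJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Assumed flag: --mode historical for the warm-up run, --mode live afterwards.
        boolean historical = "historical".equals(ParameterTool.fromArgs(args).get("mode"));
        String topic = historical ? "item-updates-archive" : "item-updates"; // assumed topics

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder address
        props.setProperty("group.id", "item-state-job");

        DataStream<String> updates = env.addSource(
                new FlinkKafkaConsumer09<>(topic, new SimpleStringSchema(), props));

        updates
                .keyBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String update) {
                        return update; // in a real job: extract the item id from the update
                    }
                })
                .map(new StatefulItemUpdater())
                .addSink(new FlinkKafkaProducer09<>("kafka:9092", "items", new SimpleStringSchema()));

        env.execute(historical ? "item-state-warmup" : "item-state-live");
    }

    // Keeps the latest state of each item in keyed state (RocksDB-backed if the
    // job is configured with the RocksDB state backend).
    public static class StatefulItemUpdater extends RichMapFunction<String, String> {

        private transient ValueState<String> latest;

        @Override
        public void open(Configuration parameters) {
            latest = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("latest-item", String.class));
        }

        @Override
        public String map(String update) throws Exception {
            latest.update(update); // in a real job: merge the update into the stored item
            return update;
        }
    }
}

The switchover itself then happens outside the job: run it with --mode
historical, take a savepoint once the archive topic is exhausted (bin/flink
savepoint <jobId>), cancel the job, and resubmit it from that savepoint with
--mode live (bin/flink run -s <savepointPath> ... --mode live), so the live run
starts with the warmed-up state.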