Hi Soumya,

Operator state is partitioned across JVMs and not replicated. However, it is checkpointed (e.g., to HDFS) at regular intervals to guarantee fault tolerance with exactly-once semantics. In case of a failure, all operator states are recovered from the latest checkpoint.
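To make the checkpoint/recover cycle concrete, here is a minimal, self-contained sketch (plain Java, not Flink's actual API; all class and method names are illustrative): an operator keeps its partitioned state in a local map, a checkpoint takes a consistent copy of it (in Flink this copy would go to durable storage such as HDFS), and recovery resets the state to that copy.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of checkpointed operator state (not Flink's API).
class CountingOperator {
    private Map<String, Long> state = new HashMap<>();

    // Process one element: update local, non-replicated state.
    void process(String key) {
        state.merge(key, 1L, Long::sum);
    }

    // Take a snapshot of the state; in Flink this would be written
    // to a durable store like HDFS as part of a checkpoint.
    Map<String, Long> checkpoint() {
        return new HashMap<>(state);
    }

    // On failure, the state is reset to the last completed checkpoint,
    // which is what gives exactly-once semantics for state updates.
    void restore(Map<String, Long> snapshot) {
        state = new HashMap<>(snapshot);
    }

    long count(String key) {
        return state.getOrDefault(key, 0L);
    }
}
```

Elements processed after the snapshot are replayed from the source after recovery, so the state never reflects an element more or less than once.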
The state backend is responsible for local state management, i.e., for how the state that is queried and updated by an operator is stored and accessed. Using a state backend such as RocksDB, which goes to disk for every request, of course has higher latency than doing lookups/updates in a Java HashMap. What you gain is that your local state can grow much larger than the JVM's memory. Since the backend is pluggable, it would be possible to implement a backend with disk persistence and an in-memory cache.

Regarding the use case of firing an event, you can implement it with a custom stream operator in Flink.

Best, Fabian

2016-02-04 10:15 GMT+01:00 Soumya Simanta <soumya.sima...@gmail.com>:
> Fabian,
>
> Thanks a lot for your response. Really appreciated. I have some additional
> questions (please see inline).
>
> On Wed, Feb 3, 2016 at 2:42 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi,
>>
>> 1) At the moment, state is kept on the JVM heap in a regular HashMap.
>>
> Is this state replicated across JVMs in a cluster setup? This also implies
> that on a single-node Flink setup there is a chance of failure.
>
>> However, we added an interface for pluggable state backends. State
>> backends store the operator state (Flink's built-in window operators are
>> based on operator state as well). A pull request to add a RocksDB backend
>> (going to disk) will be merged soon [1]. Another backend using Flink's
>> managed memory is planned.
>>
> Will having state persistence have any impact on performance?
>
>> 2) I am not sure what you mean by trigger / schedule a delayed event, but
>> I have a few pointers that might be helpful:
>> - Flink can handle late-arriving events. Check the event-time feature [2].
>> - Flink's window triggers can be used to schedule window computations [3].
>> - You can implement a custom source function that emits / triggers events.
>
> My specific use case is firing an event (after a certain amount of time)
> based on some data computation that I'm performing on another stream.
>
>> Best,
>> Fabian
>>
>> [1] https://github.com/apache/flink/pull/1562
>> [2] http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/
>> [3] http://flink.apache.org/news/2015/12/04/Introducing-windows.html
>>
>> 2016-02-03 5:39 GMT+01:00 Soumya Simanta <soumya.sima...@gmail.com>:
>>
>>> I'm getting started with Flink and had a very fundamental doubt.
>>>
>>> 1) Where does Flink capture/store intermediate state?
>>>
>>> For example, two streams of data have a common key. The streams can lag
>>> in time (seconds, hours, or even days). My understanding is that Flink
>>> somehow needs to store the data from the first (faster) stream so that it
>>> can match and join the data with the second (slower) stream.
>>>
>>> 2) Is there a mechanism to trigger/schedule a delayed event in Flink?
>>>
>>> Thanks
>>> -Soumya
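The "pluggable backend" idea above can be sketched in a few lines of plain Java. This is a hypothetical interface, not Flink's actual one: the operator only sees a narrow key/value API, so a heap-based backend can be swapped for a disk-backed one, including the disk-persistence-plus-in-memory-cache combination mentioned in the reply. The "disk" here is simulated with a second map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical state-backend interface (illustrative, not Flink's).
interface StateBackendSketch {
    void put(String key, String value);
    Optional<String> get(String key);
}

// Fast path: plain HashMap lookups, bounded by JVM heap size.
class HeapBackend implements StateBackendSketch {
    private final Map<String, String> map = new HashMap<>();
    public void put(String key, String value) { map.put(key, value); }
    public Optional<String> get(String key) { return Optional.ofNullable(map.get(key)); }
}

// "Persistent" backend with a write-through cache in front of the
// slow store (a map stands in for RocksDB/disk in this sketch).
class CachedDiskBackend implements StateBackendSketch {
    private final Map<String, String> disk = new HashMap<>();
    private final Map<String, String> cache = new HashMap<>();

    public void put(String key, String value) {
        cache.put(key, value);
        disk.put(key, value); // write-through to the durable store
    }

    public Optional<String> get(String key) {
        String v = cache.get(key);
        if (v == null) {
            v = disk.get(key);              // cache miss: hit the slow store
            if (v != null) cache.put(key, v);
        }
        return Optional.ofNullable(v);
    }
}
```

Both backends behave identically from the operator's point of view; only latency and capacity differ, which is the trade-off described above.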
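For the specific use case in the thread (firing an event a certain amount of time after some computation on another stream), the custom-operator approach can be sketched like this. Again a toy model rather than Flink's API, with illustrative names: each element registers a timer at (element time + delay), and when the watermark passes a timer's timestamp the delayed event fires. Using event time and watermarks keeps the logic deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of a custom operator that fires a delayed event (not Flink's API).
class DelayedEventOperator {
    private final long delayMs;
    // Pending timer timestamps, ordered so the earliest fires first.
    private final PriorityQueue<Long> timers = new PriorityQueue<>();
    private final List<Long> firedAt = new ArrayList<>();

    DelayedEventOperator(long delayMs) { this.delayMs = delayMs; }

    // Called for each result element of the other stream's computation:
    // schedule an event at elementTime + delay.
    void onElement(long eventTimeMs) {
        timers.add(eventTimeMs + delayMs);
    }

    // Called when the watermark advances: fire every timer that is due.
    void onWatermark(long watermarkMs) {
        while (!timers.isEmpty() && timers.peek() <= watermarkMs) {
            firedAt.add(timers.poll()); // emit the delayed event downstream
        }
    }

    List<Long> fired() { return firedAt; }
}
```

The important property is that nothing fires until the watermark guarantees no earlier data can still arrive, so the delayed event is emitted exactly once per triggering element.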